Identifying the Computing Resources Used by a Linux Job

When you submit a job to the SSCC's Slurm cluster, you must specify how many cores and how much memory it will use. Doing so accurately ensures your job has the resources it needs to run successfully without claiming resources it does not need and preventing others from using them. This article will show you how to identify what computing resources your job uses. Note that job lengths on Linstat are limited to 30 days; if you expect your job to take longer than that, we recommend running it on our Slurm cluster.

You may be able to predict how many cores and how much memory your job will use, but the way to be sure is to run a test job (or part of a job) on Linstat and use the tools below to see what it actually uses.

Predicting Cores Used

Most jobs fall into one of three categories:

  • Jobs that use a single core
  • Jobs that use a fixed number of cores (for example, the SSCC's Stata MP license limits it to 32 cores)
  • Jobs that use all the cores in the server

Some functions that take advantage of multiple cores ask you to specify the number of cores to use. Sometimes they refer to workers, threads, or processes instead; for this purpose those terms are equivalent. If your job can use all the cores in the server, let it, so it gets done as quickly as possible. (Be considerate of others by taking turns using a server rather than by sharing it.)
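
For example, many Python libraries ask for the number of worker processes explicitly. Here is a minimal sketch using Python's standard multiprocessing package (the simulate function and the numbers are placeholders, not anything specific to the SSCC):

import multiprocessing as mp

def simulate(seed):
    # Stand-in for one unit of work
    return sum(i * seed for i in range(100_000))

if __name__ == "__main__":
    # 'processes' sets the number of workers (cores) the job will use
    with mp.Pool(processes=32) as pool:
        results = pool.map(simulate, range(256))
    print(len(results))

Whatever number of workers you choose here is the number of cores your job will use, and thus the number you should reserve.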

However, do not assume that if a function does not ask how many cores to use it will not use multiple cores. Some functions automatically try to use all the cores in the server. Occasionally researchers get themselves into trouble by trying to parallelize functions that have already been parallelized for them.
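
NumPy is a common example: its linear algebra routines typically call a multithreaded BLAS library, so a matrix multiplication like the sketch below (the sizes are arbitrary) may use most of the cores in the server without ever being asked. Run it and watch top if you want to see this for yourself:

import numpy as np

x = np.random.rand(8000, 8000)
y = x @ x   # the underlying BLAS library usually runs this on many threads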

If you're not sure how many cores a function uses, run it and then monitor the job with top. You definitely don't want to claim all the cores in a server for a job that can only use one!

Predicting Memory Used

Most statistical software loads the data it will use into memory (SAS is a notable exception). The amount of memory needed by a data set is usually similar to or smaller than the amount of disk space it uses (in Windows, right-click on the file and choose Properties to see how big it is). Keep in mind what you will then do with the data: making a second copy of the data will double the memory used, for example, and combining data sets will often result in a data set that's bigger than the sum of the sizes of the original data sets.
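
In Python, for instance, you can compare the size of a file on disk with the memory the loaded data actually uses. The file name below is hypothetical, and pandas is just one way to make the comparison:

import os
import pandas as pd

# Size of the file on disk, in GB
print(os.path.getsize("mydata.csv") / 1e9, "GB on disk")

# Size of the data once loaded into memory, in GB
df = pd.read_csv("mydata.csv")
print(df.memory_usage(deep=True).sum() / 1e9, "GB in memory")

df2 = df.copy()   # a second full copy roughly doubles the memory used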

The amount of memory you need to tell Slurm to reserve is the highest amount your job ever uses. Consider adding a safety margin, perhaps 25% of the total (if a test run peaks at 40GB, for example, reserve 50GB), as your job will only be able to use the memory you reserve and will crash if it runs out.

If your job will use all the cores in a server, you can just tell it to use all the memory in the server too. No one else can use a server's memory if you're using all its cores.

Many tools report memory usage in bytes. A gigabyte or GB is defined as either 1 billion bytes or 2^30 bytes (1,073,741,824, which is close enough to 1 billion as not to matter for our purposes). The SSCC's Linux servers have hundreds of gigabytes of memory, so anything less than a gigabyte is trivial and can be rounded up to 1GB. If a tool reports memory in bytes and the number has fewer than ten digits, it's trivial. If the number has ten digits or more, you can ignore the last nine but round up (e.g. if you see 14,437,512,394 bytes think 15GB).
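
If you'd rather not count digits, a quick conversion like the following (plain Python, purely for illustration) does the rounding for you:

import math

bytes_used = 14_437_512_394                # value reported by a monitoring tool
print(math.ceil(bytes_used / 1e9), "GB")   # prints 15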

Monitoring Resource Use with top

The top command lists the top jobs on the server and the resources they are using. Monitoring top while your job is running is the easiest way of knowing what resources it uses, but keep in mind that your job's needs may change as it runs. Here is an example of what you'll see if you run top (usernames have been changed for privacy, and the list has been truncated):

    PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 911054 user1    20   0   22.6g  21.9g   4384 R  4288  2.9  81566:20 a.out
 275522 user2    20   0 3738648 999460  55320 R 314.5  0.1  42163:08 python
1028196 user3    20   0   10.4g   1.4g 333524 S  99.3  0.2   1484:59 MATLAB
1028236 user3    20   0   10.4g   1.3g 332884 S  99.3  0.2   1487:13 MATLAB
1028412 user3    20   0   10.4g   1.3g 335700 S  99.3  0.2   1487:55 MATLAB
1028484 user3    20   0   10.4g   1.3g 334660 S  99.3  0.2   1480:15 MATLAB
1028543 user3    20   0   10.4g   1.3g 334216 S  99.3  0.2   1477:58 MATLAB
1028728 user3    20   0   10.3g   1.4g 333268 S  99.3  0.2   1478:36 MATLAB
1028312 user3    20   0   10.2g   1.3g 332724 S  98.7  0.2   1486:45 MATLAB
1028198 user3    20   0   10.4g   1.3g 334844 S  98.4  0.2   1485:36 MATLAB

The most important columns are RES (resident memory) and %CPU.

A 'g' at the end of the RES column indicates that the memory is given in gigabytes; a value without a 'g' is in kilobytes, which means it amounts to at most a few gigabytes. Thus the job being run by user1 is using 21.9GB of memory, while the job being run by user2 is using 999,460 kilobytes of memory, or about 1GB.

%CPU is measured in terms of cores, so a %CPU of 100 means a process is using one core all of the time. However, you'll rarely see 100% because processes frequently have to wait for data to be retrieved from memory or disk. Also, if a server is busy, processes will have to share cores.

Some jobs that use multiple cores will show up as multiple processes, each using a single core. The job being run by user3 is using 8 cores, though each process is spending roughly 1% of its time waiting for data. Each of those processes is using 1.3-1.4GB of memory, so the job's total memory usage is about 10.6GB.

Other jobs that use multiple cores show up as a single process with more than 100% CPU utilization (often much more). Divide by 100 to estimate the number of cores used, but keep in mind this estimate will probably be low. For example, user1's job is probably running on more than 43 cores, but not getting exclusive use of all of them. Most likely it would use all the cores in the server if it could.

Having Your Job Report Memory Usage

You can also have your job report its memory usage as it runs and store it in the job's log. That way you do not need to spend time watching top. Just put the following commands in your code at various points and read the results from your log.

Stata

The memory command will tell you how much memory Stata is using. The key number is the total in the Allocated column:

Memory usage
                                         Used                Allocated
----------------------------------------------------------------------
Data                                    3,182               67,108,864
strLs                                       0                        0
----------------------------------------------------------------------
Data & strLs                            3,182               67,108,864

----------------------------------------------------------------------
Data & strLs                            3,182               67,108,864
Variable names, %fmts, ...              4,178                   68,030
Overhead                            1,081,344                1,082,136

Stata matrices                              0                        0
ado-files                             189,424                  189,424
Stored results                              0                        0

Mata matrices                         136,000                  136,000
Mata functions                         50,320                   50,320

set maxvar usage                    5,281,738                5,281,738

Other                                   6,222                    6,222
----------------------------------------------------------------------
Total                               6,568,816               73,922,734

This job is using 73,922,734 bytes of memory, or 0.074GB.

R

The lobstr package can tell you a variety of things about your data, including memory used. Install it with install.packages("lobstr") and load it with library(lobstr). Then run:

lobstr::mem_used()
847,626,280 B

This job is using 847,626,280 bytes of memory or 0.847GB.

Python

The psutil package can tell you a variety of things about your job, including memory used. Load it with import psutil. Then run:

psutil.Process().memory_info().rss
448147456

This job is using 448,147,456 bytes of memory, or 0.448GB.
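
If you want easier-to-read numbers in your log, you can wrap this in a small helper and call it at various points in your code. The helper below is our own sketch, not part of psutil, and the labels are just examples:

import psutil

def log_memory(label=""):
    # Resident memory (RSS) of this process, converted to gigabytes
    gb = psutil.Process().memory_info().rss / 1e9
    print(f"{label}: {gb:.2f} GB used", flush=True)

log_memory("after loading data")
log_memory("after fitting model")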

Slurm Efficiency Reports

After you run a job in Slurm, the seff command can tell you how the resources your job actually used compare to what you reserved. Using the SSCC Slurm Cluster has more details.

