Identifying the Computing Resources Used by a Linux Job
You may be able to predict how many cores and how much memory your job will use, but the way to be sure is to run a test job (or part of a job) on Linstat and use the tools below to see what it actually uses.
Predicting Cores Used
Most jobs fall into one of three categories:
- Jobs that use a single core
- Jobs that use a fixed number of cores (for example, the SSCC's Stata MP license limits it to 32 cores)
- Jobs that use all the cores in the server.
Some functions that take advantage of multiple cores ask you to specify the number of cores to use. Sometimes they call them workers, threads, or processes; for our purposes these are equivalent. If your job can use all the cores in the server, use them all so it gets done as quickly as possible. (Be considerate of others by taking turns using a server rather than by sharing it.)
However, do not assume that a function will only use one core just because it does not ask how many to use. Some functions automatically try to use all the cores in the server. Occasionally researchers get themselves into trouble by trying to parallelize functions that have already been parallelized for them.
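If you are writing the parallel part of your job yourself, the following is a minimal sketch of the usual pattern in Python, using only the standard multiprocessing module. The analyze function, the chunks of work, and the worker count are all hypothetical stand-ins for your own job:
from multiprocessing import Pool, cpu_count

def analyze(chunk):
    # Placeholder for the real per-chunk computation
    return sum(chunk)

if __name__ == "__main__":
    chunks = [range(1_000_000)] * 8      # hypothetical pieces of work
    n_workers = cpu_count()              # all cores, or a fixed number on a shared server
    with Pool(processes=n_workers) as pool:
        results = pool.map(analyze, chunks)
    print(results)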
If you're not sure how many cores a function uses, run it and then monitor the job with top. You definitely don't want to claim all the cores in a server for a job that can only use one!
Predicting Memory Used
Most statistical software loads the data it will use into memory (SAS is a notable exception). The amount of memory needed by a data set is usually similar to or smaller than the amount of disk space the data set uses (in Windows, right-click on the file and choose Properties to see how big it is). Keep in mind what you will then do with the data: making a second copy of the data will double the memory used, for example. Combining data sets will often result in a data set that's bigger than the sum of the sizes of the original data.
The amount of memory you need to tell Slurm to reserve is the highest amount your job ever uses. Consider adding a safety margin, perhaps 25% of the total, as your job will only be able to use the memory you reserve and will crash if it runs out.
If your job will use all the cores in a server you can just tell it to use all the memory in the server too. No one else can use a server's memory if you're using all its cores.
Many tools report memory usage in bytes. A gigabyte or GB is defined as either 1 billion bytes or 2³⁰ bytes (1,073,741,824, which is close enough to 1 billion as not to matter for our purposes). The SSCC's Linux servers have hundreds of gigabytes of memory, so anything less than a gigabyte is trivial and can be rounded up to 1GB. If a tool reports memory in bytes and the number has fewer than ten digits, it's trivial. If the number has ten digits or more, you can ignore the last nine but round up (e.g. if you see 14,437,512,394 bytes think 15GB).
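As a rough illustration of this arithmetic, here is a short Python sketch (with a hypothetical file name) that starts from a data file's size on disk, adds a 25% safety margin, and rounds up to whole gigabytes:
import math, os

size_bytes = os.path.getsize("mydata.csv")                   # hypothetical data file
estimate_gb = max(1, math.ceil(size_bytes * 1.25 / 2**30))   # 25% margin, round up, at least 1GB
print(f"{size_bytes:,} bytes on disk -> reserve about {estimate_gb}GB")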
Monitoring Resource Use with top
The top command lists the top jobs on the server and the resources they are using. Monitoring top while your job is running is the easiest way of knowing what resources it uses, but keep in mind that your job's needs may change as it runs. Here is an example of what you'll see if you run top (usernames have been changed for privacy, and the list has been truncated):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
911054 user1 20 0 22.6g 21.9g 4384 R 4288 2.9 81566:20 a.out
275522 user2 20 0 3738648 999460 55320 R 314.5 0.1 42163:08 python
1028196 user3 20 0 10.4g 1.4g 333524 S 99.3 0.2 1484:59 MATLAB
1028236 user3 20 0 10.4g 1.3g 332884 S 99.3 0.2 1487:13 MATLAB
1028412 user3 20 0 10.4g 1.3g 335700 S 99.3 0.2 1487:55 MATLAB
1028484 user3 20 0 10.4g 1.3g 334660 S 99.3 0.2 1480:15 MATLAB
1028543 user3 20 0 10.4g 1.3g 334216 S 99.3 0.2 1477:58 MATLAB
1028728 user3 20 0 10.3g 1.4g 333268 S 99.3 0.2 1478:36 MATLAB
1028312 user3 20 0 10.2g 1.3g 332724 S 98.7 0.2 1486:45 MATLAB
1028198 user3 20 0 10.4g 1.3g 334844 S 98.4 0.2 1485:36 MATLAB
The most important columns are RES (resident memory) and %CPU.
A 'g' at the end of the RES column indicates that the memory is given in gigabytes; values without a suffix are in kibibytes (KiB, 1,024 bytes), so divide them by roughly one million to get gigabytes. Thus the job being run by user1 is using 21.9GB of memory, while the job being run by user2 is using 999,460 KiB, or about 1GB.
%CPU is measured in terms of cores, so a %CPU of 100 means a process is using one core all of the time. However, you'll rarely see exactly 100% because processes frequently have to wait for data to be retrieved from memory or disk. Also, if a server is busy, processes will have to share cores.
Some jobs that use multiple cores will show up as multiple processes, each using a single core. The job being run by user3 is using 8 cores, though each process is spending roughly 1% of its time waiting for data. Each of those processes is using 1.3-1.4GB of memory, so the job's total memory usage is about 10.6GB.
Other jobs that use multiple cores show up as a single process with more than 100% CPU utilization (often much more). Divide %CPU by 100 to estimate the number of cores used, but keep in mind that this estimate will probably be low. For example, user1's job is probably running on more than 43 cores but not getting exclusive use of all of them. Most likely it would use all the cores in the server if it could.
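If you would rather not add up the RES column by hand for a job like user3's, here is a hedged sketch that uses the psutil package (introduced below) to total the resident memory of a process and its children. The PID is a hypothetical value you would take from top, and this only works if the processes really are parent and children:
import psutil

parent = psutil.Process(1028196)                         # hypothetical PID taken from top
procs = [parent] + parent.children(recursive=True)
total_bytes = sum(p.memory_info().rss for p in procs)    # resident memory, in bytes
print(f"{total_bytes / 2**30:.1f}GB resident across {len(procs)} processes")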
Having Your Job Report Memory Usage
You can also have your job report its memory usage as it runs and store it in the job's log. That way you do not need to spend time watching top. Just put the following commands in your code at various points and read the results from your log.
Stata
The memory command will tell you how much memory Stata is using. The key number is total allocated:
Memory usage                                  Used                Allocated
----------------------------------------------------------------------
Data                                         3,182               67,108,864
strLs                                            0                        0
----------------------------------------------------------------------
Data & strLs                                 3,182               67,108,864
----------------------------------------------------------------------
Data & strLs                                 3,182               67,108,864
Variable names, %fmts, ...                   4,178                   68,030
Overhead                                 1,081,344                1,082,136
Stata matrices                                   0                        0
ado-files                                  189,424                  189,424
Stored results                                   0                        0
Mata matrices                              136,000                  136,000
Mata functions                              50,320                   50,320
set maxvar usage                         5,281,738                5,281,738
Other                                        6,222                    6,222
----------------------------------------------------------------------
Total                                    6,568,816               73,922,734
This job is using 73,922,734 bytes of memory, or 0.074GB.
R
The lobstr package can tell you a variety of things about your data, including memory used. Install it with install.packages("lobstr") and load it with library(lobstr). Then run:
lobstr::mem_used()
847,626,280 B
This job is using 847,626,280 bytes of memory or 0.847GB.
Python
The psutil package can tell you a variety of things about your job, including memory used. Load it with import psutil. Then run:
psutil.Process().memory_info().rss
448147456
This job is using 448,147,456 bytes of memory, or 0.448GB.
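If you check memory at several points, a small helper function keeps the log easy to read. Here is a minimal sketch; the checkpoint labels are hypothetical:
import psutil

def log_memory(label):
    # Report this process's current resident memory in GB
    gb = psutil.Process().memory_info().rss / 2**30
    print(f"{label}: {gb:.2f}GB resident")

log_memory("after loading data")
log_memory("after merging")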
Slurm Efficiency Reports
The seff command can tell you how the resources your job actually used compare to what you reserved; Using the SSCC Slurm Cluster has more details.
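For example, after a job finishes you would run seff with the job's ID (the ID here is hypothetical):
seff 1234567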