Getting Started in Linux
A guide to running the Linux OS with a focus on using Linstat
Table of Contents
- SSCC Linux Computers
- The Linux operating system
- How to Formulate a Linux Command
- A Few Simple Useful Utilities
- How Linux Stores Files: The Linux File System
- File Names under Linux
- Home Directories and the Present Working Directory
- Manipulating The File System
- Viewing The Contents of Files
- Using Pipes to View The Output of Commands
- Using Pipes to Print the Output of Commands
- Command Shortcuts
- Getting Help
- In Case of Emergency: What to Try When Things Go Wrong
- Managing Disk Space
- Choosing the Proper Linux Computer
- Running Jobs
- Summary of Commands
- Other Sources of Information
This handbook will introduce you to the Linux operating system, with a focus on using SSCC's Linstat servers. It is intended for those who want to use Linux for more than just a way to run statistical jobs. If your goal is just to run jobs on Linstat, Using Linstat and Managing Jobs on Linstat will probably teach you everything you need to know.
Linstat is the SSCC's cluster of servers running Linux. When you connect to Linstat, you'll be directed to one of the three Linstat servers (linstat1, linstat2 and linstat3) automatically. This will spread users among the three servers and help avoid situations where one server is much busier than another.
Linux is designed for remote logins and can be used very successfully from anywhere in the world. To connect to a Linux server you will need a client program capable of using a secure protocol, ideally SSH. X-Win32 is our suggestion for PCs. For details on downloading and using X-Win32, see Connecting to SSCC Linux Computers using X-Win32. For other options see the Connecting to Linstat section of Using Linstat.
When you are finished with your login session, be sure to log off by typing exit at the Linux prompt.
The Linux Operating System
Linux is a very powerful, flexible operating system. In a few minutes, it is possible to learn enough to get into the system, run statistical programs like Stata, and get out again. On the other extreme, those who have worked on Linux for years are still learning every day. This reflects both the power and the complexity of the operating system.
When you log in to a Linux computer, a prompt will appear on the screen, waiting for you to enter a command. At this point you can enter any valid Linux command and the computer will run it.
The syntax of a Linux command is very simple: first, enter the command name, followed by any options and any other parameters. Spaces separate the command name from the options and the options from the parameters. Once the command has been completely formed, press Enter. When you press Enter, the command is executed.
For instance, if you want to know the current date and time, use the date command. Then press Enter. The current date and time will appear, followed by another prompt. Your login session will look like this:
linstat2.ssc.wisc.edu> date
Mon Feb 18 10:52:55 CST 2008
linstat2.ssc.wisc.edu>
When the prompt appears (the prompt here is linstat2.ssc.wisc.edu>), the computer is ready for you to enter another command.
Note that the prompt will vary depending on the machine on which you are working. You can also customize the prompt to be anything you like.
Unlike some other operating systems, Linux is case sensitive. The command date is not the same as the command DATE. You must always use the proper case when running Linux commands. Fortunately, this is simple, as virtually all Linux commands are lower case.
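The command syntax and case sensitivity described above can be tried with a short sketch. It is only an illustration: the -u option and the +%Y format parameter of date are not covered elsewhere in this handbook and are used here as assumptions, and the variable names are hypothetical.

```shell
# Command name first, then an option, then a parameter, separated by spaces.
# Assumption: -u asks date for UTC and +%Y asks it to print only the year.
year=$(date -u +%Y)
echo "The year is $year"

# Command names are case sensitive: "date" exists, "DATE" normally does not.
if command -v DATE >/dev/null 2>&1; then
    upper_found=yes
else
    upper_found=no
fi
echo "Is there a command named DATE? $upper_found"
```

On a typical Linux computer the second message should report "no", because only the lower case name is installed.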
Below are some simple, useful commands that you can run right away. Try these:
> cal
displays the calendar for the current month. To see a calendar for the whole year, try:
> cal 1997
In this example, "1997" is a parameter to the command cal: it is telling cal to give information for all of 1997, instead of giving the default information for the current month. Be sure to use the "19" or cal will display the calendar for the year 97, not the year 1997.
> cal 12 1997
displays the calendar for the month of December, 1997. Here, cal is taking two parameters. The first parameter is the month and the second parameter is the year.
> who
displays a list of users currently logged into that computer, also giving the time that each user logged in.
> uptime
This extremely useful command tells the current time, how long the computer has been up, how many users are currently logged on, and how busy the computer has been for the last one, five, and 15 minutes. This is the "load average", the average number of jobs that were waiting to run in that time increment. To understand how to interpret the load average, see the System Load section later in this handbook.
> hostname
displays the name of the computer on which you are working.
> clear
clears your screen and puts a prompt on the top line of the screen.
> sscwho Andy Arnold
The sscwho command looks into the SSCC directory and displays information about the person you are looking up. In the command above, information about SSCC Director Andy Arnold will be displayed, as well as information about any other SSCC member with these names.
Most of the above commands were simple commands to run. Only one of them required parameters (sscwho) and none required options. Later, commands will be introduced that require options to provide important information. The critical point about these commands can be seen from these examples: the command comes first; spaces separate parameters from the command and parameters from each other.
All computers store files in some type of file system. These file systems largely resemble each other: individual files are referenced through folders or directories, terms that can be used interchangeably. The term "directory" is preferred by Linux users.
Two features distinguish the Linux file system from Windows:
1. Linux uses a forward slash, instead of a backslash to indicate the existence of a directory. For example, Windows might refer to a file as:
but Linux would refer to a file as:
The items "home", "r", "rdimond", and "saswork" are all directories, but the names are separated by forward slashes in Linux, not backslashes, as in Windows.
2. Linux does not start a file name with the name of a disk. On a Windows machine, the start of any file name is a disk name, such as C: for the main hard disk or A: for the floppy. Linux attempts to hide disks from the user. For instance, a directory might be called:
/home/r/rdimond
This path name refers to a directory called rdimond. The rdimond directory is in the directory called r; the r directory is in the directory called home; and the home directory is in the root directory, which is written as the forward slash ("/") at the beginning of the name. The root directory is the starting directory on Linux, from which all other files and directories descend. All files and directories on Linux exist at some place relative to the root directory. The full path name of a file always begins with a forward slash, a reference to the root.
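As a small sketch of how a full path name is built up from the root, the standard basename and dirname utilities can take the path apart one level at a time. These two commands are not introduced elsewhere in this handbook, and the variable names are hypothetical.

```shell
# Take a full path name apart, one level at a time.
path=/home/r/rdimond
name=$(basename "$path")            # last component: rdimond
parent=$(dirname "$path")           # directory that holds it: /home/r
grandparent=$(dirname "$parent")    # one level further up: /home
top=$(dirname "$grandparent")       # the root directory: /
echo "$name is in $parent, which is in $grandparent, which is in $top"
```

Neither command checks whether the directories exist, so this works on any computer.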
File and directory names under Linux are quite freeform. (In this section, we will use the expression "file names" to mean "file or directory names".) All numbers and letters of the alphabet are allowed in file names, as are several special characters such as "." (dot) and "_" (underscore). Linux has no naming regulations, such as the requirement that a dot appear in the name. However, despite having few formal rules, the following guidelines will assist you in working with files.
- The first character of a file name should be a letter of the alphabet or a number. Do not use a special character, such as a dot, a plus sign or a minus sign. Any of these could lead to difficulties when attempting to manipulate the file or directory.
- Do not use spaces or tabs in file names.
- File names with multiple periods such as filename.ext.ext are valid.
- Keep in mind that Linux is case sensitive: the names outfile, Outfile and OutFile represent three different files. However, it is not wise to create files in which the only difference among names is the case, as this can confuse PCs if you ever map your Linux home directory as a network drive on a PC.
- Although virtually all file names are legal, there are a few names that should be avoided: core and .rhosts. The system uses the name core for a dump of certain data when a command fails. (If you ever see one of these files in one of your directories, the file can be safely removed.) If you create a file called .rhosts you may unintentionally permit others to access your home directory. Of course, this is an uncommon name, and one that you are not likely to create accidentally.
- Filenames starting with a period are special files called "hidden files" and will only be displayed in a directory listing if you use ls with the -f or -a option.
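The effect of a leading period can be seen with a minimal sketch. It assumes the mktemp command is available to create a scratch directory; the file names and variable names are hypothetical.

```shell
# Work in a scratch directory so that no real files are touched.
demo=$(mktemp -d)
touch "$demo/notes.txt" "$demo/.settings"

plain=$(ls "$demo")       # the hidden file is left out
all=$(ls -a "$demo")      # the hidden file is included, along with . and ..

echo "ls shows:";    echo "$plain"
echo "ls -a shows:"; echo "$all"

rm -r "$demo"             # clean up the scratch directory
```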
File naming conventions are only conventions and are not used to distinguish file type. Some commonly-used conventions are:
.do (Stata command files)
.dta (data files stored in Stata format)
.gif (graphics file)
.gz (compressed file)
.htm (Web page)
.html (Web page)
.jpg (graphics file)
.jpeg (graphics file)
.log (SAS or Stata log file)
.lst (SAS listing)
.pdf (Adobe PDF file)
.ps (PostScript file)
.sas (SAS source file)
.sas7bdat (data files stored in SAS format)
.sps (SPSS source file)
.tar (archive file)
.tex (TeX file)
.zip (compressed file)
.Z (compressed file)
All user accounts have a part of the file system that is their own. This is called their home directory. When you first log in, Linux makes your home directory your present working directory. Your present working directory is the directory where files and directories will be listed, created, changed, or removed by default, unless you instruct the computer to perform the action in another location (examples to follow, below).
Home directories are located in a subdirectory of the directory called /home. /home consists of a series of directories, one for each letter of the alphabet. Home directories are under the letter of the alphabet corresponding to the first letter of your login name. For instance, the home directory of the user account named swald is at /home/s/swald and the home directory of the user account named mcdermot is at /home/m/mcdermot.
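The naming rule for home directories can be sketched in a few lines. This only builds the path as text from the rule described above; it does not check that the account exists, and the variable names are hypothetical.

```shell
login=swald                                 # example login name from the text
first=$(printf '%s' "$login" | cut -c1)     # first letter of the login name
home_path="/home/$first/$login"
echo "$home_path"
```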
Home directories are the place for you to put your files. You can control access permissions for files in your home directory, allowing others to see files, or to change files, or denying them these privileges.
The Linux tools used most often by users are the commands that allow users to manipulate files and directories. These commands include:
|ls||display the contents of a directory|
|pwd||display the full path name of the present working directory|
|cd||change present working directory|
|mkdir||create a new directory|
|rmdir||remove a directory|
|cp||copy a file|
|mv||move or rename a file|
|rm||remove a file|
To determine your present working directory, use the pwd command:
> pwd
/home/g/guest12
To change your present working directory, use the cd command. For example, to change to the /tmp directory (the system directory for temporary files):
> cd /tmp
Remember that a space separates the command (cd) from the parameter (/tmp). If the command is successful, it will not display any information; it will simply return a command prompt. To confirm that you really did change to the /tmp directory, issue the pwd command. For instance:
> cd /tmp
> pwd
/tmp
To return to your home directory from any other directory, enter the cd command without a parameter. For instance:
> cd
Once you change directories, one of the first things you will want to do is look at the contents of the directory. To do this, use the ls command. For instance:
> ls
bin README
There are two items in the present working directory, called bin and README. To determine if these items are files or directories, you must ask for a long listing. To do this, use the -l option (long listing) to the ls command. Options in Linux begin with minus signs and are usually one letter long. For instance:
> ls -l
total 52
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:09 bin
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README
The dash "-" in the first column of the README line indicates that this is a file. The "d" in the first column of the bin line indicates that this is a directory. The "total" line indicates how many blocks are taken up by items in this directory. It is not usually useful and can be safely ignored.
Let's look at the long listing of the README file more closely:
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README
12         3 4       5       6     7            8
The long listing provides a lot of information about the file in a single line. As stated, the first character is the file type (labeled 1 above). Generally, this will either be a dash or a d, indicating that it is an ordinary file or a directory. Following the file type are nine characters (labeled 2 above) indicating the file permissions (file permissions will be discussed in a later section). The number following this (labeled 3 above) can be ignored; it is for use by advanced Linux users. The next two fields (labeled 4 and 5 above) are the owner of the file and the group affiliation of the file. All files on the Linux file system are owned by someone and have some group affiliation. Next is the size of the file in bytes (labeled 6 above). A byte is the equivalent of a single character. Next comes the date and time that the file was modified (labeled 7 above). Finally comes the file name (labeled 8 above).
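To make the field layout concrete, the sketch below pulls the labeled fields out of the README line shown above. It works on a copy of that line stored in a variable, not on live ls output, and the variable names are hypothetical. Splitting on spaces like this is only safe for illustration, since real file names may contain spaces.

```shell
line='-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README'

file_type=$(printf '%s\n' "$line" | cut -c1)          # label 1: file type
permissions=$(printf '%s\n' "$line" | cut -c2-10)     # label 2: nine permission characters
owner=$(printf '%s\n' "$line" | awk '{print $3}')     # label 4: owner
group=$(printf '%s\n' "$line" | awk '{print $4}')     # label 5: group
size=$(printf '%s\n' "$line" | awk '{print $5}')      # label 6: size in bytes
name=$(printf '%s\n' "$line" | awk '{print $9}')      # label 8: file name

echo "$name is owned by $owner (group $group) and holds $size bytes"
```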
You can also list the contents of a directory without changing to it. To do this, give the directory name that you want listed as a parameter to the ls command. For instance:
> ls -l /tmp
total 629
-rw------- 1 rdimond  system 147456 Aug 6 22:16 Ex25804
-rw------- 1 rdimond  system  81920 Aug 6 22:15 Rx25804
-rw-r--r-- 1 root     system     59 Aug 6 13:34 lpq.00125519
-rw------- 1 flory    system 825012 Aug 5 11:54 ng5chi.dat
-rw-r--r-- 1 tpan     system   3086 Aug 6 10:43 rrn.16443
-rw-r--r-- 1 tpan     system 355337 Aug 6 10:43 rrnact.16443
drwxr-xr-x 2 pkovatch system    512 Aug 1 04:20 spss_125
Other useful options for the ls command are listed below:
|ls -a||(all) Include "dot" files, those beginning with a dot|
|ls -F||(File types) Identify file types with codes; / for directories, * for executables, and @ for symbolic links|
|ls -R||(Recursive) Recursively list all subdirectories|
|ls -r||(reverse) Sort in reverse order|
|ls -s||(size) Display the size in kilobytes|
|ls -t||(time) Sort by time modified|
|ls -u||(used) Show time of last access|
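As a brief illustration of one of these options, the sketch below uses ls -F to tell a directory from an ordinary file without a long listing. It assumes mktemp is available for a scratch directory; the names used are hypothetical.

```shell
demo=$(mktemp -d)
mkdir "$demo/bin"         # a directory
touch "$demo/README"      # an ordinary file

listing=$(ls -F "$demo")  # directories are marked with a trailing /
echo "$listing"

rm -r "$demo"
```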
Within your home directory, you have the ability to organize your files as you please. This means that you can create subdirectories within your home directory. To do this, use the mkdir command. For instance:
> ls -l
total 52
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:09 bin
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README
> mkdir homework
> ls -l
total 56
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:09 bin
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:24 homework
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README
In this example, a new directory was created called "homework". Use cd to change to the homework directory. For instance:
> pwd
/home/g/guest12
> cd homework
> pwd
/home/g/guest12/homework
If you decided that this directory was not needed after all, you could remove the directory using the rmdir command. For instance:
> ls -l
total 56
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:09 bin
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:24 homework
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README
> rmdir homework
> ls -l
total 52
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:09 bin
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README
The homework directory is now gone. This only works if the directory is empty, that is, it has no files or directories within it.
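The rule that rmdir only removes empty directories can be checked with a minimal sketch. It runs in a scratch directory created with mktemp (an assumption, as are the file and variable names), so nothing in your home directory is affected.

```shell
work=$(mktemp -d)
mkdir "$work/homework"
touch "$work/homework/essay.txt"

# First attempt: the directory still holds a file, so rmdir refuses.
if rmdir "$work/homework" 2>/dev/null; then
    first_try=removed
else
    first_try=refused
fi

# Empty the directory, then try again.
rm "$work/homework/essay.txt"
if rmdir "$work/homework"; then
    second_try=removed
else
    second_try=refused
fi

echo "first attempt: $first_try, second attempt: $second_try"
rm -r "$work"
```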
Files are created in a number of ways. You can use an editor, such as EMACS or PICO, to create a file; statistical programs, such as SAS or SPSS, create files; you might create files using a PC application like TextPad, with your Linux home directory mapped as a network drive. In any case, once files are created, it is often necessary to copy, move, rename, or remove them.
To copy a file, use the cp command. For instance, if you have a file called README and you wish to copy it to readme.new, you would do this:
> cp README readme.new
> ls -l
total 92
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:09 bin
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:32 readme.new
The original file has not been changed in any way, but a new file has been created. This new file is a copy of the original, with a different name. Also, because Linux is case sensitive, the file names were specified with the appropriate cases. The new file name has a dot in the name, and a suffix. As stated earlier, suffixes to Linux are entirely unimportant (although they may be important to particular applications!). There may be as many letters before or after the dot as desired. Finally, note that the last modification date on the new file is different from the last modification date on the old file. The new file's modification date is the creation date.
Now, let's create a directory called Documentation and move the new file to that directory using the mv command:
> mkdir Documentation
> mv readme.new Documentation
> ls -l Documentation
total 40
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:32 readme.new
The readme.new file is now in the Documentation directory (again, notice that the D in Documentation is capitalized).
The cp command can also be used to make a copy of a file, using the same file name as the original, but placing it in a different directory. For instance:
> cp README Documentation
> ls -l Documentation
total 80
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:35 README
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:32 readme.new
In this example, the file called README is copied to the directory called Documentation, the name not changing.
The mv command can be used to rename a file. For instance:
> mv Documentation/readme.new Documentation/oldreadme
> ls -l Documentation
total 80
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:32 oldreadme
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:35 README
A note of caution about using the cp and mv commands: If you copy or move a file to a file name that already exists, the existing file will be overwritten without notice.
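One possible safeguard, sketched below, is the -i option, which cp and mv also accept and which asks before overwriting an existing file. This option is not described for cp and mv elsewhere in this handbook, so treat it as an assumption to verify on your system. The example answers "n" to the prompt automatically so it can run unattended; the file names are hypothetical.

```shell
demo=$(mktemp -d)
echo "final results" > "$demo/report.txt"
echo "rough draft"   > "$demo/draft.txt"

# With -i, cp asks before overwriting. Answering "n" keeps the old file.
# The prompt goes to stderr, and the exit status of a declined copy is ignored.
echo n | cp -i "$demo/draft.txt" "$demo/report.txt" 2>/dev/null || true
kept=$(cat "$demo/report.txt")

# Without -i, the same copy overwrites the existing file without notice.
cp "$demo/draft.txt" "$demo/report.txt"
lost=$(cat "$demo/report.txt")

echo "after cp -i: $kept"
echo "after plain cp: $lost"
rm -r "$demo"
```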
Now the Documentation directory has two copies of the same file with two different names. You can remove a file using the rm command. For instance:
> cd Documentation
> rm oldreadme
> ls -l
total 40
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:35 README
You can also remove the Documentation directory and all of its contents, but you cannot use the rmdir command, which is only for removing empty directories. To remove a directory, including all of its contents, use the -r option to the rm command. For example:
> cd
> rm -r Documentation
> ls -l
total 52
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:09 bin
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README
This will remove the Documentation directory and all of its contents with no questions asked. This is somewhat dangerous. A better way to use rm is to add the -i option, which forces you to confirm that you really want to remove each file or directory. For example:
> cd
> rm -r -i Documentation
rm: remove Documentation/README? y
rm: remove Documentation? y
> ls -l
total 52
drwxrwx--- 2 guest12 guest12  4096 Feb 18 11:09 bin
-rw-r----- 1 guest12 guest12 38331 Feb 18 11:13 README
The rm command now asks you to confirm that you really want to remove each item. You can answer y or Y (or any other answer that begins with a y or Y, such as yes, yep or yessireebob) and the item will be removed. Any other answer and the item will not be removed.
One warning about removing Linux files: once a file is removed, it may be gone forever. When a user accidentally removes a file, SSCC staff can sometimes restore the file from the nightly backups, but this is not always possible. Use the -i option when using the rm command to protect your data.
To view the contents of a file, you can use the more command:
> more filename
Replace "filename" with the name of the file you wish to view. The file will be displayed one screenful at a time. There are many subcommands within more, but the following are the most useful:
|space||scroll down a full screen|
|Enter||scroll down a single line|
|b||scroll up a full screen|
|q||quit out of more and return to the command line|
To use a subcommand, simply type in the command when the system pauses after displaying a screen of information.
Very often, the information scrolling across the screen is not the contents of a file, but other information, such as the long listing of a directory. You can still use the more command to view the output, but you use it through a special Linux feature called a pipe. To use a pipe, type the command as you usually would, but after the command, instead of pressing Enter, type the pipe symbol "|" (depicted on your keyboard as a solid or broken vertical line) and then type the more command. This will take the output of the first command and feed it to the more command. For instance:
> ls -l /tmp | more
This can be used with any command that displays more than a screen full of information. For example:
> cal 1997 | more
This command would display the calendar for 1997, but it would be displayed within the more command, allowing you to scroll up or down, as desired.
Pipes are one of the most powerful features of Linux.
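Pipes are not limited to more. As a rough sketch, the output of ls can be piped to wc -l to count the entries in a directory. The wc and tr commands are not covered in this handbook and are used here as assumptions, as are the file names.

```shell
demo=$(mktemp -d)
touch "$demo/a.dat" "$demo/b.dat" "$demo/c.dat"

# ls writes one name per line into the pipe; wc -l counts the lines;
# tr removes any padding spaces that some versions of wc add.
count=$(ls "$demo" | wc -l | tr -d ' ')
echo "$count files"

rm -r "$demo"
```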
Linux pipes give the user the ability to print any data that can be displayed on the screen. For instance, if you wish to print out a listing of your home directory, do the following:
> ls -l | enscript
In this example, no listing is printed to the screen; the computer returns a prompt to you without showing you the listing. The output of the ls command is sent to the default printer.
Once users begin to use Linux commands with some regularity, they rapidly start to desire certain shortcuts for some operations. Linux provides shortcuts and alternative methods for performing actions in abundance. This section introduces some relatively simple shortcuts that are not necessary for users to perform their work, but may be useful to beginning level students.
Wildcard characters allow you to specify many files at once, or to specify a single file concisely. The wildcard characters are the asterisk (*), the question mark (?), and the square brackets ([]). You can use wildcard characters with commands like ls, cp, mv and rm to perform an action on several files. Below are examples of the use of wildcard characters with the ls command:
> ls R*
The asterisk means "zero or more of any character." In this example, the ls command listed two files beginning with an R.
> ls *.old
Wildcard characters can appear anywhere in a file name: in the beginning, middle, or end. In this example, the ls command listed two files ending with .old.
> ls *old*
Multiple wildcard characters can be used. In this example, the ls command listed three files that had old somewhere within the file name.
> ls hmwork?
A question mark stands for exactly one character of any kind. In this example, the ls command listed four files that started with hmwork and then had a single character following.
> ls hmwork[2-4]
Square brackets stand for one character within the list or range shown. In this example, the ls command listed three files that started with hmwork followed by a single character in the range 2 to 4. The range could also have been a to z (including all lower case letters), or N to m (including the second half of capitalized letters and the first half of lower case letters).
Any of these wildcard characters can be used multiple times, and in combination with each other.
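The wildcard patterns above can be tried safely in a scratch directory. The sketch below creates a few empty files with hypothetical names and uses echo to show what each pattern expands to; it assumes mktemp is available.

```shell
demo=$(mktemp -d)
touch "$demo/notes.old" "$demo/plan.old" \
      "$demo/hmwork1" "$demo/hmwork2" "$demo/hmwork3" "$demo/hmwork4"

# Each cd happens inside $( ), so your own working directory is unchanged.
ends_old=$(cd "$demo" && echo *.old)          # names ending in .old
has_old=$(cd "$demo" && echo *old*)           # names containing old anywhere
one_char=$(cd "$demo" && echo hmwork?)        # hmwork plus any one character
in_range=$(cd "$demo" && echo hmwork[2-4])    # hmwork plus 2, 3 or 4

echo "$ends_old"
echo "$has_old"
echo "$one_char"
echo "$in_range"
rm -r "$demo"
```

Checking a pattern with echo or ls first is a cautious habit before using the same pattern with rm.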
As configured for new SSCC users, Linux allows you to use the tilde (~) as an abbreviation for your home directory. In any command where you want to specify your home directory, you may use the tilde instead. For example:
> cd ~/data
> ls ~
The user changed to the data subdirectory of her home directory and then listed the contents of her home directory.
The tilde followed immediately by a user's login name is an abbreviation for that user's login directory. For example:
> ls ~smith
> cd ~jones/sas
This will list the directory called /home/s/smith and then change to the directory called /home/j/jones/sas provided the proper permissions are set on the directories.
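To see what the tilde stands for on your own account, a minimal sketch is to let the shell expand it and print the result. The data subdirectory does not need to exist for this, and the variable names are hypothetical.

```shell
home_dir=$(echo ~)         # the shell replaces ~ with your home directory
data_dir=$(echo ~/data)    # ~ can be followed by the rest of a path

echo "Home directory: $home_dir"
echo "Data directory: $data_dir"
```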
Path Abbreviations: The . and ..
Two other abbreviations, the .. and the . are shortcuts that can save you keystrokes. .., also called dot-dot, can be used to refer to the directory up one level from the current directory. For example:
> pwd
/home/g/guest12/homework
> cd ..
> pwd
/home/g/guest12
> cd ..
> pwd
/home/g
> cd
> pwd
/home/g/guest12
Each cd .. command moved the present working directory up one level. The cd command without a parameter moved the present working directory back to the home directory, as we saw before.
., also called dot, is a shortcut used to refer to the current directory. For example:
> mv /project/sandefur/wave9/ameier/2003/readme.new .
moves the file readme.new from the location specified to the user's current working directory.
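Both abbreviations can be tried in a scratch directory, as in the sketch below. The directory and file names are hypothetical, and mktemp is assumed to be available.

```shell
base=$(mktemp -d)
mkdir -p "$base/project/homework"
touch "$base/project/homework/readme.new"

# ".." is the directory one level up from the current one.
up_one=$(cd "$base/project/homework" && cd .. && pwd)

# "." is the current directory: move the file from homework into it.
(cd "$base/project" && mv homework/readme.new .)
if [ -f "$base/project/readme.new" ]; then
    moved_here=yes
else
    moved_here=no
fi

echo "cd .. led to: $up_one"
echo "file moved into current directory: $moved_here"
rm -r "$base"
```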
As configured for new SSCC users, Linux allows users to edit the command line. This can range from simply rerunning the previous command to modifying the command currently on the screen. Editing is done with the arrow keys. Use the up arrow to display previous commands. Each press of the up arrow key steps backwards through the list of previous commands. When you find the command that you want to rerun, simply press Enter. If you go past the command, use the down arrow to step forward through commands.
If you find a command that you want to rerun, but it is slightly off, use the left and right arrows to move across the command line, use the backspace key to remove a character, and add any character you wish. When the command is properly displayed, press Enter to execute the command.
The exclamation point can also run a previous command. Type an exclamation point followed by the first letters of a command and the last command that began with those letters will be rerun. For example:
This will run the last emacs command. This might be quite useful if, for instance, the last emacs command was something like:
> emacs ~jones/progs/oldstuff/dissert.dat
On-line help is available on Linux through the command called man, which is short for manual pages. The man command displays reference pages on the screen. These pages can be written obscurely. If you do not understand a reference page, contact SSCC's help desk for assistance.
If you don't know exactly what command you need to use, you can find a command using the -k option to the man command. The -k option searches for key words in the NAME section of the man page. For example:
> man -k compare
will list on the screen Linux commands that can be used to compare files.
Sometimes the system just stops working properly for no reason apparent to the new user. When this happens, here are a few keystrokes that might help you.
The <Ctrl-C> keystroke is the interrupt command. It should cancel the current operation and return the prompt to the screen.
The <Ctrl-S> keystroke stops items from displaying on the screen temporarily. This is not useful to a beginning Linux user, but users may accidentally type this, perhaps when intending to type an upper case S. The <Ctrl-Q> keystroke will override the <Ctrl-S> keystroke, allowing the screen to begin displaying again.
Sometimes the computer is taking input and waiting for the end of the input. A <Ctrl-D> is the end of file (or end of input) character. Type this keystroke if the system is awaiting input from you and you have given it all the input. This may happen when, for instance, you use the cat command but forget to give the file name. The system will wait for you to type in what you want printed to the screen. It will take as many characters as you can type, including returns, and will not return the prompt to you until it gets the end of file character, the <Ctrl-D>.
In this section you will learn about the disk space available to you at SSCC and how to manage it.
SSCC provides two categories of storage space for individual users: home directory space and short term disk space. Both types of individual disk space are described in the SSCC's Member Handbook including quotas and backup policies.
If you are working on a research project with a group of people, we can provide you with separate storage space on Windows or Linux that you can all share. If you'd like project space you may fill out the online form. If you need your account added to a research project space, ask the person who set up the project (usually a faculty member) to contact SSCC's Help Desk on your behalf.
Please help keep costs down by using disk space wisely:
- Compress large files.
- Remove unneeded files.
- Move files to project disks, if appropriate.
- Do not make copies of standard data files archived by CDE or other agencies or individuals.
To determine how much disk space you are using, use the quota command. For example
> quota
Disk quotas for user rdimond (uid 1931):
Filesystem       blocks  quota    limit    grace  files  quota  limit  grace
griffon:/home/t  936904  1024000  1024000         8119   0      0
The column labeled "blocks" shows the amount of disk space you are using, in kilobytes. The first quota column tells what your current disk quota is.
Often, this is not sufficient information. You want to know specifically which directories are using the disk space. To determine this, use the du command, which will tell you how many kilobytes are in each of your subdirectories. For example:
> du -k ~
29414 /home/s/somerset/data
8 /home/s/somerset/News
240 /home/s/somerset/Stuff
224 /home/s/somerset/Personal/gifs
77 /home/s/somerset/Personal/letters
2329 /home/s/somerset/Personal
164 /home/s/somerset/docs/reqs
703 /home/s/somerset/docs/faqs
13 /home/s/somerset/docs/tmp
42 /home/s/somerset/docs/soc361
1569 /home/s/somerset/docs/soc365
339 /home/s/somerset/docs/olddocs/homework
19878 /home/s/somerset/docs/olddocs
9049 /home/s/somerset/docs/travel
35343 /home/s/somerset/docs
202 /home/s/somerset/jobsearch/apps/old
221 /home/s/somerset/jobsearch/apps
238 /home/s/somerset/jobsearch
8336 /home/s/somerset/saslib
155 /home/s/somerset/practice
80024 /home/s/somerset
This user is using 80 MB of disk space. Most of the disk space usage is in the docs subdirectory, particularly in the olddocs subdirectory of the docs directory. Also, a lot of disk space is being used by the data directory.
You can also get a complete listing of the sizes of all files using the -a option to the du command. For example, below might be the output of the du -ak command, after the output has been sorted (numerically, and in descending order) and the first ten lines requested (the head command):
> du -ak ~ | sort -n -r | head
80024 /home/s/somerset
35343 /home/s/somerset/docs
29414 /home/s/somerset/data
19878 /home/s/somerset/docs/olddocs
11088 /home/s/somerset/docs/olddocs/thesis
9049 /home/s/somerset/docs/travel
8336 /home/s/somerset/saslib
7712 /home/s/somerset/data/brazil
6208 /home/s/somerset/saslib/course.ssd04
5264 /home/s/somerset/docs/olddocs/diagrams
This output includes both files and directories. A comparison with the output from the du -k, above, shows that the largest files are ~somerset/docs/olddocs/thesis, ~somerset/data/brazil, ~somerset/saslib/course.ssd04, and ~somerset/docs/olddocs/diagrams. In the interest of conserving disk space, user somerset may want to delete or compress some of these files.
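The same du, sort and head pipeline can be rehearsed on a small scratch directory before running it on a large home directory. In the sketch below the directory layout, file sizes and variable names are all hypothetical; dd, /dev/urandom and mktemp are assumed to be available and are used only to create files of known size.

```shell
demo=$(mktemp -d)
mkdir "$demo/docs"
# Create one file of about 200 KB and one of about 20 KB.
dd if=/dev/urandom of="$demo/docs/thesis" bs=1024 count=200 2>/dev/null
dd if=/dev/urandom of="$demo/notes"       bs=1024 count=20  2>/dev/null

# Sizes in kilobytes for every file and directory, largest first, top three only.
report=$(du -ak "$demo" | sort -n -r | head -n 3)
echo "$report"

# du separates the size from the path with a tab; keep the path of the first line.
largest=$(printf '%s\n' "$report" | head -n 1 | cut -f2)
rm -r "$demo"
```

The top line is always the directory given to du, since its total includes everything beneath it.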
To determine the amount of disk space available on a project disk, use the df command. For example, if you own a directory called /project/irp/bozeman, you can determine the total amount of free space by running this df command:
> df -k /project/irp/bozeman
Filesystem 1024-blocks Used Available Capacity Mounted on
irp1#irp 8220960 1692974 6507568 21% /project/irp
In this example, about 6.5 GB of disk space is available. Again, the units are kilobytes, which was requested when the -k flag was used.
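If you only want the number of free kilobytes, the Available column can be picked out of the df output. The sketch below is one way to do it, under two assumptions not covered in this handbook: the -P option, which asks df for a fixed one-line-per-file-system layout, and the awk command, which extracts the fourth column. It checks the current directory, not a real project directory.

```shell
# Line 1 of the output is the header; line 2 describes the file system.
avail_kb=$(df -kP . | awk 'NR == 2 {print $4}')
avail_mb=$((avail_kb / 1024))
echo "About $avail_mb MB free in the file system holding the current directory"
```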
A good way to save disk space is to compress files. A compression savings rate of 75% is typical and even 95% is achievable, particularly for ordinary data files.
Two compression programs are commonly used on Linux: compress and gzip. The syntax for both is basically the same: issue the command, followed by the name of the file you wish to compress. The -v option is useful, as the compression commands will tell you the percentage of file space you saved by compressing the file. For example:
> compress -v vt20.alpha.tar
vt20.alpha.tar: Compression: 74.18% - replaced with vt20.alpha.tar.Z
> gzip -v vt20.alpha.tar
vt20.alpha.tar: 89.2% -- replaced with vt20.alpha.tar.gz
The compression commands change the names of the files: compress adds a ".Z" suffix and gzip adds a ".gz" suffix.
To uncompress files, use the commands uncompress or gunzip:
> uncompress vt20.alpha.tar.Z
> gunzip vt20.alpha.tar.gz
The compressed file will be replaced by an uncompressed file without the suffix.
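The round trip can be checked on a scratch file. The sketch below assumes a Bourne-style shell and uses gzip only; the file names and the generated data are hypothetical, and the savings you see will depend on your own data.

```shell
#!/bin/sh
work=$(mktemp -d)
# Build a repetitive text file; data like this usually compresses well.
awk 'BEGIN { for (i = 1; i <= 5000; i++) print "id," i ",value,0" }' > "$work/sample.csv"
cp "$work/sample.csv" "$work/original.csv"
before=$(wc -c < "$work/sample.csv" | tr -d ' ')

gzip -v "$work/sample.csv"        # replaces sample.csv with sample.csv.gz
after=$(wc -c < "$work/sample.csv.gz" | tr -d ' ')
echo "bytes before: $before, after: $after"

gunzip "$work/sample.csv.gz"      # restores sample.csv and removes the .gz file
if cmp -s "$work/sample.csv" "$work/original.csv"; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "round trip: $roundtrip"
rm -rf "$work"
```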
Once compressed, files can be uncompressed and then used. However, it is inefficient, both with respect to SSCC computing resources and your time, to constantly uncompress and then recompress files, particularly large data files. There are two ways to use compressed files without uncompressing them. First, some data analysis programs allow you to read in compressed data. Second, some programs that cannot use compressed data can read data from a special type of file called a named pipe.
Programs such as SAS, SPSS, and STATA allow data to be read from the output of commands. Using the zcat command or the gunzip -c command, the compressed file can be printed to standard output so that software programs can read the files. For instructions on how to use compressed data with commercial software programs, see SSCC Knowledge Base articles on the use of these programs available on SSCC's web site.
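Both approaches can be sketched with ordinary Linux tools standing in for the statistical package. The example assumes a Bourne-style shell; the file names are hypothetical, and wc and awk play the part of the program that reads the data.

```shell
#!/bin/sh
work=$(mktemp -d)
# Hypothetical data set: the integers 1 to 1000, one per line, then compressed.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$work/data.txt"
gzip "$work/data.txt"

# 1) Pipe: gunzip -c writes the uncompressed data to standard output,
#    and the next command reads it from standard input.
lines=$(gunzip -c "$work/data.txt.gz" | wc -l | tr -d ' ')

# 2) Named pipe: a program that insists on a file name can read from the pipe.
mkfifo "$work/data.pipe"
gunzip -c "$work/data.txt.gz" > "$work/data.pipe" &
total=$(awk '{ s += $1 } END { print s }' "$work/data.pipe")
wait                              # let the background gunzip finish

echo "lines: $lines, sum: $total"
ls "$work"                        # data.pipe and data.txt.gz only; no uncompressed copy
rm -rf "$work"
```

In neither case is an uncompressed copy of the data written to disk, which is the point of the technique.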
In addition to the three Linstat servers, SSCC also has a Condor Flock and High Performance Computing cluster for running large jobs. When selecting a Linux computer on which to run a job, you must consider which machines have the software that you want to use and which machines have the computing resources necessary for your project. Visit our Computing Resources at the SSCC web page for details.
SSCC has a cluster of Linux servers for running large STATA, SAS, R, MatLab, Fortran, and C/C++ programs. This cluster has a powerful batch pooling utility installed called Condor which was developed at UW-Madison's Computer Science Department. For more information on Condor, refer to the SSCC Knowledge Base article, An Introduction to Condor.
The SSCC has a High Performance Computing cluster called FLASH. See Using the SSCC's High Performance Computing Cluster for instructions on using these machines. If you have parallelized C/C++, Fortran, or R programs you'd like to run on this cluster, please contact Ryan Horrisberger.
Almost all the software installed on Linstat is installed on all three Linstat servers. The two exceptions (due to licensing restrictions) are SPSS and Stat/Transfer. They are installed on Linstat1. If you run SPSS or Stat/Transfer on another Linstat server they will automatically connect to Linstat1 and run your job there, but if you need to manage that job later you'll need to log in to Linstat1 to do so.
Software availability information for all of SSCC's computers can be found on SSCC's Software Availability web page.
The three Linstat servers have very similar processors. However, for large jobs that will take more than a few minutes to run, Condor is ideal. Please see An Introduction to Condor.
If you are going to use the computer intensively, for a Stata program, for example, then you should look for a machine that is not busy. There are several ways to determine if a machine is busy, and, if it is busy, what it is doing.
The Linux operating system provides its own set of commands to get this information. For example:
> uptime
13:33 up 1 day, 2:34, 4 users, load average: 3.36, 3.31, 3.47
The uptime command tells the current time, the length of time the computer has been running (in this example, one day, two hours, 34 minutes), how many users are currently logged onto the system and the load average for the past one, five, and 15 minutes. The load average is the average number of jobs waiting to run over the particular time increment. The higher the number, the busier the system. A Linstat server is busy if its load average exceeds four and is very busy if its load average exceeds six.
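The rule of thumb above can be written as a small script. This sketch assumes a Bourne-style shell and uses the sample uptime line from this section instead of live output, so the result is repeatable. The function name classify_load is made up for illustration, and the exact wording of uptime output varies between systems.

```shell
#!/bin/sh
# Classify a load average using the thresholds given in the text:
# above four is busy, above six is very busy.
classify_load() {
    awk -v load="$1" 'BEGIN {
        if (load > 6)      print "very busy"
        else if (load > 4) print "busy"
        else               print "not busy"
    }'
}

# Sample line copied from the example; on a live system use: sample=$(uptime)
sample='13:33 up 1 day, 2:34, 4 users, load average: 3.36, 3.31, 3.47'

# Keep the text after "load average:" and take the first (one-minute) value.
one_min=$(echo "$sample" | sed 's/.*load average: *//' | cut -d, -f1)
state=$(classify_load "$one_min")
echo "one-minute load $one_min: $state"
```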
To find out how busy Condor is, use the condor_status command.
Another excellent command for monitoring system activity is the top command. The top command lists jobs currently running, ordered by CPU usage, with the command using the greatest amount of CPU time on top of the list. The output of the top command looks like this:
load averages: 0.16, 0.24, 0.23    15:33:03
94 processes: 1 running, 1 waiting, 15 sleeping, 75 idle, 2 stopped
Cpu states: 10.0% user, 0.0% nice, 7.9% system, 82.0% idle
Memory: Real: 471M/767M act/tot  Virtual: 16M/2243M use/tot  Free: 181M

  PID USERNAME PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
 9124 esimpson  42    0 8192K 1327K WAIT    4:27 10.50% sas
10235 odrucker  42    0 7736K 4128K sleep   0:03  1.80% stata
 9387 mcdermot  44    0 2504K  393K run     0:00  0.40% top
  896 root      44    0 1704K  229K sleep   0:01  0.10% telnetd
   77 root      42    0 1600K   57K sleep  19:30  0.00% update
  488 root      44    0 1728K  122K sleep   0:38  0.00% snmpd
  365 root      44    0 2032K  335K sleep   0:22  0.00% rpc.lockd
  561 root      44    0 1992K  106K sleep   0:17  0.00% httpd
 8463 swald     42    0 4488K  180K sleep   0:12  0.00% xterm
  484 root      44    0 2432K  204K sleep   0:11  0.00% os_mibs
    1 root      44    0  440K   40K sleep   0:07  0.00% init
32490 root      44    0 1704K   40K sleep   0:05  0.00% telnetd
  150 root      44    0 1656K  122K sleep   0:02  0.00% syslogd
  452 root      32  -12 2072K  270K sleep   0:02  0.00% xntpd
 8459 mcdermot  44    0 4464K  729K sleep   0:01  0.00% xterm
The listing is updated every few seconds. The load averages are on the first line. Next is a count of the processes on the system (94 in this example), broken down by state. The third line shows the percent of time the CPU is spending in various modes. The most important item in this line is the idle percentage. If the idle percentage is high, as it is here (82%), the computer is not busy. The fourth line shows how much memory is in use.
The table at the bottom is the most interesting part of top output. It lists jobs that are currently running. In this snapshot, user esimpson is running SAS, using about 10% of CPU time. User odrucker is running Stata. He is taking about 2% of the CPU time. The top command is taking about half a percent and other commands are taking trivial amounts.
Because the Linstat servers have multiple CPUs, the percent of CPU used may total as much as 4800%. Enter q to exit the top command.
Any time you give Linux something to do, you've created a job. Of course many Linux commands execute almost instantly (cd, ls, etc.), but others may run for hours, days, or even longer. In these cases, how a job is run will impact both what you can do and how the system performs for all other users. The SSCC's Linux servers are a shared resource, and it is up to each member to share nicely.
Linux was designed to have many tools that do specialized tasks. In the Linux model, data flows from one command to another command, each command doing what it does best. To implement this model, every Linux command has three files associated with it. These files are called:
- standard input
- standard output
- standard error
Standard input is the place from which commands get their data. By default, this is the keyboard. Standard output is the place that commands put their output. By default, this is the screen. Standard error is the place that commands put their error messages. By default, this is the screen, also. But it is important to note that standard output and standard error are not the same thing. It just happens that, by default, they send data to the same place. Collectively, these are called standard input and output, or standard I/O, abbreviated stdio.
Standard I/O can be redirected so that it comes from, or goes to, any place. Standard input can come from the keyboard, or from a file, or from another command. Standard output can be sent to the screen, or to a file, or into another command (as standard input to that command). This is the power of the standard I/O system.
The symbols used to redirect output are:
|>||redirect stdout from command to a file|
|>>||redirect stdout from command to a file, appending|
|>&||redirect stdout and stderr from command to a file|
|>>&||redirect stdout and stderr from command to a file, appending|
|<||redirect stdin from file to a command|
||||pipe the stdout of one command into the stdin of another command|
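The >& and >>& forms in the table are the csh/tcsh spellings. In a Bourne-style shell (sh or bash), which the sketch below assumes, the equivalent is written "> file 2>&1". The function emit and the file names are hypothetical; the function simply writes one line to each stream so the difference between standard output and standard error is visible.

```shell
#!/bin/sh
work=$(mktemp -d)
emit() {
    echo "result line"          # written to standard output
    echo "warning line" >&2     # written to standard error
}

# Send the two streams to separate files.
emit > "$work/out.txt" 2> "$work/err.txt"

# Send both streams to one file, then append both streams to it again.
emit >  "$work/both.txt" 2>&1
emit >> "$work/both.txt" 2>&1

out=$(cat "$work/out.txt")
err=$(cat "$work/err.txt")
both_lines=$(wc -l < "$work/both.txt" | tr -d ' ')
echo "stdout file: $out"
echo "stderr file: $err"
echo "lines in combined file: $both_lines"
rm -rf "$work"
```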
One of the most common ways to manipulate standard I/O is to redirect standard output from a command into a file. For example, if you want to save a long listing of one of your directories, you can do this:
% ls -l Documentation > doc.list
The "greater than" sign (>) redirects data from the ls command to a file called doc.list. Without the redirection, the listing would appear on the screen, but with the redirection, the command only returns the prompt, with no listing. If the file doc.list already exists, then it will be overwritten by the data from the ls command. To append data to the file, instead of overwriting the current data, use two "greater than" signs:
% ls -l Documentation > doc.list
% ls -l Programs >> doc.list
In this example, the first command redirected the listing of the Documentation directory into the file called doc.list, creating a new file or overwriting an existing file. Then, the second command appended the listing of the Programs directory into the doc.list file. The doc.list file contains listings for both directories, now.
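The difference between overwriting and appending can be checked on scratch directories. This sketch assumes a Bourne-style shell; the directories mirror the names used above, and the files inside them (guide.txt, model.do) are made up for illustration.

```shell
#!/bin/sh
work=$(mktemp -d)
mkdir "$work/Documentation" "$work/Programs"
touch "$work/Documentation/guide.txt" "$work/Programs/model.do"

# ">" creates doc.list, or overwrites it if it already exists.
ls -l "$work/Documentation" > "$work/doc.list"
first=$(wc -l < "$work/doc.list" | tr -d ' ')

# ">>" adds the second listing to the end of the same file.
ls -l "$work/Programs" >> "$work/doc.list"
second=$(wc -l < "$work/doc.list" | tr -d ' ')

has_guide=$(grep -c 'guide.txt' "$work/doc.list")
has_model=$(grep -c 'model.do' "$work/doc.list")
echo "lines after >: $first, lines after >>: $second"
rm -rf "$work"
```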
Some commands can take information from sources other than the keyboard. They use standard input. For instance, if you wanted to mail the doc.list file to someone, you could use the Mail command to do so, instead of invoking pine or another mailer:
% Mail -s "Documentation Listing" odrucker < doc.list
In this example, the Mail command is used. Mail is sent to odrucker with the subject line "Documentation Listing" (the parameter to the -s option) and the contents of the mail message is the doc.list file.
The most common use of redirection of standard I/O is with pipes, which take the output of one command and give it to the input of another command. Some common uses are exemplified below:
% ls -l Documentation | enscript
In this example, the listing of the Documentation directory is sent directly to the enscript command so that the file can be printed. The listing is never saved on disk or displayed on the screen.
% ls -lR | more
The -R option to ls instructs ls to recursively list all directories and subdirectories. This could lead to a very long list. In this example, the output of the ls command is piped through the more command, allowing you to read the listing one screen at a time.
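Input redirection and pipes can be tried without a printer, a mailer, or an interactive pager. The sketch below assumes a Bourne-style shell; sort, head, and wc stand in for Mail, more, and enscript, and the file names are hypothetical.

```shell
#!/bin/sh
work=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$work/fruit.txt"

# "<" makes the file the standard input of sort; the pipe then hands the
# sorted output to head, which keeps only the first line.
first_sorted=$(sort < "$work/fruit.txt" | head -n 1)

# The standard output of ls becomes the standard input of wc, which counts lines.
mkdir "$work/dir"
touch "$work/dir/a" "$work/dir/b" "$work/dir/c"
count=$(ls "$work/dir" | wc -l | tr -d ' ')

echo "first in sorted order: $first_sorted"
echo "entries in directory: $count"
rm -rf "$work"
```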
Normally when you type a command, it is processed and you see the results (if any) before the cursor returns and you can type a new command. These jobs are said to be running in the foreground, and that may be exactly what you want if your job will run very quickly or you cannot proceed until you have your results. But you can tell Linux not to wait. When you put a job in the background, the cursor returns immediately and you can keep giving commands and doing other work while your job is running. When it finishes, a message will appear on your screen.
To run a job in the background, simply add an ampersand (&) at the end of the command line. For example:
> stata -b do myprogram
Stata will start and run myprogram.do in the foreground. Thus the session will be unavailable until the job is done. On the other hand,
> stata -b do myprogram &
will start Stata in the background. The cursor returns immediately, and the user can edit other programs, organize files, etc. while waiting for the job to finish. When it is done you will see:
[1] Done stata -b do myprogram
Note that a job which creates a separate window (emacs, for example) will be completely functional in the background. What makes it a background process is that your shell (the main session window) is ready for more commands. On the other hand if a program without a window is running in the background and needs input from you (for example if SAS runs out of resources), it will halt until you put it in the foreground and give it the input it needs.
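Inside a script the same idea looks like this. The sketch assumes a Bourne-style shell; long_job is a hypothetical stand-in for a batch run such as the Stata example, and it only sleeps for two seconds before writing a log file.

```shell
#!/bin/sh
work=$(mktemp -d)
# Hypothetical stand-in for a long batch job such as: stata -b do myprogram
long_job() {
    sleep 2
    echo "job finished" > "$work/job.log"
}

long_job &                 # the ampersand starts the job in the background
job_pid=$!                 # process ID of the job just started
echo "shell is free again; job $job_pid is still running"

# The log does not exist yet because the job has not finished.
if [ -e "$work/job.log" ]; then early=yes; else early=no; fi

wait "$job_pid"            # block here until the background job is done
result=$(cat "$work/job.log")
echo "$result"
rm -rf "$work"
```

In an interactive session you would normally not call wait; the shell prints the "Done" message on its own when the job ends.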
Note that a job running in the background will keep running even if you log out, so it is quite possible to start a long job before you leave in the evening, log out, and get the results the next morning. Remember that Linstat is actually a cluster of three servers and when you log in you're assigned to a server randomly (to try to balance the load between them). However, you can choose to connect to a specific server to monitor a job you started previously or if the server you're assigned to turns out to be particularly busy.
To switch to a different server, type:
where server can be linstat1, linstat2 or linstat3. Alternatively you can set up your client program to log in to one of those three servers directly.
Switching Between Foreground and Background
If you have a job running in the foreground and you want to do something else, simply press CTRL-z (note that if the current job has opened a window of some sort, you must return to your shell window before pressing CTRL-z). The current job will be suspended and you will get your cursor back. If you want the job to run while you are doing other things, type bg to put it in the background. You can also type fg to move it back to the foreground, either from being suspended or from the background.
Managing Background Jobs
It can be very easy to lose track of jobs you have running in the background, but there are several commands that can tell you about them.
jobs will list all the jobs you started this session that are not yet complete. For example:
> jobs
[1]  - Running    emacs
[2]  + Suspended  emacs
The number in brackets is the job number, and you can use that number preceded by a percent sign (%) to refer to the job. Naming a job will move it to the foreground, so in this case %2 is similar to fg (except you don't have to keep track of which job is considered the "current" job). Adding an ampersand moves it to the background, so %2 & is similar to bg.
You can list jobs started in a previous session using the ps command (think processes). The syntax is ps x -u username. For example:
> ps -u rdimond
  PID TTY          TIME CMD
29413 pts/30   00:00:00 tcsh
 1601 pts/30   00:00:00 emacs
 1602 pts/30   00:00:00 emacs
 1605 pts/30   00:00:00 ps
Note how the bracketed numbers have been replaced by the PID (Process IDentification) and the list is more complete, including your shell (in this case the tcsh shell), and the ps command itself. Note that PID's cannot be used to move things from foreground to background. On the other hand this is the only way to check on jobs from previous sessions.
Sometimes you will change your mind about a job, and occasionally things even go wrong. In these cases, the kill command can be invaluable. Simply type kill and then the job number or PID. For example:
> kill %2
> kill 1602
This doesn't actually stop the job; it merely requests that it shut down, giving the program an opportunity to clean up temporary files and such. Unfortunately neither SAS nor SPSS will do so, so if you kill one of these jobs, please go to the /tmp directory and manually delete all files and directories belonging to you. On the other hand, adding the -9 signal to the kill command will kill a program immediately with or without its consent. Thus:
> kill -9 1602
will kill process 1602.
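The difference between the two forms of kill can be seen with a small stand-in job. This sketch assumes a Bourne-style shell; the background loops are hypothetical programs that remove their temporary file when asked politely to stop. A plain kill lets the cleanup run, while kill -9 leaves the temporary file behind.

```shell
#!/bin/sh
# --- plain kill: the default signal (TERM) can be caught by the program ---
tmp1=$(mktemp)
(
    trap 'rm -f "$tmp1"; exit 0' TERM     # clean up, then exit
    while :; do sleep 1; done
) &
pid1=$!
sleep 1                                   # give the job time to start
kill "$pid1"
wait "$pid1"
if [ -e "$tmp1" ]; then cleaned=no; else cleaned=yes; fi

# --- kill -9: the job is stopped at once and never runs its cleanup ---
tmp2=$(mktemp)
(
    trap 'rm -f "$tmp2"; exit 0' TERM
    while :; do sleep 1; done
) &
pid2=$!
sleep 1
kill -9 "$pid2"
status=0
wait "$pid2" 2>/dev/null || status=$?     # nonzero: the job did not exit normally
if [ -e "$tmp2" ]; then leftover=yes; else leftover=no; fi
rm -f "$tmp2"                             # manual cleanup, as described above

echo "plain kill cleaned up: $cleaned"
echo "kill -9 left a temporary file: $leftover (wait status $status)"
```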
Running Multiple Jobs
Linux will allow you to put as many jobs as you want in the background, and it will try to work on them all at once. This means it is quite possible for a single user to run so many jobs that everyone else is "crowded out." If necessary SSCC staff will intervene to stop this. On the other hand, Condor handles multiple jobs very efficiently and has plenty of available capacity. So if you are planning on doing any resource intensive computing, you really should check out Condor.
The general rule on the interactive (non-Condor) Linstat servers is that you should only have one major job running at a time on each server. Text editors, email, etc. are not a problem, but Stata, SAS, SPSS, and most user-written programs are resource intensive and will affect others. Keep in mind that Linux will split the available CPU time among all the running jobs. So if you run three jobs simultaneously, they will each take three times as long to run, saving you no time but making much less CPU time available for others (the one exception to this would be if the server has an idle CPU, but you shouldn't count on this).
If you have multiple jobs to run, please read SSCC's CPU Usage Policy.
Condor is designed to process large numbers of jobs. For full details please see An Introduction to Condor, but the essence of Condor is that we have a pool of Linux servers which only run jobs submitted to them through the Condor program. Unlike standard Linux jobs, Condor jobs never interfere with each other, since each job gets exclusive use of a CPU. Thus if you submit your jobs to Condor, they will not slow down the server for anyone else (or be slowed down by anyone else).
The price is that it takes about 30 seconds for Condor to process a job and assign it to a machine. Thus if you are running a 20 second job and will be waiting for the results, it would be counterproductive to use Condor. But if you have many jobs to run, or a single big job, Condor is a great tool. It's not quite a panacea since it can only be used for Stata, R, MatLab, and most user-written C/C++ and FORTRAN code, but that covers the bulk of the computing done at the SSCC.
We have written several scripts which make submitting Stata jobs to Condor almost identical to running them as usual. The standard command for running a Stata do file in batch mode is stata -b do dofile (where dofile would be replaced by the name of the do file you want to run). To submit the job to Condor instead, simply replace stata with one of the following:
> condor_stata -b do dofile
condor_stata is the command you'll normally use. It will send your job to a multi-processor machine if one is available, but if not it will send your job to the first available machine.
If you want to run programs other than Stata using Condor, or want to submit many jobs at once, please see An Introduction to Condor.
Consider the following two scripts. Both run three SAS jobs. The one on the left will tie up the server it is run on; the one on the right will not, and it will execute in about the same amount of time:
|Bad Script||Good Script|
sas prog1 &
The bad script places all three jobs in the background, so they all run at the same time and compete for resources. The good script runs them in the foreground, so they will run one at a time. However you do not need to wait for them: simply run the script itself in the background and your shell will be available for other work.
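The "good" pattern can be sketched as follows, assuming a Bourne-style shell. Here run_job is a hypothetical stand-in for a SAS run (it only sleeps for a second), and the names prog2 and prog3 are made up for illustration.

```shell
#!/bin/sh
log=$(mktemp)
# Hypothetical stand-in for a SAS run such as "sas prog1".
run_job() {
    sleep 1
    echo "$1 done"
}

# The three jobs run one after another, so they never compete with each
# other, but the group as a whole is placed in the background with "&".
( run_job prog1; run_job prog2; run_job prog3 ) > "$log" &
batch_pid=$!
echo "shell is available while the batch runs"

wait "$batch_pid"
order=$(tr '\n' ' ' < "$log")
echo "$order"
rm -f "$log"
```

The log shows the jobs finishing in the order they were listed, which confirms that only one of them was running at any moment.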
Of course if you could use Condor those three SAS programs would be run on three different CPUs and thus execute in one third the time.
Running a Job Later
The at command allows you to run a job at a time you specify. For example, you could run a big, resource intensive job at 1:00 AM when no one is likely to be on. There are several ways to use at.
If you want to just type in the job you want to run later, type
> at time
and you can then enter the command(s) at the prompt (at>). When you are done, press CTRL-D. The time parameter will understand just about any reasonable format, including at 1:00, at 1:00am, at 1am, at 13:00 (1:00pm), at noon, at midnight, or at teatime (4:00pm). Note that if you do not specify am or pm, it is assumed you are using 24-hour time.
You can also put the commands you want executed in a file. To do this type:
> at time -f file
To list the jobs currently waiting to run, type:
To remove a job, type:
> atrm job
where job is an ID obtained by listing your jobs.
Note that if you submit your jobs to Condor, they will not affect other users and will get plenty of resources no matter when you run them.
The table below is a quick reference for the most common Linux commands. Following the link will take you to a more in-depth explanation of the command.
|Command Name||Command Description|
|at||run a job at a specified time|
|clear||clear the terminal screen|
|compress, uncompress||compress and expand file|
|condor_status||lists state of SSCC's Condor flock|
|cp||copy files and directories|
|df||report file system disk space usage|
|du||estimate file space usage|
|gzip, gunzip||compress and expand file|
|hostname||display name of computer logged into|
|jobs||display status of jobs in the current session|
|kill||terminate a job|
|ls||list directory contents|
|man||display the on-line help pages|
|mkdir||create a directory|
|more||display a file one screenful at a time|
|mv||move or rename files|
|ps||display job status|
|pwd||display present working directory|
|quota||display disk usage and limits|
|rm||remove (delete) files or directories|
|rmdir||remove (delete) directories|
|soft||list SSCC software availability|
|sscwho||list information about an SSCC member|
|ssh||remote login and remote execution of commands|
|top||display top CPU processes|
|uptime||tell how busy the system is|
Many resources are available to learn about the Linux operating system, both at SSCC and at your local book store. SSCC staff maintain numerous on-line Knowledge Base articles on Linux topics including the use of editors, such as EMACS and PICO, and use of statistical software like SAS, STATA, and SPSS.
SSCC also teaches mini-courses, ranging from one-hour courses, to classes that meet for half a day, or for an hour a week for several weeks. See BROADCAST, SSCCNEWS, or SSCC's training web pages for registration and other information about these courses.