How to run and manage jobs on LINCOMM
The main reason to use LINCOMM (Linux Community Servers) is to run jobs that take a long time. This guide shows you how to start a job, keep it running even if your connection drops, check on it, and stop it. For a quick list of the commands used here, see the LINCOMM command reference.
Prerequisites
- You are connected to a LINCOMM node. Your job runs on the node you start it on, so note which one with hostname. To manage the job later, reconnect to that same node.
- Comfort with starting and reattaching a tmux session. If that's new, work through Keep work running with tmux first.
Step 1: Start the job inside tmux
Run long jobs inside tmux so a dropped connection doesn't stop them. Start a session, then launch your program. For example, to run an R script:
tmux
Rscript analysis.R
If your connection drops, the job keeps running. Reconnect to the same node and run tmux attach to get back to it.
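If you run more than one job, a named session can make reattaching less ambiguous. The following is a minimal sketch, not an official LINCOMM tool: the helper name start_in_tmux and the session name analysis are hypothetical, and it assumes tmux is on your PATH.

```shell
# Hypothetical helper: start a command in a detached, named tmux session.
start_in_tmux() {
    session=$1
    shift
    if ! command -v tmux >/dev/null 2>&1; then
        echo "tmux is not installed on this machine" >&2
        return 127
    fi
    # -d: create the session without attaching; -s: give it a name.
    tmux new-session -d -s "$session" "$*"
}

# Example use on a LINCOMM node (the session name is illustrative):
#   start_in_tmux analysis Rscript analysis.R
#   tmux ls                   # list your sessions on this node
#   tmux attach -t analysis   # reattach after reconnecting
```

Because the session starts detached, you can check on it with tmux ls and attach only when you want to see the output.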
Step 2: Keep file-share access alive for long jobs
A job that reads or writes the AAE file share for a long time can hit an ACCESS DENIED error partway through, when your access quietly expires. Wrap the command in krenew to renew that access automatically while the job runs:
krenew -K 60 Rscript analysis.R
The -K 60 tells krenew to check every 60 minutes and renew your access before it expires. Use this for any job expected to run longer than a few minutes against the file share.
Step 3: Run in the foreground or background
A foreground job holds your prompt until it finishes — you wait and watch. To get your prompt back so you can do other things while the job runs, put it in the background by adding an ampersand (&):
stata -b do analysis.do &
When a background job finishes, a message appears in your session. Avoid starting a second heavy job while one is already running, because you share the node's processing time with everyone signed in.
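When you background a job from a script, it can help to record its PID and send its output to a log file. The sketch below illustrates one way to do that. It uses sleep as a stand-in for a real job so it runs anywhere, and the log path is a hypothetical choice.

```shell
# Stand-in for a long job. On LINCOMM this might be: stata -b do analysis.do
log="${TMPDIR:-/tmp}/analysis-$$.log"   # hypothetical log location

# Send output and errors to the log, then put the job in the background.
sleep 2 > "$log" 2>&1 &
job_pid=$!                              # $! holds the PID of the last background job
echo "Started job with PID $job_pid, logging to $log"

# ... do other work here while the job runs ...

# Block until the job finishes and collect its exit status.
wait "$job_pid"
status=$?
echo "Job $job_pid finished with exit status $status"
```

Saving the PID at launch means you do not have to search for it later when you want to monitor or stop the job.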
Step 4: Move a job between foreground and background
To move a running foreground job into the background, press Ctrl+Z to suspend it, then type bg to resume it in the background. Bring it back with fg. List the jobs in your session with jobs.
Step 5: Monitor your jobs
See the processes running in your current session:
ps
To see all of your jobs on the node, including ones from other sessions, filter the full list by your NetID (replace jdoe):
ps aux | grep jdoe
To watch live resource use, run top. Confirm your job is actually working by checking that its %CPU is above zero. top also shows how busy the node is overall. Press q to quit top.
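As an alternative to filtering with grep, ps can select processes by owner or by PID directly, which avoids matching the grep command itself. This is a sketch under two assumptions: your login name (as printed by id -un) is your NetID, and sleep stands in for a real job.

```shell
# Stand-in job so the sketch runs anywhere; on LINCOMM this would be your real job.
sleep 30 &
job_pid=$!

# Every process you own on this node, with CPU share and elapsed time.
ps -u "$(id -un)" -o pid,pcpu,etime,args

# Just the one job, looked up by PID (the trailing = suppresses column headers).
job_line=$(ps -p "$job_pid" -o pid=,pcpu=,etime=,args=)
echo "$job_line"

# Clean up the stand-in job.
kill "$job_pid"
```

The etime column shows how long the job has been running, which is useful when deciding whether a quiet job is stuck.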
Step 6: Stop a job
Each job has a process ID (PID), shown by ps and top. To stop a job, pass its PID to kill:
kill 1602
This politely asks the program to shut down and clean up. If it won't stop, force it with -9:
kill -9 1602
Verify it worked
After starting a job, ps aux | grep jdoe should list it, and top should show it using CPU. After a kill, the same commands should no longer show it.
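Stopping and verifying can be combined in a small helper. The sketch below sends the polite signal first, waits a grace period, and forces the job only if it is still running. The function name stop_job and the 10-second default are hypothetical. It runs in plain POSIX sh, so its variables are global.

```shell
# Hypothetical helper: stop_job PID [GRACE_SECONDS]
stop_job() {
    pid=$1
    grace=${2:-10}                            # seconds to wait before forcing

    # Ask politely. If the signal cannot be sent, the job is already gone.
    kill "$pid" 2>/dev/null || return 0

    while [ "$grace" -gt 0 ]; do
        # kill -0 sends no signal; it only tests whether the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done

    # Still running after the grace period: force it.
    kill -9 "$pid" 2>/dev/null
}

# Demonstration with a stand-in job.
sleep 60 &
demo_pid=$!
stop_job "$demo_pid" 5
if kill -0 "$demo_pid" 2>/dev/null; then
    echo "Job $demo_pid is still running"
else
    echo "Job $demo_pid has stopped"
fi
```

Trying the polite signal first gives the program a chance to clean up. A forced kill cannot be caught by the program, so it is kept as the last resort.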
If something went wrong
- Your job vanished after a disconnect: It wasn't inside tmux. Restart it under tmux (Step 1) so it survives next time.
- You can't find a job you started earlier: Check you're on the same node you started it on. Run hostname; a job on one node isn't visible from another.
- ACCESS DENIED partway through a long run: Your file-share access expired. Run the job under krenew (Step 2). To restore access right now, run kinit and enter your NetID password.
- The node feels slow: Processing time is shared among everyone signed in. See What's changed in LINCOMM 2.0 for how CPU and memory are shared.