Condor: Getting Started
What Condor is and how to use it.
Condor is a software system that creates a High-Throughput Computing environment. It effectively utilizes the computing power of workstations that communicate over a network. Condor can manage a dedicated cluster of workstations. Its power comes from the ability to effectively harness non-dedicated, preexisting resources under distributed ownership. (excerpt from the Condor documentation)
Why Use Condor at CAE?
- Policy: CAE does not allow users to run jobs in the background. If you have computation-intensive jobs that you want to run on CAE workstations for extended periods of time, then Condor is the only alternative that CAE offers.
- Convenience: With Condor you do not have to babysit your jobs or look for idle workstations. You can just submit jobs to Condor and leave. When a job completes, Condor will send you an email message.
- Throughput: There are literally thousands of machines available for running Condor jobs around campus. You may be able to submit dozens of jobs to Condor and then have all the jobs run overnight without worrying about any of the details.
- Checkpointing: In some circumstances Condor can checkpoint jobs as they run and migrate them from machine to machine. If you have jobs that will run for days or weeks, this is the best way to ensure that they are able to complete.
For very simple commands, have a look at the condor_run command, which wraps a simple shell command in a Condor job and returns when the job has completed. For more sophisticated uses of Condor, you will need to use condor_compile as a part of your job.
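As a rough illustration of the condor_run case, the sketch below hands one short command to Condor and waits for the result. The command `hostname` is only a placeholder chosen for this example, and the check for condor_run on the PATH is there so the snippet can be tried on a machine without Condor.

```shell
#!/bin/sh
# Illustrative only: "hostname" stands in for your own short command.
job_cmd="hostname"

if command -v condor_run >/dev/null 2>&1; then
    # Wraps the quoted command in a Condor job, waits for it to finish,
    # and writes the job's output to this terminal.
    condor_run "$job_cmd"
else
    # Lets the sketch run on a machine without Condor installed.
    echo "condor_run is not on PATH; skipping: condor_run \"$job_cmd\""
fi
```

Because condor_run returns only after the job has completed, it is best suited to short, one-off commands rather than long batches.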
A universe in Condor defines an execution environment. For most users at CAE this will be one of the following:
- Vanilla Universe: Programs that cannot be relinked. For example, Matlab or Fluent jobs fall into this category. There is no checkpointing for Vanilla Universe jobs, so they are simply restarted if they are migrated between machines. This is the simplest way to use HTCondor and the easiest method if you want to use compute resources at the UW beyond CAE.
- Standard Universe: If you write programs yourself, or compile from source code, and need checkpointing or remote I/O you may want to think about using the Standard Universe. Running in the Standard universe involves relinking the code with the Condor libraries using
N.B. The Standard Universe relies on code in HTCondor that has limited functionality, so it cannot use features such as the HTCondor Connection Broker and the Shared Port Daemon. As a result, Standard Universe jobs can run into trouble when a firewall is involved. In short, if you want to use compute resources beyond CAE, it is advisable to avoid the Standard Universe. There are other methods that can be used to obtain checkpointing and remote I/O.
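For jobs that stay within CAE and do use the Standard Universe, the relinking step might look like the sketch below. It assumes the condor_compile command mentioned earlier is installed and that a C compiler is available as `cc`; the file names `hello.c` and `hello.remote` are made up for this example.

```shell
#!/bin/sh
# hello.c and hello.remote are hypothetical names used only for this sketch.
cat > hello.c <<'EOF'
#include <stdio.h>

int main(void)
{
    printf("hello from Condor\n");
    return 0;
}
EOF

if command -v condor_compile >/dev/null 2>&1; then
    # Prefixing the usual compile command with condor_compile relinks the
    # program against the Condor libraries.
    condor_compile cc -o hello.remote hello.c
else
    # Lets the sketch run on a machine without Condor installed.
    echo "condor_compile is not on PATH; skipping the relink step"
fi
```

The submit description file for such a job would then set `universe = standard` and name the relinked program as its executable.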
This is the Condor manual section about submitting jobs.
Briefly, a job is submitted for execution to Condor using the condor_submit command. condor_submit takes as an argument the name of a file called a submit description file. This file contains commands and keywords that direct the queuing of jobs. In the submit description file, Condor finds everything it needs to know about the job. Items such as the name of the executable to run, the initial working directory, and command-line arguments to the program all go into the submit description file.
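A minimal sketch of this workflow for a Vanilla Universe job follows. The names `myjob.sh`, `myjob.sub`, and `run1`, and the argument `10`, are placeholders invented for this example; the script writes a small stand-in executable and a submit description file, then submits the file only if condor_submit is available.

```shell
#!/bin/sh
# Hypothetical names: myjob.sh, myjob.sub, and run1 are placeholders.
mkdir -p run1

# A tiny stand-in executable so the sketch is self-contained.
cat > myjob.sh <<'EOF'
#!/bin/sh
echo "running with argument: $1"
EOF
chmod +x myjob.sh

# The submit description file: which program to run, its arguments,
# the initial working directory, and where output and logs should go.
cat > myjob.sub <<'EOF'
# Lines starting with '#' are comments in a submit description file.
universe     = vanilla
executable   = myjob.sh
arguments    = 10
initialdir   = run1
output       = myjob.out
error        = myjob.err
log          = myjob.log
# Ask for an email message when the job completes.
notification = Complete
queue
EOF

if command -v condor_submit >/dev/null 2>&1; then
    # Hand the submit description file to Condor.
    condor_submit myjob.sub
else
    # Lets the sketch run on a machine without Condor installed.
    echo "condor_submit is not on PATH; wrote myjob.sub but did not submit it"
fi
```

The final `queue` line is what actually places a job in the queue; the lines before it only describe the job.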
You can view your job status using the condor_q command. See the link above for more information.
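A few common ways to call condor_q are sketched below. The job id `123.0` is a made-up placeholder; use the id that Condor reported when your job was submitted.

```shell
#!/bin/sh
me=$(id -un)       # your login name
job_id="123.0"     # hypothetical cluster.process id, for illustration only

if command -v condor_q >/dev/null 2>&1; then
    condor_q                    # jobs currently in the queue
    condor_q "$me"              # only jobs owned by this user
    condor_q -long "$job_id"    # every attribute Condor stores for one job
else
    # Lets the sketch run on a machine without Condor installed.
    echo "condor_q is not on PATH; nothing to query"
fi
```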
The Condor project has excellent online documentation.