SAIL Compute Cluster


The Stanford AI Lab (SAIL) cluster aggregates research compute nodes from various groups within the lab and controls them through a central batch queueing system that coordinates all jobs running on the cluster. Do not access the nodes directly: the scheduler allocates resources such as CPU, memory, and GPU exclusively to each job.

Once you have been granted access, you can submit, monitor, and cancel jobs from the headnode. Do not use the headnode for compute-intensive work; if you need a shell on a compute node, start an interactive job instead. Jobs can also be managed through the cluster's web-based dashboard at

You can use the cluster by starting batch jobs or interactive jobs. Interactive jobs give you access to a shell on one of the nodes, from which you can execute commands by hand, whereas batch jobs run from a given shell script in the background and automatically terminate when finished.

If you encounter any problems using the cluster, please describe your issue as specifically as you can and send us a request via

Usage Policy

To gain access to the cluster, submit a request stating (i) your CS login ID and (ii) the name of the professor you are working with, and cc the professor on the form. Requests are submitted via

If we have any trouble with your job, we will try to get in touch with you, but we reserve the right to kill your jobs at any time.

If you have questions about the cluster, send a request on

Job Submissions

Use of the cluster is coordinated by a batch queue scheduler, which assigns compute nodes to jobs in an order that depends on when each job was submitted, how many nodes it requests, and the availability of the requested resources (e.g., GPUs, memory).
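To see how the scheduler is treating your queued jobs, Slurm's `squeue` can report an estimated start time and the reason each pending job is still waiting. Below is a minimal sketch wrapping this in a shell function; the function name `pending_jobs` is hypothetical, and the start times are estimates that Slurm revises as other jobs finish or new ones arrive.

```shell
#!/bin/bash
# List your pending jobs with Slurm's estimated start times.
# Needs the Slurm client tools, so run it on the headnode.
pending_jobs() {
  if ! command -v squeue >/dev/null 2>&1; then
    echo "squeue not found: run this on the cluster headnode" >&2
    return 1
  fi
  # --start reports expected start times for pending jobs
  squeue --user="${USER:-$(id -un)}" --states=PENDING --start
}
```

Run `pending_jobs` after submitting; the last column of the output gives the pending reason (for example, waiting for resources or for higher-priority jobs).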

You can submit two kinds of jobs to the cluster: interactive and batch.


Generally speaking, interactive jobs are used for building, prototyping, and testing, while batch jobs are used for the longer runs that follow.

Batch Jobs

Batch jobs are the most common way to use the cluster, and are useful when the task does not require interacting with a shell. Two clear advantages are that your job is managed automatically after submission, and that placing your setup commands in a shell script lets you efficiently dispatch multiple similar jobs. To start a simple batch job on a partition (the group you work with; see the partition list at the bottom of this page), ssh into sc and type:


There are many parameters you can set depending on your requirements. For reference, see the sample submit script in /sailhome/software/
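As an illustration only (the sample script above remains the reference), a minimal batch script might look like the following. The `#SBATCH` options are standard sbatch options, but the partition name `mypartition`, the job name, the output file pattern, and all resource values are placeholders to adapt.

```shell
#!/bin/bash
# Minimal sbatch script (sketch). Submit with: sbatch myjob.sh
# All values below are placeholders; use a partition your group owns.
#SBATCH --partition=mypartition
#SBATCH --job-name=myjob
# %j in the output file name expands to the job ID
#SBATCH --output=myjob-%j.out
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# Slurm exports SLURM_* variables into the job; the defaults let the
# script also be dry-run with plain bash outside the scheduler.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Job ${SLURM_JOB_ID:-local} on $(uname -n) with ${OMP_NUM_THREADS} threads"

# Replace with your actual workload, e.g.:
# python train.py --epochs 10
```

Because sbatch reads the `#SBATCH` lines as options while bash treats them as comments, the same file can be tested locally with `bash myjob.sh` before submitting.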

For further documentation on submitting batch jobs with Slurm, see the online sbatch documentation from SchedMD.
Our friends at the Stanford Research Computing Center, who run the Sherlock cluster with Slurm, also have a wonderful write-up that largely applies to us too: Sherlock Cluster
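Because a batch job is just a shell script, one script can cover several similar runs. The sketch below uses Slurm job arrays (`--array`), where each array task gets its own `SLURM_ARRAY_TASK_ID`; the learning-rate values, the helper `pick_lr`, `train.py`, and the partition name are all hypothetical.

```shell
#!/bin/bash
# Job-array sketch: one submission launches 4 tasks (indices 0-3).
# Submit with: sbatch sweep.sh
#SBATCH --partition=mypartition
#SBATCH --job-name=sweep
# %A is the array's master job ID, %a the task index
#SBATCH --output=sweep-%A_%a.out
#SBATCH --array=0-3
#SBATCH --gres=gpu:1

# Hypothetical hyperparameter values, one per array task.
LEARNING_RATES=(0.1 0.03 0.01 0.003)

# Print the value for a task index (index 0 if none is given).
pick_lr() {
  local idx="${1:-0}"
  echo "${LEARNING_RATES[$idx]}"
}

LR="$(pick_lr "${SLURM_ARRAY_TASK_ID:-0}")"
echo "Task ${SLURM_ARRAY_TASK_ID:-0}: learning rate ${LR}"
# python train.py --lr "$LR"   # hypothetical training script
```

Looping over separate `sbatch` calls would also work, but an array keeps the runs grouped under one job ID, so `scancel` on that ID cancels every task at once.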

Interactive Jobs

Interactive jobs are useful for compiling and prototyping code intended to run on the cluster, performing one-time tasks, and executing software that requires runtime feedback. To start an interactive job, ssh into sc and type:

srun --partition=mypartition --pty bash

The above will allocate a node in mypartition and drop you into a bash shell. You can also add other parameters as necessary:

srun --partition=mypartition --nodelist=node1 --gres=gpu:1 --pty bash

The above will allocate node1 in mypartition with 1 GPU and drop you into a bash shell.
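Once the shell opens, environment variables set by Slurm tell you what you were given. This sketch assumes the GPU resources are configured so that Slurm sets `CUDA_VISIBLE_DEVICES` for the job, which is common on GPU clusters but worth confirming here; the helper name `show_alloc` is hypothetical.

```shell
#!/bin/bash
# Summarize the current Slurm allocation from environment variables.
# Outside a job these are unset, and "none" is printed instead.
show_alloc() {
  echo "job=${SLURM_JOB_ID:-none}"
  echo "node=${SLURMD_NODENAME:-none}"
  echo "gpus=${CUDA_VISIBLE_DEVICES:-none}"
}

show_alloc
# On a GPU node, nvidia-smi lists the devices visible to this job:
# nvidia-smi
```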

For further documentation on the srun command, see the online srun documentation from SchedMD.

Managing Jobs

You can view a list of all jobs running on the cluster by typing:


Or via the online dashboard at
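On the command line, Slurm's `squeue` lists queued and running jobs. Below is a sketch of a helper that shows only your own jobs, or every job in one partition, in a compact format; the function name `myjobs` and the column widths are arbitrary choices, while the `--format` field codes are standard squeue ones.

```shell
#!/bin/bash
# Compact job listing. Field codes: %i job ID, %P partition, %j name,
# %T state, %M elapsed time, %R node list or pending reason.
FMT="%.10i %.12P %.20j %.10T %.10M %R"

# Usage: myjobs              -> your own jobs
#        myjobs mypartition  -> all jobs in that partition
myjobs() {
  if ! command -v squeue >/dev/null 2>&1; then
    echo "squeue not found: run this on the cluster headnode" >&2
    return 1
  fi
  if [ -n "${1:-}" ]; then
    squeue --partition="$1" --format="$FMT"
  else
    squeue --user="${USER:-$(id -un)}" --format="$FMT"
  fi
}
```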

You can view detailed information for a specific job by typing:

scontrol show job jobid

Or click on the job in the online dashboard at
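In scripts it is often handy to pull a single field out of the `scontrol show job` output, which is printed as whitespace-separated `Key=Value` pairs. A minimal sketch that reads that text on standard input and prints one field's value; the helper name `job_field` is hypothetical, and it assumes the value itself contains no spaces (true for fields such as `JobState`).

```shell
#!/bin/bash
# Print the value of one Key=Value field from `scontrol show job` output.
# Usage: scontrol show job 12345 | job_field JobState
job_field() {
  local key="$1"
  # Put each token on its own line, keep the first "key=..." token,
  # and drop everything up to the first "=".
  tr -s ' \t' '\n' | grep -m 1 "^${key}=" | cut -d= -f2-
}
```

While a job is still in the queue, `squeue --noheader --jobs=12345 --format=%T` is a shorter way to get just its state.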

To cancel a job you started, type:

scancel jobid

For a good comparison of Torque/PBS commands with their Slurm equivalents, please head to
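For orientation, the handful of mappings most people need are sketched below as a lookup function. The function `slurm_equiv` is hypothetical and the list is deliberately short; the Slurm commands and variables are standard, but options do not translate one-to-one, so check the man pages when porting a script.

```shell
#!/bin/bash
# Look up the usual Slurm counterpart of a Torque/PBS command or variable.
slurm_equiv() {
  case "$1" in
    qsub)          echo "sbatch" ;;
    "qsub -I")     echo "srun --pty bash" ;;
    qstat)         echo "squeue" ;;
    qdel)          echo "scancel" ;;
    qhold)         echo "scontrol hold" ;;
    qrls)          echo "scontrol release" ;;
    pbsnodes)      echo "sinfo" ;;
    PBS_JOBID)     echo "SLURM_JOB_ID" ;;
    PBS_O_WORKDIR) echo "SLURM_SUBMIT_DIR" ;;
    PBS_ARRAYID)   echo "SLURM_ARRAY_TASK_ID" ;;
    *)             echo "no entry for: $1" >&2; return 1 ;;
  esac
}
```

For example, `slurm_equiv qdel` prints `scancel`.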


Storage

There are several storage options for the scail cluster:

Home directory: /sailhome/csid

All sc cluster nodes mount a common network volume for your home directory. This is a good place for submission scripts, job outputs, and similar files. There is a quota of 20GB per user.
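To see roughly how close you are to the 20GB home quota, you can total your home directory with `du`. The sketch below assumes the quota means 20 GiB (20 × 1024 × 1024 KiB) and that `du` approximates what the quota system counts; both are assumptions, and the function name `quota_pct` is hypothetical. On a network volume `du` may take a while.

```shell
#!/bin/bash
# Rough home-directory usage as a percentage of an assumed 20 GiB quota.
QUOTA_KIB=$((20 * 1024 * 1024))   # 20 GiB in KiB (assumption)

# Convert a usage figure in KiB into an integer percentage of the quota.
quota_pct() {
  echo $(( $1 * 100 / QUOTA_KIB ))
}

# du -sk prints "<KiB><TAB><path>"; keep only the number.
used_kib="$(du -sk "$HOME" 2>/dev/null | cut -f1)"
echo "home: ${used_kib:-0} KiB used, about $(quota_pct "${used_kib:-0}")% of quota"
```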

Scratch Storage via NFS

/scail/scratch and /scail/data - the old, general-purpose network filesystems shared across SAIL; to be deprecated soon.

/atlas - Prof. Stefano Ermon

/cvgl, /cvgl2 - Prof. Silvio Savarese

/deep - Prof. Andrew Ng

/vision - Prof. FeiFei Li and Juan Carlos Niebles

All NLP NFS filesystems, including /u/nlp, are automounted on sc and the NLP nodes - Prof. Chris Manning


Partitions

These are the partitions currently enabled on sc (the list will grow as we migrate more production GPU nodes). Please submit jobs only to partitions that your group owns.

tibet - tibet10-15 - each node has 4 K40 GPUs
napoli-cpu - napoli[1-7,9-16] - CPU-only nodes for CVGL
visionlab - visionlab[1-25] - CPU-only nodes for FeiFei/Juan Carlos; docker and docker-compose available.
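Node ranges such as napoli[1-7,9-16] use Slurm's bracketed hostlist notation: napoli1 through napoli7 plus napoli9 through napoli16. On the cluster, `scontrol show hostnames 'napoli[1-7,9-16]'` expands such a list. The pure-bash sketch below, with the hypothetical helper `expand_hostlist`, does the same only for the simple case of one bracket group without zero-padding, to illustrate the notation.

```shell
#!/bin/bash
# Expand a simple Slurm-style hostlist such as "napoli[1-3,5]".
# Sketch only: one bracket group, plain numbers and ranges, no padding.
# On the cluster, prefer: scontrol show hostnames '<hostlist>'
expand_hostlist() {
  local spec="$1"
  # No bracket group: the spec is already a single hostname.
  if [[ "$spec" != *\[*\] ]]; then
    echo "$spec"
    return
  fi
  local prefix="${spec%%\[*}"   # text before "["
  local inner="${spec#*\[}"     # text after "["
  inner="${inner%\]}"           # drop the trailing "]"
  local part names=()
  local -a parts
  IFS=',' read -r -a parts <<< "$inner"
  for part in "${parts[@]}"; do
    if [[ "$part" == *-* ]]; then
      # A range "lo-hi" expands to every number in between.
      local lo="${part%-*}" hi="${part#*-}" i
      for ((i = lo; i <= hi; i++)); do names+=("${prefix}${i}"); done
    else
      names+=("${prefix}${part}")
    fi
  done
  echo "${names[*]}"
}
```

For example, `expand_hostlist 'napoli[1-3,5]'` prints `napoli1 napoli2 napoli3 napoli5`.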