Interactive Apps with Open OnDemand

Open OnDemand’s File Explorer, the FastX Web interface, and various command-line interfaces can be used to prepare work for the cluster. This includes transferring and editing files, looking at output, and so forth. However, all production work must be run on the compute nodes, not on the frontends.

A large, multi-user system like UVA’s HPC cluster must be managed by some form of resource manager to ensure equitable access for all users. Research Computing uses the Slurm resource manager. Resource managers are also often called queueing systems. Users submit jobs to the queueing system. A process called a scheduler examines the resource requests in each job and assigns a priority. The job then waits in a queue, which Slurm calls a partition, until the requested resources become available. A partition is a set of compute nodes with a particular set of resources and limits. There are partitions for single-node jobs, multiple-node jobs, GPU jobs, and some other dedicated partitions. The different queues and their resources are listed on the Research Computing website.
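As a rough sketch, the resource requests described above map onto Slurm directives like the following. All of the values here (partition, time, memory, allocation name) are invented placeholders for illustration; UVA’s actual partition names and limits are on the Research Computing website.

```shell
#!/bin/bash
# Hypothetical Slurm resource request; every value below is a
# placeholder, not necessarily a real UVA setting.
#SBATCH --partition=standard      # which queue (partition) to wait in
#SBATCH --nodes=1                 # single-node job
#SBATCH --ntasks=1                # one task (process)
#SBATCH --time=01:00:00           # hard wall-time limit
#SBATCH --mem=8G                  # memory for the job
#SBATCH --account=my-allocation   # allocation the job is charged to
```

The scheduler reads a request like this, assigns the job a priority, and holds it in the chosen partition until the resources free up.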

Open OnDemand offers an easy way to run interactive jobs. With an interactive job, you are logged in directly to a compute node and can work as if it were a frontend. Please keep in mind that an interactive job terminates when the time limit you specify expires, unless you end the session sooner.
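For comparison, the same kind of interactive request can be made from a terminal with Slurm’s `salloc`. This is a minimal sketch that only assembles and prints the command; the partition, time, core, memory, and allocation values are assumptions, not UVA’s actual settings.

```shell
# Assemble a hypothetical interactive-job request. The flags are
# standard Slurm (-p partition, -t time, -c cores, -A account),
# but the values are placeholders.
PARTITION=interactive
TIME=01:00:00            # hard limit: the session ends when this expires
CORES=1
MEM=6G
ALLOCATION=my-allocation
CMD="salloc -p $PARTITION -t $TIME -c $CORES --mem=$MEM -A $ALLOCATION"
echo "$CMD"              # on a login node, you would run this command itself
```

On the real cluster, running the printed command would drop you into a shell on a compute node, much like connecting to an Open OnDemand session.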

Interactive Apps Video Transcript

Narrator: Hello and welcome back to the University of Virginia’s High Performance Computing tutorial series. In this module, we will be covering the various interactive apps that you can access through Open OnDemand. These are all GUI apps like JupyterLab and RStudio Server that you can run directly on a compute node rather than a login node, allowing you to do computational work interactively. The login nodes are more for setup or pre-production work, whereas the interactive use of the compute nodes is for compute jobs. That’s where these interactive apps fit in. First, it’s important to point out how the resources on the cluster are managed. Getting access to the compute nodes is not as simple as just logging into a login node and starting to run work. You have to go through our resource manager, called Slurm. You can think of it as a queuing system. When you submit a job, you’ll specify the resources you need, such as the number of nodes, core count, memory, and time. All of that gets bundled into a request that is then sent to Slurm. A process within Slurm called the scheduler looks at all the resources you’ve requested and assigns your job some amount of priority. The priority determines where your job will sit within the queue. If the resources you need are immediately available and there aren’t many other jobs waiting, your job might start immediately. However, if you’re requesting a larger amount of resources or the queue is very busy at that time, your job might have to wait. Slurm will use the job’s priority to place it in a queue, starting your job once it’s next in line and the resources are ready. When you submit a job, you’re submitting it to one of Slurm’s different partitions. Partitions are different chunks of compute nodes with various limitations, each meant for different kinds of jobs.
For example, there is the standard partition where you’re only allowed to run single node jobs, the parallel partition where you can request up to 64 nodes, the GPU partition for GPU jobs, and other dedicated partitions with specific purposes. A full list of all the different queues and resources is also available on our website. To use the interactive apps, you’ll use the Interactive Apps menu dropdown. This drop-down lists all the different apps we offer. The main ones this video will focus on are JupyterLab, RStudio Server, the desktop app, and Matlab. But all the other apps have similar setup processes. I’ll start by requesting a JupyterLab session. When I click on JupyterLab, it takes me to a different page with a form to fill out, where I can specify the resources I want to request for my interactive job. The first thing you’ll be asked is which partition you want to run on. You can select between interactive, standard, GPU, or any other partitions you may have access to. After that, you will specify the amount of time you want for your session. You can use the slider or input the number you need. Keep in mind that this time limit is a hard limit. For example, if you set the session for an hour, once the hour has passed, Slurm will cut any ongoing processes and you’ll be disconnected from the session. You’ll have to start a new one to continue working on a compute node. This time limit comes without any warnings, so make sure you’re requesting enough time and maybe add an extra hour if you think you might need it. Next, there’s the number of cores, which is only relevant if you’re running code that can take advantage of multiple cores. For example, if your code is multi-threaded, you’ll want to request more cores. If you’re unsure or if your code doesn’t require multiple cores, stick with one core. Then there’s the memory request. This is where you’ll request more RAM if your job needs it. It can be difficult to determine how much memory you need beforehand. 
A general rule of thumb is to request about two to three times the amount of RAM as the amount of data you’re working with in gigabytes. For example, if you’re working with a 10 gigabyte data file, you might want to request between 20 and 30 gigabytes of memory. This is just for loading the data. Processing the data might require more. It can be a bit of a guessing game. Next is the working directory. For JupyterLab, this will be the folder from which you can open notebooks. I’ll select home here. Then there’s the dropdown for your allocation. It should be auto-filled with one already, but you can click on it and select between the different allocations you’re a part of if you’re a member of multiple. This allocation is where the service units you’re using get billed to. In the interactive partition, there are a couple of nodes that have GPUs in them. You can request these using the optional dropdown, and you can request up to two GPUs. When switching to the GPU partition, this option is replaced with GPU type. If you select the default option, it will give you whichever GPU is next available without any preference. If you click the dropdown, you can request a specific GPU like the A100, V100, A40, etc. You can request up to four GPUs using the number input. Under Show Additional Options, if you select Yes, there is a field where you can add any other Slurm options. There are many different Slurm options you can use. I recommend looking at Slurm’s documentation or the Slurm page on our website. For example, if you have a reservation for a class, you can use --reservation= followed by whatever your professor gives you. If you want to exclude certain nodes, you can use the --exclude option and specify the nodes you don’t want your job to run on. These are just a couple of examples. If you’re a member of over 16 groups, there are some permission issues that can pop up because the system only allows access to 16 groups.
If you need access to a storage share or certain software, and you are in more than 16 groups, you might need to specify a group here. The last checkbox allows you to receive an email when your session starts. If you’re waiting a long time and want to know when you can start working, this could be a good option to select. This form saves your preferences so that you don’t have to fill everything out again in the future. Once you’re done filling out the form, click the Launch button to submit your job to Slurm. It will then take you to the My Interactive Sessions page. As you can see, my job is queued up. I have requested an hour and there are some session details here if I click on it. The resources I’ve requested, one hour on one core, make for a very small request. Generally, you should expect a job like that to start almost immediately because there is usually one core available somewhere on the system, especially in the interactive or standard partition. If, however, I’d requested a full node, like all 40 or 96 cores available on a node, for the maximum amount of time on that partition, that job would probably be queued for a longer period of time, because the resources might not be immediately available. Additionally, a bigger job will have lower priority than a smaller request like this one. Generally, smaller requests mean faster start times, while larger requests mean longer wait times. Once your job starts, you’ll see the status change from Queued to Running. Once the resources are ready, the time requested switches to time remaining. To actually connect to the session, I’ll click Connect to Jupyter, which will open another tab. As you can see, I’ve got JupyterLab open, where I can run notebooks and code on the compute nodes, rather than the login nodes. There are a couple of different tiles here, like a base Python 3 tile for running Python 3.11 code.
There are pre-built tiles for popular packages like PyTorch and TensorFlow, a Rapids tile, and an R tile. You also have the ability to create custom tiles. If you scroll further down, you can open a terminal using this button. This is useful if you need to pip install or conda install any packages. On the left is the file system, specifically my home directory. You can’t navigate further up to open files directly from scratch or project storage, but the session does have access to the full file system. So my code can still read or modify data in my scratch folder. I just can’t open a Jupyter notebook from my scratch folder through the file browser. The session is tied to my browser window. If I exit out of it, I can always reconnect by clicking Connect, which will reopen the session. However, JupyterLab is sensitive to the browser and the local internet connection. If I have code running and exit the browser, the code will stop. Similarly, if my Wi-Fi cuts out, the code will also stop. This limitation is specific to JupyterLab and doesn’t affect other interactive apps. Once I’m done with my session, I can close the browser and end my session by clicking the big red Delete button. It’s best practice to end any sessions early when you’re done, instead of letting them sit idle. We only charge SUs based on time used, not the time requested. So if I end the session early, I’ll only be charged for time used, which can matter for your allocation. Additionally, if I don’t delete my session, the resources will sit idle, which isn’t ideal. We want unused resources to be available for others. To delete the session, I’ll click Delete, confirm the action, and then my session will be successfully deleted. Another app available on Open OnDemand is RStudio. Starting an RStudio session is similar to JupyterLab, but the form is slightly different. You’ll be prompted to select an R version. In addition, there is no option for a working directory, since you can navigate to the whole file system from RStudio.
All other options in the form are the same. This is what RStudio itself looks like. You’ve got the console on the left and your file system on the right. RStudio can continue any active processes even if your connection drops or you close out of the browser. If you exit out and want to continue the session or check if the code is still running, you can always relaunch the session using the button that will pop up. Starting up the other interactive apps is similar. So here’s a quick overview of some of them and what they’re useful for. One of the most useful apps is the Desktop interactive app. This will open a desktop that is identical to the FastX one, but instead of being on a login node it will run on a compute node. However, because it runs on a compute node, the desktop app should not be used to submit other jobs. The desktop app can be useful for users who want to use GUI-based software. You can use the desktop app to run computationally intensive work on a compute node rather than using FastX on the login node. Additionally, if you have a large amount of data to download from cloud services like Google Drive, UVA Box, or Microsoft OneDrive, you might find better performance on a compute node rather than a login node. You can start these downloads and then leave your computer for a bit. The downloads will continue even if you close the browser or turn off your computer. The desktop app has this functionality because it doesn’t rely on your local internet connection or the browser being open. This is a great way to run something on the cluster in the background and come back to it later. Another available app is Matlab. Similar to RStudio, you will request which version of Matlab you want to use when launching the session. Launching it will open the Matlab desktop app within a desktop environment, but closing the Matlab app will exit the session.
If your files are not visible on the desktop, you can access them from the Places menu or the Caja app, which is the filing cabinet icon at the top of the screen. Both Matlab and the desktop app persist if your network is disconnected or if your browser closes. You can always relaunch those sessions to reconnect and your code will still be running. Finally, if you need help with UVA’s HPC system, there are multiple ways to get assistance. You can visit our Zoom-based office hours sessions on Tuesdays from 3-5pm and Thursdays from 10am to noon. If you can’t make it to these office hours or have a more specific request, you can submit a support ticket and we’ll get back to you. Links to both are on the RC Learning website. The main Research Computing website is also a valuable resource. We add a lot of documentation here and keep it updated. If you have a basic question or think it might be covered already, we have an FAQ section. We also have a list of how-tos on various topics. If you can’t find what you’re looking for, you can use our search feature to search the site. This concludes the Open OnDemand Interactive Apps tutorial at the University of Virginia. Thank you.
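As a quick reference for the extra Slurm options mentioned in the transcript, they are written like this on the command line or in the additional-options field. The reservation name and node IDs below are invented placeholders; your instructor or the Slurm documentation would supply real values.

```shell
# Hypothetical extra Slurm options. The reservation name and the node
# list are made-up examples, not real UVA values.
RESERVATION_OPT="--reservation=demo-class"     # from your professor
EXCLUDE_OPT="--exclude=node001,node002"        # nodes to avoid
EXTRA_OPTS="$RESERVATION_OPT $EXCLUDE_OPT"
echo "$EXTRA_OPTS"
```

The same strings can be passed directly to `sbatch` or `salloc` when submitting from a terminal instead of through the Open OnDemand form.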

© 2025 The Rector and Visitors of the University of Virginia