JupyterLab and HTCondor with SLATE
Note: This blog post was first published on Jun 11, 2020 and last updated on Apr 04, 2023.
JupyterLab is a great tool for data analysis, visualization, machine learning, and much more. It lets users run code interactively via its web notebook interface and iterate on changes quickly. Often users need to scale up their work, which requires submitting jobs from the notebook to a backend cluster. We show how this can be done with HTCondor using SLATE.
First, we’ll install a test HTCondor pool. Then we’ll deploy a JupyterLab instance and submit jobs to this pool from the notebook, both from the Unix command line (within the notebook) and with the HTCondor Python bindings. We assume you have a SLATE account, the SLATE client installed on your laptop (cf. the SLATE quickstart), and access to a SLATE-registered Kubernetes cluster.
Deploy an HTCondor Pool
First, get the configuration template of the central manager:
$ slate app get-conf condor-manager > manager.conf
Edit the file by adding an instance name of your choice:
Instance: 'blogpostdemo'
Then install the central manager with the SLATE command below, replacing <your-group> with your SLATE group name and <a-cluster> with the name of the target cluster you want to use for your deployment:
$ slate app install condor-manager --group <your-group> --cluster <a-cluster> --conf manager.conf
Successfully installed application condor-manager as instance your-group-condor-manager with ID instance_1sGae98se
Note: If deployment fails due to an instance name that’s already been chosen by another user, please choose a different instance name and try running the above command again.
Once the application is successfully installed, SLATE will give you an instance ID, in our example instance_1sGae98se, which you will need in the following steps. Next we need to learn the address of, and the tokens issued by, the deployed application instance. Inspect the instance’s info:
$ slate instance info instance_1sGae98se
Name Started Group Cluster ID
condor-manager-blogpostdemo 2020-Apr-27 20:25:14.445249 UTC <your-group> <a-cluster> instance_1sGae98se
Services:
Name Cluster IP External IP Ports URL
condor-manager-blogpostdemo 10.96.78.245 155.12.34.140 9618:32384/TCP 155.12.34.140:32384
...
Pods:
condor-manager-blogpostdemo-768c87f4d4-z66k7
Status: Running
Created: 2020-04-27T20:25:26Z
Host: ...
Host IP: ...
Conditions: [2020-04-27T20:25:26Z] Initialized
[2020-04-27T20:25:26Z] PodScheduled
[2020-04-27T20:25:28Z] Ready
[2020-04-27T20:25:28Z] ContainersReady
Events:
Scheduled: Successfully assigned slate-group-slate-dev/condor-manager...
...
[2020-04-27T20:25:27Z] Created: Created container condor-manager
[2020-04-27T20:25:27Z] Started: Started container condor-manager
Containers:
condor-manager
State: running since 2020-04-27T20:25:27Z
Ready: true
Restarts: 0
Image: slateci/condor-manager:latest
Configuration:
...
Note the external IP address and port number, in this case 155.12.34.140 and 32384. You should confirm that the above output shows that the condor-manager container is running and ready before proceeding to the next steps.
Now, check the application’s log to learn the tokens issued for the other condor components:
$ slate instance logs instance_1sGae98se
Fetching instance logs...
...
========================================
Pod: condor-manager-blogpostdemo-669f7b858-tdkbn Container: condor-manager
2020-04-22 02:54:10,760 INFO supervisord started with pid 1
2020-04-22 02:54:11,762 INFO spawned: 'generate_tokens' with pid 8
2020-04-22 02:54:11,763 INFO spawned: 'htcondor' with pid 9
2020-04-22 02:54:11,764 INFO spawned: 'crond' with pid 10
generate_tokens: Waiting for condor collector to become available
2020-04-22 02:54:11,769 INFO success: generate_tokens entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2020-04-22 02:54:12,823 INFO success: htcondor entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-04-22 02:54:12,823 INFO success: crond entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
**** Condor Submit Token ****
eyJhbGciOiJIUzI1NiIsImtpZCI6IlBPT0wifQ.eyJpYXQiOjE1ODc1MjQwNTQsImlzcyI6ImNvbmRvci1tYW5hZ2VyLWJsb2dwb3N0ZGVtby02NjlmN2I4NTgtdGRrYm4iLCJzdWIiOiJzdWJtaXRAcG9vbCJ9.I5qvyMdtRvFunqNvX5-foq2Pq7xp1zsJJAuEtgrMN0w
**** Condor Worker Token ****
eyJhbGciOiJIUzI1NiIsImtpZCI6IlBPT0wifQ.eyJpYXQiOjE1ODc1MjQwNTQsImlzcyI6ImNvbmRvci1tYW5hZ2VyLWJsb2dwb3N0ZGVtby02NjlmN2I4NTgtdGRrYm4iLCJzdWIiOiJ3b3JrZXJAcG9vbCJ9.jzWdHy3NRwfedViPSCUR1jfwkucFzanwL9IiX2ypPL8
Copy the one-line token after **** Condor Submit Token ****, paste it into a file (such as submit-token), and add one trailing newline character (\n) to it. For example:
$ echo "eyJhbGciOiJIUzI1NiIsImtpZCI6IlBPT0wifQ.eyJpYXQiOjE1ODc1MjQwNTQsImlzcyI6ImNvbmRvci1tYW5hZ2VyLWJsb2dwb3N0ZGVtby02NjlmN2I4NTgtdGRrYm4iLCJzdWIiOiJzdWJtaXRAcG9vbCJ9.I5qvyMdtRvFunqNvX5-foq2Pq7xp1zsJJAuEtgrMN0w" > submit-token
Now repeat the same process for the other token: copy the one-line token after **** Condor Worker Token ****, paste it into another new file (e.g. worker-token), and add one trailing newline character (\n) to it.
$ echo "eyJhbGciOiJIUzI1NiIsImtpZCI6IlBPT0wifQ.eyJpYXQiOjE1ODc1MjQwNTQsImlzcyI6ImNvbmRvci1tYW5hZ2VyLWJsb2dwb3N0ZGVtby02NjlmN2I4NTgtdGRrYm4iLCJzdWIiOiJ3b3JrZXJAcG9vbCJ9.jzWdHy3NRwfedViPSCUR1jfwkucFzanwL9IiX2ypPL8" > worker-token
Then, create new SLATE secrets using the two files from above:
$ slate secret create submit-auth-token --group <your-group> --cluster <a-cluster> --from-file condor_token=submit-token
Successfully created secret submit-auth-token with ID secret_dHiGnjAgR2A
$ slate secret create worker-auth-token --group <your-group> --cluster <a-cluster> --from-file condor_token=worker-token
Successfully created secret worker-auth-token with ID secret_Hhjy43uyNsP
You’re now ready to set up a worker node. Download the base configurations for the condor-worker application:
$ slate app get-conf condor-worker > worker.conf
Edit the application’s configuration to use a name of your choice as the Instance, the manager’s external IP address as the CollectorHost, the manager’s external port as the CollectorPort, and the worker secret, in our example “worker-auth-token”, as the AuthTokenSecret. In our example, the configuration will be:
Instance: 'blogpostdemo'
CondorConfig:
Instances: 1
CollectorHost: 155.12.34.140
CollectorPort: 32384
AuthTokenSecret: worker-auth-token
You can also customize the number of instances, the number of CPU cores requested per instance, the RAM requested per instance, and so on. Once you’re done customizing the worker application, install it:
$ slate app install condor-worker --group <your-group> --cluster <a-cluster> --conf worker.conf
Successfully installed application condor-worker as instance your-group-condor-worker with ID instance_nsWh3hNs2Gb
Deploy JupyterLab
Download the base configurations:
$ slate app get-conf jupyterlab > jupyter.conf
Generate a random token:
$ openssl rand -base64 32
mO6KJvhomZ733r/UUW6i1VXuuWgXV/gVN3VrXOgNwEg=
Edit the JupyterLab application configuration file, in this case jupyter.conf, so that it has an Instance name and NB_USER of your choice, and a Token set to the token you just generated in the previous step. In our example, those values are:
Instance: 'blogpostjupyter'
Jupyter:
NB_USER: 'slate'
Token: 'mO6KJvhomZ733r/UUW6i1VXuuWgXV/gVN3VrXOgNwEg='
Choose a subdomain for the ingress (this will be used in the application’s URL):
Ingress:
Subdomain: 'blogpostnotebook'
Update the CondorConfig to be enabled, and use the manager’s external IP address as the CollectorHost, the manager’s external port as the CollectorPort, and the submit secret, in our example “submit-auth-token”, as the AuthTokenSecret. In our example, the configuration will be:
CondorConfig:
Enabled: true
CollectorHost: 155.12.34.140
CollectorPort: 32384
AuthTokenSecret: submit-auth-token
The last change is for the SSH service. Enable the service and add the SSH public key you want to use to SSH into the JupyterLab instance:
SSH:
Enabled: true
SSH_Public_Key: 'ssh-rsa AAAAB3NzaC1yc2......i0pRTQgD5h1l+UvL/udO+IUYvvi slate'
You’re now ready to install the JupyterLab application on SLATE:
$ slate app install jupyterlab --group <your-group> --cluster <a-cluster> --conf jupyter.conf
Note: If deployment fails due to an instance name that’s already been chosen by another user, please choose a different instance name and try running the above command again.
Inspect the instance’s info to see the allocated URL and the address for the SSH service:
$ slate instance info <instance-ID>
Services:
Name Cluster IP External IP Ports URL
your-group-jupyterlab 10.96.150.245 <IP-address> 8888:30712/TCP http://blogpostnotebook.slate-dev.slateci.net/
your-group-jupyterlab 10.96.150.245 <IP-address> 22:30033/TCP <ip-address>:<port-number>
In the above example, the JupyterLab application can be accessed at http://blogpostnotebook.slate-dev.slateci.net/ using the token you generated above. The second line shows the SSH access info under URL, which you can use with the ssh command like this:
ssh -p <port-number> <username>@<ip-address>
where <username> is what you chose above for the NB_USER configuration variable.
Testing
To test your deployed applications, log into your JupyterLab instance and submit a test job to the HTCondor pool from the terminal. First, create a file named job.sub:
$ nano job.sub
Copy the job description below into it and save the file:
Executable = /bin/echo
Arguments = Hello World
Universe = vanilla
Log = job.log
Output = job.out
Error = job.err
Queue
Then, submit the job:
$ condor_submit job.sub
Submitting job(s).
1 job(s) submitted to cluster 1.
A successful run will create the file job.out in your local directory, containing the “Hello World” message.
You can also use the HTCondor Python bindings instead of condor_submit to communicate with an HTCondor pool. For testing purposes, let’s use a simple job example from the HTCondor Python Bindings Tutorial.
Open a new Python notebook and import the two modules below:
import htcondor
import classad
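Before creating a job, it can help to confirm that the notebook can actually see the pool. The short sketch below is optional; it reuses the htcondor module imported above, assumes the CondorConfig set earlier points this JupyterLab instance at your manager, and simply asks the pool’s collector for the advertised worker (startd) ads:
# Query the pool's collector (its address comes from the instance's HTCondor
# configuration) and list the worker slots it currently advertises.
collector = htcondor.Collector()
startd_ads = collector.locateAll(htcondor.DaemonTypes.Startd)
print([ad.get("Name") for ad in startd_ads])
If this prints an empty list, the condor-worker instance may still be starting up or may not have authenticated to the manager yet.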
Create a Submit object for your job:
hostname_job = htcondor.Submit({
    "executable": "/bin/hostname",  # The program to run on the execute node
    "output": "hostname.out",       # Anything the job prints to standard output will end up in this file
    "error": "hostname.err",        # Anything the job prints to standard error will end up in this file
    "log": "hostname.log",          # This file will contain a record of what happened to the job
    "request_cpus": "1",            # How many CPU cores we want
    "request_memory": "128MB",      # How much memory we want
    "request_disk": "128MB",        # How much disk space we want
})
print(hostname_job)
Then, create a transaction and add your job to the transaction’s queue:
schedd = htcondor.Schedd()  # get the Python representation of the scheduler
with schedd.transaction() as txn:  # open a transaction, represented by `txn`
    cluster_id = hostname_job.queue(txn)  # queues one job in the current transaction; returns job's ClusterID

print(cluster_id)
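As an aside, newer versions of the Python bindings also offer Schedd.submit(), which accepts the Submit object directly and avoids the explicit transaction. A minimal sketch, assuming a recent enough htcondor package:
schedd = htcondor.Schedd()
submit_result = schedd.submit(hostname_job)  # returns a SubmitResult
print(submit_result.cluster())  # ClusterId of the newly queued job
Either way, you end up with the job’s ClusterId, which you can use to track the job.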
Once the job has run successfully, you should find the file hostname.out in your local working directory, containing the worker node’s hostname.
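If you want to check on the job from the notebook while it is still in the queue, one option (a sketch, not part of the original tutorial, and assuming a reasonably recent version of the bindings) is to query the scheduler using the schedd and cluster_id objects from the cells above:
# Look up the job we queued; an empty result usually means it has already
# completed and left the queue.
ads = schedd.query(
    constraint=f"ClusterId == {cluster_id}",
    projection=["ClusterId", "ProcId", "JobStatus"],
)
# JobStatus codes: 1 = Idle, 2 = Running, 4 = Completed, 5 = Held
for ad in ads:
    print(ad["ClusterId"], ad["ProcId"], ad["JobStatus"])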
Uninstall
If you need to uninstall an application you previously deployed on SLATE, run this command:
$ slate instance delete <instance-ID>
Summary
In summary, we successfully deployed a test HTCondor pool and a JupyterLab instance, and demonstrated job submission to the pool in two ways. The setup can easily be configured to work with any HTCondor pool, for example a production-scale high-throughput cluster running HTCondor, or the Open Science Grid.
Questions?
As always, we encourage you to try this out and let us know what’s working, what’s not, what can be improved and so on. For discussion, news and troubleshooting, the SLATE Slack workspace is the best place to reach us!