Deploying and Testing an OSG Hosted CE via SLATE
An OSG Compute Element (CE) is an application that allows a site to contribute HPC or HTC compute resources to the Open Science Grid. The CE is responsible for receiving jobs from the grid and routing them to your local cluster(s). Jobs from the Open Science Grid are preemptible, and can be configured to run only when resources would have otherwise been idle. Resource providers can use OSG to backfill their cluster(s) to efficiently utilize resources and contribute to the shared national cyberinfrastructure.
The simplest way to start contributing resources to the OSG, for many sites, is via the “Hosted” CE. In the hosted case, installation and setup of the Compute Element is done by the OSG team, usually on a machine outside of your cluster, and uses standard OpenSSH as a transport for submitting jobs to your resources. With SLATE, we have simplified Hosted CE installation and made a shared operations model possible. Now the Compute Element can be hosted on your Kubernetes infrastructure on-prem and cooperatively managed by OSG and your local team. In this article, we’ll go through the steps for connecting a SLURM cluster to the Open Science Grid with our newly stabilized OSG Hosted CE Helm Chart.
Table of Contents
- Generating and storing the key
- Configuring the CE from SLATE
- Finalizing the configuration
- Testing the HostedCE
- Next Steps
- Other resources and getting help
To install this application you will need a functional Kubernetes cluster with SLATE. More information can be found here.
You must have a functional batch system on a cluster with which you would like to support the OSG, and inbound SSH access. For more information on the requirements of the Hosted CE, please see here.
The OSG HostedCE uses SSH to submit pilot jobs that will connect back to the OSG central pool. In order for the HostedCE to work, you’ll first need to create at least one local service account for OSG to submit jobs. This should be done according to whatever process you normally use to create accounts for users.
Generating and storing the key
Once the account has been created, you’ll want to create a new SSH key pair. The private part of the key will be stored within SLATE, and the public part of the key will be installed into
authorized_keys file of the OSG user on your cluster. To generate the key, you’ll need to run the following on some machine with OpenSSH installed:
ssh-keygen -f osg-keypair
Note that you will need to make this key passphraseless, as the HostedCE software will consume this key. Once you’ve created the key, you’ll want to store the public part of it (osg-keypair.pub) into the
authorized_keys file on the OSG account for your cluster. For example, if your OSG service account is called
osg, you’ll want to append the contents of
The private part of the keypair will need to be stored on a SLATE cluster for use by the CE. In this particular example, I’ll be operating under the
slate-dev group and using the
uutah-prod cluster to host a CE pointed at a SLURM cluster at University of Utah. To do that:
slate secret create utah-lp-hostedce-privkey --from-file=bosco.key=osg-keypair --group slate-dev --cluster uutah-prod
utah-lp-hostedce-privkey will be the name of the secret, and
osg-keypair is the path to your private key (assumed to be the current working directory).
Configuring the CE from SLATE
You’ll want to download the application configuration template:
slate app get-conf --dev osg-hosted-ce > hosted-ce.yaml
There are several things that you’ll need to edit here.
First, for the site section you’ll want to put in appropriate values that will ultimately map onto your OSG Topology entry. Let’s take a look at this section first:
Site: Resource: SLATE_US_UUTAH_LONEPEAK ResourceGroup: CHPC Group Sponsor: osg:100 Contact: Mitchell Steinman ContactEmail: firstname.lastname@example.org City: Salt Lake City Country: United States Latitude: 40.7608 Longitude: -111.891
Sponsor is safe to leave as defaults if you plan to support OSG users with your CE. From the OSG Resource Registration Documentation, here’s the definition of Resource and ResourceGroup:
|Resource Group||A logical grouping of resources at a site. Production and testing resources must be placed into separate Resource Groups.|
|Resource||A host belonging to a resource group that provides grid services, e.g. Compute Elements, storage endpoints, or perfSonar hosts. A resource may provide more than one service.|
These will ultimately get mapped into the OSG Topology. Following those, you’ll need to update the Contact and location information as appropriate. The contact information will be used to reach you in case there are any problems with your CE or site.
Next we’ll go through the Cluster section, this section defines some of the hardware specifications on the remote side.
Memory should be the total per node memory available on the remote cluster. By default this is interpreted as megabytes but you can format this as
24G if you wish. You can tailor the amount of memory to the lowest common denominator if you have a remote cluster with different kinds of nodes. The same principle applies for the CoresPerNode field.
MaxWallTime is the maxmimum allowed walltime for the job and is expressed in minutes by default.
You can read more on the OSG Docs
VOs are virtual orginizations within the OSG and they correspond to different research groups. AllowedVOs allows us to specify which groups will run jobs through our CE.
Finally you’ll need to refer to the private key we stored in SLATE previously. I had called mine
utah-lp-hostedce-privkey, so my configuration ends up looking like this:
Cluster: PrivateKeySecret: utah-lp-hostedce-privkey # maps to SLATE secret Memory: 24576 CoresPerNode: 4 MaxWallTime: 4320 AllowedVOs: osg, cms, atlas, glow, hcc, fermilab, ligo, virgo, sdcc, sphenix, gluex, icecube, xenon
For the Storage section, you’ll want to define the path where the OSG Worker Node Client is installed as well as the location of temp/scratch space for your workers.
The OSG Worker Node Client is expected to be installed in the osguser’s home directory in a subdirectory called
bosco-osg-wn-client. You need to expand the absolute path of the user’s home directory. In my case, this is under a shared file system. The HostedCE SLATE application will install this Worker Node Client directory for you.
For temp, I have a specific scratch directory I want the OSG jobs to use.
/tmp is also a typical location for this.
Storage: GridDir: /uufs/chpc.utah.edu/common/home/osguserl/bosco-osg-wn-client WorkerNodeTemp: /scratch/local/.osgscratch
If you have a Squid service running, you can ensure that your workers will communicate with your local squid here. I will point my cluster to the local squid I have deployed with SLATE:
Squid: Location: sl-uu-es1.slateci.io:31726
(It’s also possible to launch a Squid service through SLATE, see here)
The HostedCE application requires both forward and reverse DNS resolution for its publicly routable IP. Most SLATE clusters come pre-configured with a handful of “LoadBalancer” IP addresses that can be allocated automatically to different applications. You must set-up the DNS records for this address so it is a good idea to request a specific address from the pool. This is my network configuration:
Networking: Hostname: "sl-uu-hce2.slateci.io" RequestIP: 184.108.40.206
It simply consists of a RequestIP and corresponding hostname.
The HTCondorCeConfig file contains additional configuration for the CE itself. Most importantly it contains the required JOB_ROUTER_ENTRIES section. This is the configuration that allows the CE to route jobs to your remote cluster. It has the following format:
JOB_ROUTER_ENTRIES @=jre [ GridResource = "batch <YOUR BATCH SYSTEM> <REMOTE USER>@<REMOTE ENDPOINT>"; Requirements = (Owner == "<REMOTE USER>"); ] @jre
My remote user is
osguserl and jobs will be routed to the
lonepeak1.chpc.utah.edu endpoint and the remote cluster is running SLURM. So my HTCondorCeConfig looks like this:
JOB_ROUTER_ENTRIES @=jre [ GridResource = "batch slurm email@example.com"; Requirements = (Owner == "osguserl"); ] @jre
There is more extensive documentation on the Job Router
The BoscoOverrides section provides a mechanism to override default configuration placed on the remote cluster for the OSG Worker Node Client. This can include things like the local path to the batch system’s executables and additional job submission parameters for the batch system.
This will vary depending on your batch system. I did need to override the slurm_binpath and add some SBATCH parameters. All the overrides are expected to be placed in a git repository with a subdirectory format that matches
<RESOURCE NAME>/bosco-override in my case
It may take some trial and error to get the correct overrides in place. The general proccess for this is to deploy the CE, then check the logs on the application’s HTTP Log exporter to see what must be changed. Finally re-dpeploy with the updated overrides.
There is a template bosco override repository that you can fork and tailor to your needs.
Once you’ve customized your fork, you can simply provide that repository as the
GitEndpoint for this section.
You can use a private git repo and provide the key to the application as a SLATE secret. My configuration with a public git repo is:
BoscoOverrides: Enabled: true GitEndpoint: "https://github.com/slateci/utah-bosco.git" RepoNeedsPrivKey: false GitKeySecret: none
This allows you to turn toggle HTTP logging side car. When it is enabled, it will allow you to view the CE logs from your browser.
HTTPLogger: Enabled: true
You can get the endpoint for your logger by running
slate instance info <INSTANCE ID>, and the randomly generated credentials will be written to the sidecar’s logs.
slate instance logs <INSTANCE ID>
Each VO that is enabled must be mapped to a user on the remote cluster. It is standard to create a user for each VO you intend to support. It is possible to map each VO to the same remote user, which I have done here:
VomsmapOverride: |+ "/osg/Role=NULL/Capability=NULL" osguserl "/GLOW/Role=htpc/Capability=NULL" osguserl "/hcc/Role=NULL/Capability=NULL" osguserl "/cms/*" osguserl "/fermilab/*" osguserl "/osg/ligo/Role=NULL/Capability=NULL" osguserl "/virgo/ligo/Role=NULL/Capability=NULL" osguserl "/sdcc/Role=NULL/Capability=NULL" osguserl "/sphenix/Role=NULL/Capability=NULL" osguserl "/atlas/*" osguserl "/Gluex/Role=NULL/Capability=NULL" osguserl "/dune/Role=pilot/Capability=NULL" osguserl "/icecube/Role=pilot/Capability=NULL" osguserl "/xenon.biggrid.nl/Role=NULL/Capability=NULL" osguserl
The GridmapOverride will allow you to add your own personal grid proxy to the CE. This is for the purpose of testing basic job submission.
You can obtain one with your institutional credential at cilogon.org
GridmapOverride: |+ "/DC=org/DC=cilogon/C=US/O=University of Utah/CN=HENRY STEINMAN A14364946" osguserl
Each time the CE is deployed it requests a new certificate from Lets Encrypt, which has rate limits to prevent DOS attacks. This means that if you are redeploying a CE frequently for troubleshooting purposes, you may experience the rate limit.
It is possible to save the certificate (hostkey.pem and hostcert.pem) and store these as a SLATE secret for re-use. This circumvents the rate limit.
I will leave it disabled, because my configuration is stable.
Certificate: Secret: null
Simply disable this. It is in place for the purpose of OSG Internal Testbed hosts, and is not intended for use with production CEs.
Developer: Enabled: false
Finalizing the configuration
Now that we’ve gone through the sections line-by-line, let’s look at our completed configuration:
Instance: "lonepeak" Site: Resource: SLATE_US_UUTAH_LONEPEAK ResourceGroup: CHPC Group Sponsor: osg:100 Contact: Mitchell Steinman ContactEmail: firstname.lastname@example.org City: Salt Lake City Country: United States Latitude: 40.7608 Longitude: -111.891 Cluster: PrivateKeySecret: utah-lp-hostedce-privkey # maps to SLATE secret Memory: 24000 CoresPerNode: 4 MaxWallTime: 4320 AllowedVOs: osg, cms, atlas, glow, hcc, fermilab, ligo, virgo, sdcc, sphenix, gluex, icecube, xenon Storage: GridDir: /uufs/chpc.utah.edu/common/home/osguserl/bosco-osg-wn-client WorkerNodeTemp: /scratch/local/.osgscratch Squid: Location: sl-uu-es1.slateci.io:31726 Networking: Hostname: "sl-uu-hce2.slateci.io" RequestIP: 220.127.116.11 HTCondorCeConfig: |+ JOB_ROUTER_ENTRIES @=jre [ GridResource = "batch slurm email@example.com"; Requirements = (Owner == "osguserl"); ] @jre BoscoOverrides: Enabled: true GitEndpoint: "https://github.com/slateci/utah-bosco.git" RepoNeedsPrivKey: false GitKeySecret: none HTTPLogger: Enabled: true VomsmapOverride: |+ "/osg/Role=NULL/Capability=NULL" osguserl "/GLOW/Role=htpc/Capability=NULL" osguserl "/hcc/Role=NULL/Capability=NULL" osguserl "/cms/*" osguserl "/fermilab/*" osguserl "/osg/ligo/Role=NULL/Capability=NULL" osguserl "/virgo/ligo/Role=NULL/Capability=NULL" osguserl "/sdcc/Role=NULL/Capability=NULL" osguserl "/sphenix/Role=NULL/Capability=NULL" osguserl "/atlas/*" osguserl "/Gluex/Role=NULL/Capability=NULL" osguserl "/dune/Role=pilot/Capability=NULL" osguserl "/icecube/Role=pilot/Capability=NULL" osguserl "/xenon.biggrid.nl/Role=NULL/Capability=NULL" osguserl GridmapOverride: |+ "/DC=org/DC=cilogon/C=US/O=University of Utah/CN=HENRY STEINMAN A14364946" osguserl Certificate: Secret: null Developer: Enabled: false
Before deploying the HostedCE, at this point you may want to test your private key and SSH endpoint before deploying the application to SLATE. You may also want to check that job submission is working as expected. For example, here I try to log in with the key (notice that it does not require a passphrase, this is a HostedCE requirement!), and I’m able to successfully run a job through my local batch system.
$ ssh -i osg-keypair firstname.lastname@example.org Last login: Mon Oct 14 12:51:23 2019 from cdf38.uchicago.edu [osguser@condor ~]$ condor_run /bin/hostname htcondor-river-v2-586998c97f-m4vkp
This is often the first place to start looking when your HostedCE isn’t working, so I would encourage you to test this appropriately at your site.
Once we’re happy with the configuration and we have tested basic access, we can ask SLATE to install it:
slate app install --cluster uchicago-prod --group slate-dev osg-hosted-ce --dev --conf hostedce.yaml
Testing the HostedCE
Obtain a Certificate
In order to test end-to-end job submission through the CE you will need a valid grid proxy and the condor tools. First you must obtain a certificate.
An easy way to do that is through CILogon.
Create a cert and download it. You’ll need to remember the password you set.
Convert the Certificate to PKCS12
Next you will need to convert it to PKCS12 format for voms. These commands will prompt for your password.
openssl pkcs12 -in usercred.p12 -nocerts -nodes -out hostkey.pem
openssl pkcs12 -in usercred.p12 -clcerts -nokeys -out hostcert.pem
Be sure that both files have the correct file permissions
chmod 600 hostkey.pem && chmod 600 hostcert.pem
Install HTCondorCE Client
You will need to install
htcondor-ce-client on the machine you would like to submit from.
On EL7 enable the OSG yum repos
yum install https://repo.opensciencegrid.org/osg/3.5/osg-3.5-el7-release-latest.rpm && yum update
Then install the tools
yum clean all; yum install htcondor-ce-client
Initialize Your Grid Proxy
You should be able to use your cert to initialize your grid proxy
voms-proxy-init -cert hostcert.pem -key hostkey.pem --debug
Here I use the
--debug flag because
voms-proxy-init won’t give us very helpful output, if it fails.
If that was successful you should be able to run a job trace against your CE, which will trace the end-to-end submission of a small test job.
Run the Trace
This command will output a great deal of helpful information about the job submission, it’s status and the eventual result. If the jobs sits idle on the remote cluster for too long, the command may time out.
Understanding the ouput
If authentication is successful you should see your job sbumit.
Testing HTCondor-CE authorization... Verified READ access for collector daemon at <18.104.22.168:9619?addrs=22.214.171.124-9619&noUDP&sock=collector> Verified WRITE access for scheduler daemon at <126.96.36.199:9619?addrs=188.8.131.52-9619&noUDP&sock=1326_1aab_3> Submitting job to schedd <184.108.40.206:9619?addrs=220.127.116.11-9619&noUDP&sock=1326_1aab_3> - Successful submission; cluster ID 3
This will be followed by the job’s ClassAd which is a large description of the job and its state. ClassAds are also used to describe other types of objects in the CE software.
After the ClassAd there should be some output describing the status of the job. The job will typically go form Held to Idle. It may stay idle for a long time waiting on available resource. Eventually the trace should report that the job was successful.
Spooling cluster 3 files to schedd <18.104.22.168:9619?addrs=22.214.171.124-9619&noUDP&sock=1326_1aab_3> - Successful spooling Job status: Held Job transitioned from Held to Idle Job transitioned from Idle to Completed - Job was successful
If the jobs stays idle for too long, the trace may time out. You can simply run it again.
You might see some output like this:
Testing HTCondor-CE authorization... Verified READ access for collector daemon at <126.96.36.199:9619?addrs=188.8.131.52-9619&noUDP&sock=collector> ******************************************************************************** 2020-01-24 08:58:41 ERROR: WRITE access failed for scheduler daemon at <184.108.40.206:9619?addrs=220.127.116.11-9619&noUDP&sock=1326_1aab_3>. Re-run with '--debug' for more information. ********************************************************************************
This error indicates that the CE is running and we can communicate with it, but it did not accept our credentials. This could be due to a number of reasons these are the most common:
1) Your grid proxy isn’t setup correctly
2) Your grid proxy is expired
3) You have not correctly added your identity to the CE GridmapOverride
You can use
voms-proxy-info to check the status of your grid proxy.
If the command hangs, this could indicate a connection problem or an issue with the CE.
You can run the command with the
--debug flag to see verbose output.
condor_ce_trace --debug sl-uu-hce2.slateci.io
Registering yourself as an OSG Contact
Head over to https://opensciencegrid.org/docs/common/registration/#registering-contacts for instructions on registering yourself as a contact with OSG.
This will be important for creating the resource record under OSG Topology, which is required for the accounting data.
It is possible to register an institutional contact such as a support list.
Registering the Resource in OSG Topology
Once you are registered as a contact with the OSG, you can head over to the contact database to grab your contact ID.
You will need this when you register the Resource(s) in OSG topology. Following the instructions available at:
Go to the OSG Topology GitHub repository and fork it. You will need to find your site within the repository, or create a new directory for it. Following the template fill in the details about your CE (primarily from the
Site section of the SLATE config).
Submit your updated fork as a Pull Request to the main OSG Topology repository.
Registering the CE with the OSG Factory
In order to recieve work from the OSG you must inform the factory operations team about your resource.
Send an e-mail to email@example.com with all of the details of your CE:
Other resources and getting help
The general HTCondor CE documentation is a good starting place for debugging:
If you would like OSG to set up a Hosted CE for you, you can follow the guide here:
You can also send an email to firstname.lastname@example.org with any questions or issues.
Last but not least, you can join the SLATE Slack and we can help with your issue, or find someone who can!