The rescaleqs.py multithreaded daemon script configures and submits XF simulations for remote execution via the Rescale high performance cloud computing platform. It also downloads simulation results and places them back in the appropriate location on the local filesystem upon completion, allowing XF users to submit simulations to Rescale and view the results without manually setting up jobs and compute clusters, monitoring simulations for completion, or retrieving their output.

The following sections detail how the script achieves this, as well as its major stages of job processing and important mechanisms. An understanding of this information enables system administrators to use the script and tailor it to the needs of their organization by either adjusting or expanding upon the script's existing process.

rescaleqs.py is provided in {xf-install-dir}/remcom/bin. External Queue Integration (EQI) must be configured before using rescaleqs.py.

Process Overview

rescaleqs processes each job through six main stages, each of which is handled by a separate thread of control within the script:

  1. Main Processing Thread: reads and processes the control file.
  2. Compressor Thread: compresses the simulation's input files.
  3. Uploader Thread: uploads the input files to Rescale.
  4. Instance Manager Thread: creates and submits the job on Rescale.
  5. Downloader Thread: downloads the output files from Rescale.
  6. Extractor Thread: extracts the output files to the appropriate location.

As each stage completes, control is passed to the next thread by changing the file extension of the simulation control file. Each thread controlling a stage of the process is configured to operate only on simulation control files with a specific file extension. For example, the thread handling the initial control file reading and processing looks for files with the extension *.xfsubmit, which are the control files initially written by EQI. When the thread finishes its task, it renames the control file using the *.readytocompress extension, which is what the thread handling simulation input file compression looks for before beginning its task, and so on.
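This handoff can be summarized with the following minimal sketch. The extensions match those described in this section, but the helper names and the idea of a lookup table are illustrative rather than taken from the script itself:

    import glob
    import os

    # Each stage's thread watches for control files with its own extension and,
    # when its work is done, renames the file so the next stage picks it up.
    STAGE_EXTENSIONS = {
        ".xfsubmit": ".readytocompress",        # main processing -> compressor
        ".readytocompress": ".readytoupload",   # compressor -> uploader
        ".readytoupload": ".readytosubmit",     # uploader -> instance manager
        ".readytosubmit": ".submitted",         # job created and submitted
        ".submitted": ".readytodownload",       # job completed on Rescale
        ".readytodownload": ".readytoextract",  # downloader -> extractor
        ".readytoextract": ".finished",         # extractor -> cleanup
    }

    def find_control_files(control_dir, extension):
        """Return the control files that a given stage should process."""
        return glob.glob(os.path.join(control_dir, "*" + extension))

    def hand_off(control_file, current_extension):
        """Rename a control file so the next stage's thread takes over."""
        base, _ = os.path.splitext(control_file)
        new_file = base + STAGE_EXTENSIONS[current_extension]
        os.rename(control_file, new_file)
        return new_file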

As a job progresses through the stages, additional information, such as file locations and the XF version needed to run the simulation, must be stored with the job so that it is accessible during later stages of processing. The script writes this information to the control file in the same format as the data initially written to that file by XF, that is, as simple, space-separated key-value pairs. All job data read from the control file, along with additional job data generated in subsequent stages, is stored in memory in a global Python dictionary (hash table) that is accessible to all threads. Threads attempt to retrieve any required job data from memory first; if it cannot be found, they recreate the job data object by re-reading the control file, which includes the additional job data generated during previous processing stages. In most cases, this allows the script to recover gracefully if it is interrupted or shut down during job processing.
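A sketch of how this space-separated key-value data might be read from and appended to a control file, with a global dictionary acting as the in-memory cache, is shown below. The function and variable names are illustrative; the actual JobData structure in the script may differ:

    import threading

    # Global cache of job data keyed by job ID (the control file name without
    # its extension); a lock guards access from the worker threads.
    _job_cache = {}
    _job_cache_lock = threading.Lock()

    def read_control_file(path):
        """Parse simple space-separated key-value pairs into a dictionary."""
        data = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                key, _, value = line.partition(" ")
                data[key] = value
        return data

    def append_to_control_file(path, key, value):
        """Persist newly generated job data so a later stage (or a restarted
        script) can recover it by re-reading the control file."""
        with open(path, "a") as f:
            f.write("%s %s\n" % (key, value))

    def get_job_data(job_id, control_file):
        """Return cached job data, or rebuild it from the control file."""
        with _job_cache_lock:
            if job_id not in _job_cache:
                _job_cache[job_id] = read_control_file(control_file)
            return _job_cache[job_id]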

In order to track the progress of a job queued with rescaleqs from within XF, the script writes to each simulation's project.clog file at various stages of processing. When each stage begins working on a new control file and has either retrieved or recreated the associated job data, it gets the path to the project.clog file and sets up a Python logging file handler so that log messages are written to that project.clog file in addition to standard output. When the stage is done with the job, the file handler is reset in preparation for the next job.
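The per-job logging setup might look roughly like the following sketch, assuming the project.clog path is stored in the job data; the logger name, key name, and log format are placeholders:

    import logging

    logger = logging.getLogger("rescaleqs")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.StreamHandler())  # messages also go to standard output

    def attach_clog_handler(job_data):
        """Mirror log messages into the simulation's project.clog so that
        progress is visible from within XF."""
        handler = logging.FileHandler(job_data["project_clog_path"])  # assumed key
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
        return handler

    def detach_clog_handler(handler):
        """Reset the handler when the stage finishes with this job."""
        logger.removeHandler(handler)
        handler.close()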

If an error is encountered at any point during job processing, whether due to a failure to connect to Rescale, a bad or unexpected response from the Rescale servers, or a local filesystem I/O or permissions error, the script aborts the job without either interrupting other jobs or exiting the script entirely. This is achieved by launching a separate thread that is dedicated to performing cleanup actions for the given job, removing the control file, and then terminating itself. Job cancellations and the cleanup of finished jobs are handled similarly.
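The abort-and-cleanup behavior could be sketched as follows; the actual cleanup steps performed by the script are not shown and are represented by a placeholder:

    import os
    import threading

    def _cleanup_job(job_id, control_file):
        """Perform per-job cleanup, remove the control file, then exit.

        Running this in its own thread keeps a failed, canceled, or finished
        job from blocking the processing of other jobs.
        """
        try:
            # ... job-specific cleanup (delete temporary files, release resources) ...
            pass
        finally:
            if os.path.exists(control_file):
                os.remove(control_file)

    def abort_job(job_id, control_file):
        """Launch a dedicated cleanup thread and return immediately."""
        t = threading.Thread(target=_cleanup_job, args=(job_id, control_file))
        t.daemon = True
        t.start()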

Main Processing Thread

This thread is responsible for three main functions:

First, the thread reads incoming *.xfsubmit files and stores their contents in a newly created JobData object. It then writes the eqi-jobid.dat file to the local simulation run directory. In the case of the rescaleqs script, EQI's job ID is the name of the control file without its extension, which is unique. Next, it reads the simulation's project.info file and adds its contents to the job data. The project.info file contains the XF version used to write the simulation. This information is then used, along with the list of XF versions available on Rescale's servers, to determine which XF solver version will execute the simulation on Rescale. This version is added to the job data, and the control file's extension is changed to *.readytocompress to hand control to the next processing stage.
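A rough sketch of these steps is shown below, reusing the read_control_file, append_to_control_file, and hand_off helpers sketched earlier. The key names and the version-selection rule are assumptions for illustration only; the script's actual selection logic may differ:

    import os

    def process_submit_file(control_file, available_xf_versions):
        """Read a *.xfsubmit file, write eqi-jobid.dat, and pick a solver version."""
        job_data = read_control_file(control_file)

        # EQI's job ID is the control file name without its extension.
        job_id = os.path.splitext(os.path.basename(control_file))[0]
        run_dir = job_data["run_directory"]          # assumed key name
        with open(os.path.join(run_dir, "eqi-jobid.dat"), "w") as f:
            f.write(job_id)

        # project.info records the XF version that wrote the simulation.
        job_data.update(read_control_file(os.path.join(run_dir, "project.info")))

        # Pick the matching XF solver version if Rescale offers it; otherwise
        # fall back to the newest available version (illustrative rule only).
        written_version = job_data["xf_version"]      # assumed key name
        if written_version in available_xf_versions:
            solver_version = written_version
        else:
            solver_version = available_xf_versions[-1]  # list assumed sorted

        job_data["solver_version"] = solver_version
        append_to_control_file(control_file, "solver_version", solver_version)
        hand_off(control_file, ".xfsubmit")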

Users can request cancellations for jobs being processed with rescaleqs in the same way as for other simulations queued through EQI, that is, by using the cancel external execution option in the XF simulations window. If a job has passed through the initial processing stage (specifically, if the script has written the eqi-jobid.dat file), XF can write a cancellation control file for the job. The thread looks for *.xfcancel control files that tell it which job to cancel. It then sets a flag on the job data object in memory indicating that the job should be canceled and cleanup actions performed. Other threads check for this flag when they begin processing a job and spin off a separate thread that completes the cleanup actions and then terminates itself.

Similar to job cancellation, a separate thread is launched for each *.finished control file. It is dedicated to performing cleanup actions and terminates itself once the cleanup is complete and it has removed the control file. If the DELETE_JOBS_AFTER_COMPLETION configurable option is set to True in the script configuration, the job and its output files are also deleted from the Rescale servers in addition to the standard cleanup actions. This option helps reduce the storage used on Rescale's servers, where storage beyond a certain capacity is only available for a fee.

Compressor Thread

This thread is responsible for compressing the required simulation input files for a job to a *.zip archive to be uploaded to Rescale.

The thread looks for *.readytocompress control files that have passed through the initial stage of processing and scans the simulation run directory to gather a list of input files and directories to add to the archive. Generally, this is everything in the run directory except the project.clog file and the output directory, if one exists from a previous execution of the simulation. It then constructs a 7-Zip command to compress the specified files into an archive placed within the run directory. The path to the archive is added to both the job data and the control file. A subprocess is launched to run the 7-Zip command, and after it completes successfully, the control file extension is changed to *.readytoupload.
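The compression step might be implemented roughly as follows; the archive name and exclusion list are illustrative, and the 7z executable is assumed to be on the PATH:

    import os
    import subprocess

    def compress_inputs(run_dir, job_id):
        """Archive everything in the run directory except project.clog and
        any existing output directory, and return the archive path."""
        archive_path = os.path.join(run_dir, job_id + ".zip")
        entries = [name for name in os.listdir(run_dir)
                   if name not in ("project.clog", "output")]

        # 7z a <archive> <entries...> creates (or adds to) a zip archive.
        cmd = ["7z", "a", "-tzip", archive_path] + entries
        subprocess.check_call(cmd, cwd=run_dir)
        return archive_path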

Uploader Thread

This thread is responsible for uploading the compressed simulation input files to Rescale.

The thread looks for *.readytoupload control files and retrieves the path to the compressed input *.zip archive from the job data. It then connects to Rescale and begins uploading the *.zip archive. Upon completion, Rescale returns a file identification string that references the file on Rescale. This file ID is added to both the job data and the control file, and the local copy of the *.zip archive is deleted because it is no longer needed. Finally, the control file extension changes to *.readytosubmit.
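A sketch of the upload step is shown below, assuming the Rescale REST API's file upload endpoint and token-based authentication. The exact endpoint, field names, and response format should be confirmed against Rescale's API documentation:

    import os
    import requests

    RESCALE_API = "https://platform.rescale.com/api/v2"

    def upload_input_archive(api_token, archive_path):
        """Upload the compressed input archive and return Rescale's file ID."""
        headers = {"Authorization": "Token " + api_token}
        with open(archive_path, "rb") as f:
            response = requests.post(
                RESCALE_API + "/files/contents/",
                headers=headers,
                files={"file": (os.path.basename(archive_path), f)},
            )
        response.raise_for_status()
        file_id = response.json()["id"]   # identifier of the uploaded file

        os.remove(archive_path)           # local copy is no longer needed
        return file_id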

Instance Manager Thread

This thread is responsible for creating jobs on Rescale, submitting them for execution, monitoring them for completion, and managing the lifetime of any active persistent compute clusters on Rescale.

The thread picks up control files with the *.readytosubmit extension and sets up a job on Rescale for each one. It uses the XF version number that will execute the simulation (as determined in the initial processing stage) to retrieve the list of compatible Rescale coretypes, or hardware configurations. It narrows this list further by keeping only the coretypes that are also listed in the ALLOWED_CORETYPES configurable option available in the script. Then, using the required graphics processing unit (GPU) memory estimate provided in the simulation's project.info file, as well as the data for each coretype retrieved from Rescale, it determines the total number of cores required to run the simulation on each eligible coretype.
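The filtering and core-count estimate could look roughly like the following sketch. The configuration values and coretype field names are placeholders, and the sizing rule (GPU memory per core) is an assumption for illustration:

    ALLOWED_CORETYPES = ["emerald", "onyx"]   # illustrative configuration values

    def eligible_coretypes(compatible_coretypes, required_gpu_memory_mb):
        """Filter Rescale coretypes and estimate the cores needed on each."""
        results = []
        for coretype in compatible_coretypes:
            if coretype["code"] not in ALLOWED_CORETYPES:
                continue
            # Assume each coretype reports how much GPU memory one core provides;
            # round up to the number of cores needed to cover the estimate.
            per_core = coretype["gpu_memory_mb_per_core"]    # assumed field
            cores = -(-required_gpu_memory_mb // per_core)   # ceiling division
            results.append((coretype, cores))
        return results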

If more than one coretype could run the simulation, the script chooses the coretype with the lowest overall cost by default. This is calculated as {required number of cores} * {price per core hour}. Cheaper coretypes are also generally slower, so users may want to account for this when deciding which coretype to use.
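Continuing the sketch above, the default cost comparison amounts to the following; the price field name is an assumption:

    def choose_cheapest(eligible):
        """Pick the (coretype, cores) pair with the lowest overall hourly cost,
        computed as required cores * price per core hour."""
        def hourly_cost(entry):
            coretype, cores = entry
            return cores * coretype["price_per_core_hour"]   # assumed field
        return min(eligible, key=hourly_cost)

    # Example: 16 cores at $0.30 per core hour ($4.80/hr) is chosen over
    #          8 cores at $0.75 per core hour ($6.00/hr).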

Once the script has determined the ideal coretype and number of cores for the job, it checks for any active persistent compute clusters available to run the job. Persistent clusters do not automatically shut down after executing a job; they typically run until they are either shut down manually through the Rescale web interface or they hit the maximum walltime specified when they are started. The script shuts down persistent clusters if they have been idle (not running any jobs) for longer than the time specified by the MAX_CLUSTER_IDLE_TIME configurable option available in the script. Specifying a cluster as persistent is an available option when starting a job through the Rescale web interface, and persistent clusters can also be started without any associated jobs using the API. There is, however, no available option for starting a persistent cluster along with a job through the API, so they cannot be started by this script. Users may want to consider using a separate utility script for starting persistent clusters if that is the desired workflow.

The script looks for persistent clusters that are started but not currently running any jobs, have the required version of XFdtd for the job installed, use the correct coretype, and have a sufficient number of cores to run the job. By default, the script also ensures that the cluster does not have more than 1.5x the required number of cores; if the cluster's core count exceeds that amount, the script will not queue the job to it, because rescaleqs assumes a larger job may come along that needs those resources. Users may want to tailor this area of the script to meet the specific needs of their organization.
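The matching check described above might be sketched as follows; the cluster field names are placeholders for whatever data the script gathers from Rescale:

    def find_reusable_cluster(clusters, solver_version, coretype_code, required_cores):
        """Return an idle persistent cluster suitable for this job, or None."""
        for cluster in clusters:
            if cluster["running_jobs"]:                          # must be idle
                continue
            if solver_version not in cluster["installed_xf_versions"]:
                continue
            if cluster["coretype_code"] != coretype_code:
                continue
            # Large enough for the job, but not so large (more than 1.5x the
            # required cores) that it should be held back for a bigger job.
            if required_cores <= cluster["cores"] <= 1.5 * required_cores:
                return cluster
        return None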

Once the script has the necessary information, it sends the job's data to Rescale, which creates, but does not queue, the job and returns a job ID if the submission was successful. This job ID is stored with the job data and written to the control file. The job is then submitted for execution and the control file extension is changed to *.submitted.
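A sketch of the create-then-submit sequence is shown below, assuming the Rescale REST API's job endpoints and token authentication; the payload schema and endpoint paths should be confirmed against Rescale's API documentation:

    import requests

    RESCALE_API = "https://platform.rescale.com/api/v2"

    def create_and_submit_job(api_token, job_definition):
        """Create a job on Rescale, then submit it, returning the Rescale job ID.

        job_definition is the JSON payload describing the analysis, coretype,
        core count, and input file ID.
        """
        headers = {"Authorization": "Token " + api_token}

        response = requests.post(RESCALE_API + "/jobs/",
                                 headers=headers, json=job_definition)
        response.raise_for_status()
        job_id = response.json()["id"]

        # Submitting is a separate call; creating the job does not queue it.
        response = requests.post(RESCALE_API + "/jobs/%s/submit/" % job_id,
                                 headers=headers)
        response.raise_for_status()
        return job_id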

After processing any *.readytosubmit control files, the instance manager thread looks for *.submitted control files. For each of these files, it uses the Rescale job ID stored with the associated job data to query Rescale for the job's status. If a Completed status code is returned, the control file extension is changed to *.readytodownload to forward the job to the next processing stage.
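The status check could be sketched as follows. It assumes a job statuses endpoint that returns the most recent entry first; the endpoint path, response structure, and ordering are assumptions to verify against Rescale's API documentation:

    import requests

    RESCALE_API = "https://platform.rescale.com/api/v2"

    def job_is_completed(api_token, rescale_job_id):
        """Report whether the most recent status for the job is Completed."""
        headers = {"Authorization": "Token " + api_token}
        response = requests.get(
            RESCALE_API + "/jobs/%s/statuses/" % rescale_job_id, headers=headers)
        response.raise_for_status()
        statuses = response.json().get("results", [])
        return bool(statuses) and statuses[0]["status"] == "Completed"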

After checking for control files and handling the jobs, the thread checks for any active persistent clusters that should be shut down. As mentioned previously, if all of the jobs queued to the cluster are complete and the cluster has been idle for longer than the MAX_CLUSTER_IDLE_TIME specified in the script's configurable options, the script sends a kill command to the cluster to shut it down.

Downloader Thread

This thread is responsible for downloading the simulation output file archive from Rescale.

The thread retrieves the Rescale job ID from the associated job data for each *.readytodownload control file that it encounters. It uses the job ID to search for an output-archive.zip file generated by Rescale according to the output filter settings used when the instance manager thread created the job on Rescale. Once the file is found, it begins downloading the output archive to the local run directory. When the download is complete, the control file's extension changes to *.readytoextract in order to hand control to the final processing stage.
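The download step might look roughly like the following sketch, assuming a job output-files listing endpoint whose entries include a name and a download URL; those field names and the endpoint path are assumptions to verify against Rescale's API documentation:

    import os
    import requests

    RESCALE_API = "https://platform.rescale.com/api/v2"

    def download_output_archive(api_token, rescale_job_id, run_dir):
        """Locate output-archive.zip among the job's output files and download it."""
        headers = {"Authorization": "Token " + api_token}

        # List the job's output files and find the archive by name.
        response = requests.get(RESCALE_API + "/jobs/%s/files/" % rescale_job_id,
                                headers=headers)
        response.raise_for_status()
        archive = next(f for f in response.json()["results"]        # assumed fields
                       if f["name"] == "output-archive.zip")

        # Stream the archive contents into the local run directory.
        local_path = os.path.join(run_dir, "output-archive.zip")
        with requests.get(archive["downloadUrl"], headers=headers, stream=True) as r:
            r.raise_for_status()
            with open(local_path, "wb") as out:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    out.write(chunk)
        return local_path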

Extractor Thread

This thread is responsible for extracting the simulation output file archive that was downloaded from Rescale.

The thread searches for *.readytoextract control files and inspects the output-archive.zip file that was downloaded to the simulation run directory. It then builds a 7-Zip command to extract the archive contents and place them in the output subdirectory within the local run directory. It launches a subprocess to execute the command, and upon successful extraction, it deletes the downloaded output archive and changes the control file extension to *.finished, which hands control back to the main processing thread to perform cleanup actions for the completed job.
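The extraction step mirrors the compression step and might be sketched as follows, again assuming 7z is on the PATH:

    import os
    import subprocess

    def extract_outputs(run_dir):
        """Extract output-archive.zip into the run directory's output subdirectory,
        then delete the downloaded archive."""
        archive_path = os.path.join(run_dir, "output-archive.zip")
        output_dir = os.path.join(run_dir, "output")

        # 7z x <archive> -o<dir> extracts with full paths into the given directory.
        subprocess.check_call(["7z", "x", archive_path, "-o" + output_dir, "-y"])
        os.remove(archive_path)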