In computational environments, a desktop workstation is used to prepare a project, create simulations, and view results, but the simulations may be performed on a high-performance computing (HPC) system. External Queue Integration (EQI) allows users to queue simulations directly with such systems rather than manually submit them for execution.
XF supports job submissions to queues and cloud-based systems:
- Local queue: a simulation queue maintained in the simulations window of XF's user interface (UI).
- External queue: a simulation queue on an HPC system, typically shared by a team, department, or company and maintained by IT professionals. Such a system usually has a job control system installed, such as the Portable Batch System (PBS) or the Simple Linux Utility for Resource Management (Slurm). Remcom provides the xfqs.template program to watch for new XF simulations and submit them to the job control system.
- Rescale cloud computing: a cloud-based HPC system. Remcom provides the rescaleqs.py.template program to watch for new XF simulations and submit them to Rescale.
EQI must be configured in order to utilize external queues and Rescale cloud computing. This is done by customizing either xfqs.template or rescaleqs.py.template.
Requirements
Two basic requirements must be met in order to use EQI:
- The workstation and HPC system(s) must share a common filesystem. The filesystem must contain an eqi-control folder that is visible to both systems, and an XF project must be saved on that filesystem at either the same directory tree level as the control folder or below it. All EQI users must have write access to both the control folder and the location where the projects are saved.
- The HPC system must run a process that watches the control folder for *.xfsubmit and *.xfcancel files, which provide information about simulations to submit and cancel, respectively. This process is performed using the provided template programs, and its purpose is to issue the appropriate commands to the job control system to submit simulations for execution or to cancel them. The process could be started either periodically as a cron job or continuously as a daemon.
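For instance, when scheduled with cron, a crontab entry along the following lines could start the watcher once per minute; the install and log paths shown here are illustrative, not fixed by XF:

```
# Run the queue watcher every minute, appending its output to a log file.
* * * * * /opt/xf/remcom/bin/xfqs >> /var/log/xfqs.log 2>&1
```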
For external queues, the control folder must be named eqi-control.
For example, an external queue HPC system on Linux may see the filesystem /data/simulation, which contains both the eqi-control and projects folders. A Windows workstation may see the same filesystem as the S:\ drive, which maps to /data/simulation on the HPC system, so the workstation sees the same folders as S:\eqi-control and S:\projects. In this case, XF projects that utilize EQI for simulation are saved in the S:\projects folder.
Additional requirements must be met when using rescaleqs.py:
- 7-Zip must be installed on the HPC system.
- An account with Rescale must be created.
Individual simulations that utilize multiple compute resources also require a Message Passing Interface (MPI) installation on the external queue HPC system.
Process Description
When a simulation is queued using EQI from within XF, XF begins looking for eqi-control at the project directory and moves up the directory tree; the eqi-control folder must be a sibling of the project directory or of one of its ancestors. If XF does not find the folder, it issues an error to the user. If the folder is found, XF writes a control file with a unique name ending in .xfsubmit to that folder. The control file contains the location of the run within the simulation being submitted for execution, as well as specifications for how to execute the simulation. When creating a simulation that contains multiple runs, XF writes a control file for each run. Once XF has written control files for the complete simulation, it returns control of the UI to the user.
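As a rough illustration, the upward search could be implemented along these lines in Python; the function name and structure here are illustrative, not part of XF:

```python
import os

def find_eqi_control(project_dir):
    """Walk up the directory tree starting at the project directory,
    returning the first eqi-control folder found, or None if the
    filesystem root is reached without finding one."""
    current = os.path.abspath(project_dir)
    while True:
        candidate = os.path.join(current, "eqi-control")
        if os.path.isdir(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent
```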
While xfqs or rescaleqs is running on the HPC system, it periodically checks the control folder(s) for both *.xfsubmit and *.xfcancel files. When it finds a submission file, it performs the following tasks (sketched in code after the list):
- Reads the submission file to determine which simulation to submit to either the job control system or Rescale.
- Reads the specifications provided in the submission file to determine job control parameters.
- Reads the project.info file in the simulation directory for simulation requirements in order to customize the job control parameters governing resource allocation.
- Submits the simulation to the job control system or Rescale.
- Saves the job identifier provided by either the job control system or Rescale for future reference. Saving it as eqi-jobid.dat in the run directory is a convenient way to associate the job identifier with a particular run.
- Writes job submission information to the project.clog file in the simulation directory, which is displayed in XF's simulations window.
- Removes the submission file.
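A minimal sketch of this submission-side handling, assuming Slurm as the job control system, might look like the following Python. The helper names, the flag mappings, and the run_simulation.sh wrapper are hypothetical; a production script such as xfqs.template would also handle errors and consult project.info:

```python
import os
import subprocess

def parse_control_file(path):
    """Parse a control file: one 'keyword value' pair per line."""
    fields = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                keyword, _, value = line.partition(" ")
                fields[keyword] = value
    return fields

def handle_submission(control_dir, submit_file):
    """Process one *.xfsubmit file found in the control folder."""
    fields = parse_control_file(submit_file)
    # simDir is relative to the control folder; a real script would also
    # resolve the specific run directory within the simulation.
    sim_dir = os.path.join(control_dir, fields["simDir"])

    # Map the guidance keywords onto Slurm options (illustrative mapping;
    # a value of 0 would mean the daemon chooses the resource count itself).
    cmd = ["sbatch"]
    if int(fields.get("useXStream", "0")) > 0:
        cmd.append("--gres=gpu:" + fields["useXStream"])  # GPU count
    if int(fields.get("useMPI", "0")) > 0:
        cmd.append("--ntasks=" + fields["useMPI"])        # MPI process count
    if "batchOptions" in fields:
        cmd.extend(fields["batchOptions"].split())        # passed through verbatim
    cmd.append("run_simulation.sh")  # hypothetical wrapper that launches the solver

    result = subprocess.run(cmd, cwd=sim_dir, capture_output=True, text=True)
    job_id = result.stdout.split()[-1]  # sbatch prints "Submitted batch job <id>"

    # Save the job ID with the run so a later cancellation can find it.
    with open(os.path.join(sim_dir, "eqi-jobid.dat"), "w") as f:
        f.write(job_id + "\n")

    # Append to project.clog, which XF displays in the simulations window.
    with open(os.path.join(sim_dir, "project.clog"), "a") as f:
        f.write("Submitted job " + job_id + " for " + fields["simDir"] + "\n")

    os.remove(submit_file)
```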
When xfqs or rescaleqs finds a cancellation file, it performs the following tasks (also sketched after the list):
- Reads the cancellation file and determines the job ID and simulation folder to cancel.
- Removes the specified job from either the job control system or Rescale and cancels execution if necessary.
- Writes information about the job cancellation to the project.clog file, which is displayed in XF's simulations window.
- Removes the cancellation file.
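Continuing the same hypothetical sketch, the cancellation side might look like this, again assuming Slurm and reusing parse_control_file from above:

```python
import os
import subprocess

def handle_cancellation(control_dir, cancel_file):
    """Process one *.xfcancel file found in the control folder."""
    fields = parse_control_file(cancel_file)
    sim_dir = os.path.join(control_dir, fields["simDir"])

    # The job ID was saved as eqi-jobid.dat at submission time (see above).
    with open(os.path.join(sim_dir, "eqi-jobid.dat")) as f:
        job_id = f.read().strip()

    subprocess.run(["scancel", job_id])  # remove the job from Slurm

    with open(os.path.join(sim_dir, "project.clog"), "a") as f:
        f.write("Cancelled job " + job_id + " for " + fields["simDir"] + "\n")

    os.remove(cancel_file)
```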
Control Files
XF's UI generates a *.xfsubmit file when creating a simulation and a *.xfcancel file when terminating a simulation. The format of the submission file is a set of lines, each of which contains one keyword and one value that are separated by a space.
| Keyword | Value | Notes | Optional |
|---|---|---|---|
| simDir | string | The path to the simulation folder relative to the control folder. | No |
| userName | string | The username of the user who wrote the submission file. | Yes |
| priority | string | Priority guidance: one of Low, Normal, or High. | Yes |
| useXStream | number | XStream guidance. If absent, XStream use is not indicated. If present, a value of 0 indicates that the daemon determines the number of GPUs; any other value specifies the number of GPUs to use. | Yes |
| useMPI | number | MPI guidance. If absent, MPI use is not indicated. If present, a value of 0 indicates that the daemon determines the number of MPI processes; any other value specifies the number of processes to use. | Yes |
| batchOptions | string | Text passed directly to the job submission command on the command line, most likely after all other options. | Yes |
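As an illustration, a submission file built from these keywords might look like the following; the path and values are hypothetical:

```
simDir projects/antenna.xf/Simulations/000001/Run0001
userName jdoe
priority Normal
useXStream 1
batchOptions --partition=gpu
```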
The format of the cancellation file is a set of lines, each of which contains one keyword and one value that are separated by a space.
| Keyword | Value | Notes | Optional |
|---|---|---|---|
| simDir | string | The path to the simulation folder relative to the control folder. | No |
| userName | string | The username of the user who wrote the cancellation file. | Yes |
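A matching hypothetical cancellation file would be even shorter:

```
simDir projects/antenna.xf/Simulations/000001/Run0001
userName jdoe
```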
Directives Documentation
In the standard configuration, XF writes control files to the eqi-control folder, which is monitored by a single daemon and appears alongside the Local queue selection when creating a simulation. Multiple daemons may run simultaneously by creating a subfolder of eqi-control for each additional daemon instance. Each daemon is configured to watch either the main eqi-control folder or one of its subfolders; each watched folder is an available selection when creating a simulation, allowing users to submit to one of several queues. The displayed name of the eqi-control directory can be customized by creating an eqi-control/eqi.txt file containing a single line with the text to be displayed.
For example, suppose xfqs.template and rescaleqs.py.template are configured to watch the eqi-control/ folder and the eqi-control/Rescale folder, respectively. If an eqi-control/eqi.txt file is then created containing the text In-house Cluster, then Local, In-house Cluster, and Rescale will be the available queue options when submitting a simulation.
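The corresponding folder layout would look roughly like this:

```
eqi-control/          # watched by xfqs; displayed as "In-house Cluster"
    eqi.txt           # a file containing the single line: In-house Cluster
    Rescale/          # watched by rescaleqs; displayed as "Rescale"
```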
xfqs Documentation
For external queues, Remcom provides xfqs.template, a bash script configured to monitor the eqi-control folder and submit commands to Slurm following the process description outlined above. Each HPC system is unique, so this script serves as a starting point for system administrators implementing EQI on their systems.
The xfqs.template script is provided in {xf-install-dir}/remcom/bin. The script itself is heavily commented and therefore will not be discussed in detail here.