Quick Update on Job Arrays
October 31, 2025 | Dr. Palmer
Well, it seems that you can learn a lot in just a few days of bashing your head against the keyboard. For a little context, a couple of days ago I posted a blog discussing my recent experience getting into job arrays for the Slurm workload manager and showing a sample Slurm submit script. If you need a refresher or if you’re just tuning in, I’ll link to it here.
Some of you keen-eyed individuals out there might notice that if I simply use ${SLURM_ARRAY_TASK_ID} as an indexer for my file names and log files, that implies I have some input files lying around that are numbered but not padded. If you have spent any time in the computing sphere, you know how gross a directory can start to look when things are labeled file1, file10, or file20. For those of you who don't spend a lot of time "in the trenches," let's say you have 10 files labeled file1 through file10. If you were to list them out, it would look something like the following:
file1
file10
file2
...
file7
file8
file9
If that doesn't infuriate you, I don't know what will… In all seriousness, though, this can turn into a pretty big headache down the line if you constructed your files sequentially. For instance, if you know that a value in file2 needs to be subtracted from one in file1, a grep for file1 also matches file10, since they share the same leading string, and grabbing the right files quickly becomes cumbersome. That's why it's nice to use a numbering or naming convention that will be more convenient for you later on. With that in mind, I would like to present two different naming conventions I have found over the past couple of days that may alleviate some unnecessary pain.
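To make that grep headache concrete, here is a quick throwaway demo you can run in an empty directory (the file names just mirror the listing above):
mkdir sort_demo && cd sort_demo
touch file{1..10}      # creates file1 through file10
ls | grep file1        # prints file1 AND file10
With zero-padded names like file01 through file10, a grep for file01 matches exactly one file, and a plain ls comes back in the order you actually meant.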
1. Padded Array IDs
If you’d like to stick with a numerical naming convention but would like fewer headaches in the future, here is the code snippet for easily padding your array IDs.
#!/bin/bash
#SBATCH --job-name=padded_arrays
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#SBATCH --time=48:00:00
#SBATCH --mem=9G
#SBATCH --output=file-%a.log
# Pad the array task ID to two digits (e.g., 1 -> 01) to match the padded file names
raw_id=$SLURM_ARRAY_TASK_ID
pad_id=$(printf "%02d" "$raw_id")
program.sh file-${pad_id}.com
Utilizing printf here allows us to take the array ID, which has been assigned to the new variable raw_id, and pad it with leading zeros to a fixed width of two digits (that's what the %02d format does). Now, if your files have a padded numerical index like file-01 or file-10, 1) they'll list in the correct order in the directory structure, making later file parsing easier, and 2) they are still reachable from the Slurm job array, since the array indices themselves remain plain integers!
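The same trick works when you generate the input files in the first place. Here is a minimal sketch of that idea; template.com is just a hypothetical starting file standing in for however you actually build your inputs:
for i in $(seq 1 10); do
    pad_id=$(printf "%02d" "$i")       # 1 -> 01, 2 -> 02, ..., 10 -> 10
    cp template.com "file-${pad_id}.com"
done
Listing the directory afterward gives file-01.com through file-10.com in exactly the order you would expect, and the submit script above picks them up without any further changes. If you ever run more than 99 tasks, just widen the format to %03d in both places.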
2. Strings as Slurm Array IDs
What if you don't want a numerical index in the file name at all? What if your file generation script produces explicitly named files that will be easier to parse through later without numbers? Lucky for you, there is a pretty clever and cute way of getting around the need for numbered file names. Here is the code snippet for using strings with Slurm array IDs:
#!/bin/bash
#SBATCH --job-name=string_arrays
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#SBATCH --time=48:00:00
#SBATCH --mem=9G
#SBATCH --output=file-%a.log
# Grab the Nth line of file_list, where N is this task's array ID
input="$( sed "${SLURM_ARRAY_TASK_ID}q;d" file_list )"
program.sh "$input"
Essentially, you can give the files you throw at your job array whatever names you want. All you need to do is gather a list of the file names, e.g., ls [your file name here] > file_list, and that clever sed command then uses the current Slurm array ID as an index for grabbing a specific line out of file_list! How neat is that? In case you're wondering, yes, it works. I revamped my whole project today just so I could use this naming convention for easier data handling later on. However, I must say that I did not figure either of these out on my own and will definitely give credit where credit is due. I found this document very helpful in this endeavor.
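If you want to convince yourself of how the sed one-liner behaves before submitting anything, here is a minimal sketch you can run outside of Slurm; the water_*.com names are made up purely for illustration:
touch water_dimer.com water_trimer.com water_tetramer.com   # stand-in input files
ls water_*.com > file_list
for task_id in 1 2 3; do
    # same one-liner as in the submit script: print line N of file_list, then quit
    input="$( sed "${task_id}q;d" file_list )"
    echo "Task ${task_id} would run: program.sh ${input}"
done
The only thing to keep straight is that the --array range in the submit script should match the number of lines in file_list (1-10 in the snippet above means file_list needs ten entries).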
Again, I hope this little blog and code snippet tutorial have been helpful for you in some way. Thanks for tuning in, and be sure to subscribe to my RSS feed! You should find a link at the bottom of the page. Have a good one, folks!