Re-executing Computations with datalad rerun
The problem: provenance as dead documentation #
Many systems record provenance – the chain of steps that produced a result. Workflow management tools generate DAGs, lab notebooks describe procedures, README files list the commands that were run. But in most cases these records are inert: they describe what happened, but they cannot make it happen again.
Consider a Git commit message that says:
Normalize and filter raw survey responses
Ran: python code/clean.py --threshold 0.05 inputs/raw/survey.csv outputs/clean/survey.csv
This is useful documentation, but it is just text. To re-execute the step, a human must read the message, extract the command, check that the files are in place, and run it manually. If the message has a typo, or the file paths have changed, or the script has been updated, the re-execution will silently produce different results – or fail entirely.
The gap between “recorded provenance” and “executable provenance” is the gap between documentation and actionability.
The solution: datalad rerun turns records into actions #
When a computation is recorded with datalad run (see the
companion example), the
resulting commit contains a machine-readable run record – a JSON object
specifying the exact command, inputs, and outputs. datalad rerun reads
this record and re-executes the command automatically:
datalad rerun <commit-hash>
That single command does all of the following:
- Parses the run record from the specified commit.
- Gets the declared input files (fetching from a remote annex if necessary).
- Unlocks the declared output files so they can be overwritten.
- Executes the recorded command string.
- Saves the result as a new commit that links back to the original run record (if the outputs are unchanged, there is nothing to commit).
No manual extraction of commands from commit messages. No guessing about file paths or flags. The provenance record is the execution plan.
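To make the steps above tangible, here is a minimal Python sketch that turns a run record into the commands a person would otherwise type by hand. It is an illustration only: `manual_equivalent` is a hypothetical helper, the `inputs` and `outputs` in the example are assumed (the plain commit message shown earlier declares neither), and datalad rerun does this work through DataLad's own code rather than by generating shell commands.

```python
import shlex


def manual_equivalent(record, message="Manual re-execution"):
    """Return the shell commands needed to redo one run record by hand.

    `record` is a dict with the keys shown in a run record:
    "cmd" (required), "inputs" and "outputs" (optional lists of paths).
    """
    steps = []
    if record.get("inputs"):
        # Step 2: make sure the input content is present locally.
        paths = " ".join(shlex.quote(p) for p in record["inputs"])
        steps.append("datalad get " + paths)
    if record.get("outputs"):
        # Step 3: make existing outputs writable so they can be overwritten.
        paths = " ".join(shlex.quote(p) for p in record["outputs"])
        steps.append("datalad unlock " + paths)
    # Step 4: the recorded command string, unchanged.
    steps.append(record["cmd"])
    # Step 5: commit whatever changed.
    steps.append("datalad save -m " + shlex.quote(message))
    return steps


# Assumed inputs/outputs for the survey-cleaning command from the introduction.
record = {
    "cmd": "python code/clean.py --threshold 0.05 "
           "inputs/raw/survey.csv outputs/clean/survey.csv",
    "inputs": ["inputs/raw/survey.csv"],
    "outputs": ["outputs/clean/survey.csv"],
}
for step in manual_equivalent(record):
    print(step)
```

Every line this prints is something datalad rerun spares you from typing, and from mistyping.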
Concrete example: verification workflow #
Suppose your dataset has a commit from datalad run that generated
statistical results:
git log --oneline
# a1b2c3d (HEAD -> main) [DATALAD RUNCMD] Compute group statistics
# f6e5d4c Add preprocessed data
# 9a8b7c6 Initial commit
You want to verify that the results are reproducible. First, inspect the run record to understand what was recorded:
git log -1 --format=%B a1b2c3d
Output:
[DATALAD RUNCMD] Compute group statistics
=== Do not change lines below ===
{
"cmd": "python code/analyze.py outputs/preprocessed/ outputs/statistics/results.json",
"inputs": ["outputs/preprocessed/"],
"outputs": ["outputs/statistics/results.json"],
"exit": 0,
"pwd": "."
}
^^^ Do not change lines above ^^^
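Because the record is plain JSON between two fixed marker lines, it can also be read programmatically. The following is a minimal sketch using only the Python standard library; `parse_run_record` is a hypothetical helper written for this example, not a DataLad API, and it assumes the message layout shown above.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def parse_run_record(message):
    """Return the run record embedded in a commit message, or None if absent."""
    lines = [line.strip() for line in message.splitlines()]
    try:
        start = lines.index(START)
        end = lines.index(END, start + 1)
    except ValueError:
        # One of the marker lines is missing: not a run commit.
        return None
    return json.loads("\n".join(lines[start + 1:end]))


# The commit message from the example above.
message = """[DATALAD RUNCMD] Compute group statistics
=== Do not change lines below ===
{
"cmd": "python code/analyze.py outputs/preprocessed/ outputs/statistics/results.json",
"inputs": ["outputs/preprocessed/"],
"outputs": ["outputs/statistics/results.json"],
"exit": 0,
"pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = parse_run_record(message)
print(record["cmd"])
print(record["inputs"], "->", record["outputs"])
```

In practice the message text would come from the git log command shown earlier, for example captured with `subprocess.run`.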
Now re-execute:
datalad rerun a1b2c3d
DataLad fetches the inputs (if needed), runs the exact same command, and
saves the result. If the outputs are identical to the recorded ones, there
is nothing new to save, no commit is created, and you have confirmed
reproducibility. If they differ, DataLad records the changed outputs in a
new commit, and git diff will show you exactly what changed, pointing to
non-determinism in the computation or a change in the software environment.
Checking for differences #
After the rerun, compare the current output to the version recorded in the original commit:
# Did the rerun produce identical files?
git diff a1b2c3d -- outputs/statistics/results.json
Comparing against the original commit's hash works whether or not the rerun created a new commit. If the diff is empty, the computation is reproducible. If not, you have a concrete starting point for investigation: the exact same command, on the exact same inputs, produced different outputs. That narrows the problem to non-determinism in the computation (unseeded random number generators, floating-point summation order) or to the software environment (library versions).
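When the outputs do differ, it helps to know whether values moved by rounding noise or by a meaningful amount. The sketch below compares two parsed JSON documents with a relative tolerance. It assumes results.json holds nested objects, lists, and numbers; its real structure is not shown in this document, and the group names and values in the example are made up for illustration.

```python
import math


def is_number(value):
    # bool is a subclass of int in Python; treat it as a non-number here.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def differences(old, new, rel_tol=1e-9, path=""):
    """Return the paths at which two parsed JSON documents differ beyond rel_tol."""
    if isinstance(old, dict) and isinstance(new, dict):
        found = []
        for key in sorted(set(old) | set(new)):
            child = f"{path}/{key}"
            if key not in old or key not in new:
                found.append(child)  # key exists on one side only
            else:
                found.extend(differences(old[key], new[key], rel_tol, child))
        return found
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            return [path or "/"]
        found = []
        for index, (a, b) in enumerate(zip(old, new)):
            found.extend(differences(a, b, rel_tol, f"{path}/{index}"))
        return found
    if is_number(old) and is_number(new):
        return [] if math.isclose(old, new, rel_tol=rel_tol) else [path or "/"]
    return [] if old == new else [path or "/"]


# Made-up values: group_a differs only by floating-point noise, group_b for real.
original = {"group_a": {"mean": 0.1 + 0.2, "n": 40}, "group_b": {"mean": 1.5, "n": 38}}
rerun = {"group_a": {"mean": 0.3, "n": 40}, "group_b": {"mean": 1.7, "n": 38}}
print(differences(original, rerun))  # ['/group_b/mean']
```

Which tolerance is acceptable is a judgment about the analysis, not something DataLad decides; a byte-level diff remains the strictest check.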
Concrete example: updating workflow #
A second powerful use of datalad rerun is propagating changes through a
pipeline. Suppose the raw input data is corrected (a data entry error is
fixed). You want to regenerate all downstream results:
git log --oneline
# b2c3d4e (HEAD -> main) [DATALAD RUNCMD] Generate figures
# a1b2c3d [DATALAD RUNCMD] Compute group statistics
# f6e5d4c [DATALAD RUNCMD] Preprocess raw data
# 9a8b7c6 Fix data entry error in raw/survey.csv
# 1234567 Add raw data
To re-execute the full pipeline from the preprocessing step onward:
datalad rerun --since 9a8b7c6
The --since flag tells DataLad to re-execute every datalad run commit
after the specified commit. It will:
- Rerun the preprocessing step (commit f6e5d4c).
- Rerun the statistics computation (commit a1b2c3d).
- Rerun the figure generation (commit b2c3d4e).
Each step uses the (now corrected) outputs of the previous step as its inputs. The entire pipeline is re-executed in order, and the results reflect the corrected raw data.
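The selection rule behind --since can be sketched in a few lines of Python. This is a simplified illustration for a linear history, not DataLad's implementation: `commits_to_rerun` is a hypothetical helper, and it recognizes run commits by their subject prefix purely for brevity.

```python
RUN_MARKER = "[DATALAD RUNCMD]"


def commits_to_rerun(history, since):
    """Pick the recorded run commits that come after `since`.

    `history` is a list of (hash, subject) pairs, newest first, in the
    order printed by `git log --oneline`. The result is oldest first,
    which is the order in which the steps must be re-executed.
    """
    hashes = [commit_hash for commit_hash, _ in history]
    if since not in hashes:
        raise ValueError(f"unknown commit: {since}")
    newer = history[:hashes.index(since)]  # everything above `since` in the log
    return [
        commit_hash
        for commit_hash, subject in reversed(newer)
        if subject.startswith(RUN_MARKER)
    ]


# The history from the example above.
history = [
    ("b2c3d4e", "[DATALAD RUNCMD] Generate figures"),
    ("a1b2c3d", "[DATALAD RUNCMD] Compute group statistics"),
    ("f6e5d4c", "[DATALAD RUNCMD] Preprocess raw data"),
    ("9a8b7c6", "Fix data entry error in raw/survey.csv"),
    ("1234567", "Add raw data"),
]
print(commits_to_rerun(history, "9a8b7c6"))  # ['f6e5d4c', 'a1b2c3d', 'b2c3d4e']
```

Commits without a run record, such as the manual data fix, are skipped: there is no recorded command to replay for them.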
The difference between datalad run and datalad rerun #
These two commands are complementary halves of the same workflow:
| Aspect | datalad run | datalad rerun |
|---|---|---|
| Purpose | Record a new computation | Re-execute a previously recorded computation |
| Input | A command typed by the user | A commit hash (or range) |
| Creates | A new commit with a run record | A new commit with the re-executed results (if they changed), linked to the original run record |
| When to use | First time you run a command | Verification, updating, or re-execution |
Think of datalad run as recording and datalad rerun as playback.
The recording captures the full specification; playback faithfully
reproduces it.
Connection to Actionability (A) #
The STAMPED Actionability principle states that dataset operations
should be executable, not just documented. datalad rerun is the
mechanism that makes this principle concrete:
- A commit message that says “we ran script X” is documentation.
- A datalad run commit that contains a structured run record is actionable documentation.
- datalad rerun is the action – it reads the documentation and executes it.
Without datalad rerun, run records would be valuable metadata but still
require manual interpretation. With it, the entire provenance chain
becomes a push-button operation.
Connection to Ephemerality (E) #
The Ephemerality principle states that derived and regenerable content
should be treated as ephemeral. datalad rerun is what makes this
practical:
- If every derived file was produced by a datalad run commit, then every derived file can be regenerated by datalad rerun.
- This means derived files do not need to be permanently stored – they can be dropped from local storage (using datalad drop) and regenerated on demand.
- The repository stays lean: it stores the recipes (run records) rather than the products (large derived files).
The combination of datalad run (recording), datalad rerun
(re-execution), and datalad drop (reclaiming space) forms a complete
lifecycle for ephemeral data:
    run                     drop              rerun
raw --> derived (committed) --> pointer only -------> derived (regenerated)
        [recipe recorded]       [space reclaimed]     [recipe re-executed]
Practical considerations #
Software environment matters #
datalad rerun re-executes the command, but it does not recreate the
software environment. If you ran the original command with Python 3.10
and scikit-learn 1.2, but your current environment has Python 3.12 and
scikit-learn 1.4, the results may differ.
For full reproducibility, combine datalad run with environment capture:
- Container images: Use datalad containers-run (provided by the datalad-container extension) to execute commands inside a Docker or Singularity container. The container image is recorded in the run record alongside the command.
- Lock files: Track a pinned requirements.txt or conda-lock.yml in the repository so the exact package versions are part of the dataset’s version history.
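Where neither a container nor a lock file is available, even a simple snapshot of the interpreter and installed package versions makes later differences easier to explain. The following is a minimal sketch using only the Python standard library. The file name environment-snapshot.json is a hypothetical choice, the snapshot covers Python packages only, and it complements rather than replaces the two options above.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect the Python version and the versions of all installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    ordered = dict(sorted(packages.items(), key=lambda item: item[0].lower()))
    return {"python": platform.python_version(), "packages": ordered}


def write_snapshot(path="environment-snapshot.json"):
    """Write the snapshot as stable, diff-friendly JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(environment_snapshot(), handle, indent=2)
        handle.write("\n")


if __name__ == "__main__":
    write_snapshot()
```

Saving this file in the dataset before each recorded computation puts the environment into the same version history as the results, so a later git diff can show which package versions changed between two runs.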
Rerunning a single step vs. a range #
# Re-execute a single recorded step
datalad rerun a1b2c3d
# Re-execute all recorded steps after a given commit
datalad rerun --since 9a8b7c6
# Re-execute all recorded steps in the entire history
datalad rerun --since ""
Handling failures #
If a rerun fails (non-zero exit code), DataLad will not commit the broken output. The working tree will contain the partial results, and you can inspect what went wrong before deciding how to proceed.
Combining with --script #
You can extract the commands from a range of run records into a shell script without executing them:
datalad rerun --since 9a8b7c6 --script pipeline.sh
This produces a standalone script containing the exact commands in order. It is useful for review, for running on a cluster, or for porting to a system where DataLad is not installed.
Summary #
datalad rerun closes the loop between provenance and action. When
every data transformation is recorded with datalad run, the dataset’s
history is not just a log of what happened – it is a complete,
re-executable specification of how to produce the current state from
the original inputs. This turns provenance from passive metadata into
an active tool for verification, updating, and space management.