From Script to STAMPED Research Object
The task #
Sum prices from a tiny CSV — a grocery receipt:
item,price
apples,1.50
bread,2.30
milk,3.20
Processing: awk -F, 'NR>1 {sum+=$2} END {printf "%.2f\n", sum}' prices.csv
Result: 7.00
This is trivially understandable, yet enough to demonstrate every STAMPED property across four progressive scenarios. Each scenario script follows the ephemeral shell reproducer skeleton for portability.
Scenario 1: Self-contained script (S, E, P) #
The simplest case: a single script that creates the data inline, sums the
prices, and prints the total. Requires only POSIX sh and awk.
#!/bin/sh
# Grocery receipt: sum prices from a CSV
set -eux
PS4='> '
cd "$(mktemp -d "${TMPDIR:-/tmp}/grocery-XXXXXXX")"
# --- generate data ---
cat > prices.csv <<'EOF'
item,price
apples,1.50
bread,2.30
milk,3.20
EOF
# --- process: sum the prices ---
export LC_ALL=C
awk -F, 'NR>1 {sum+=$2} END {printf "%.2f\n", sum}' prices.csv > total.txt
echo "=== Total ==="
cat total.txt
This script is self-contained (the data is generated inline) and ephemeral (runs in a fresh temp directory).
Why LC_ALL=C? The decimal point is locale-dependent. On a system with
LC_ALL=de_DE.UTF-8, awk might interpret 1.50 as 1 (treating . as a
thousands separator) or produce output with commas instead of periods. Setting
LC_ALL=C forces POSIX numeric conventions — consistent behavior regardless
of the host locale. This makes the script more portable (assuming awk is available).
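The locale effect is easy to check on any POSIX system. A minimal sketch (only the `LC_ALL=C` line has a guaranteed result; what the un-forced command prints depends on the host's locale and awk implementation):

```shell
#!/bin/sh
set -eu
cd "$(mktemp -d)"
printf 'item,price\napples,1.50\nbread,2.30\nmilk,3.20\n' > prices.csv
# Forced POSIX numeric conventions: prints 7.00 everywhere
LC_ALL=C awk -F, 'NR>1 {sum+=$2} END {printf "%.2f\n", sum}' prices.csv
# Under a comma-decimal locale such as de_DE.UTF-8 (if installed), the same
# command without LC_ALL=C may parse 1.50 as 1 or print the sum as 7,00.
```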
Scenario 2: Makefile as actionable specification (+ T, A) #
The same analysis, but now organized as a git repository with a Makefile
that declares the dependency graph. This adds
tracking (git records every change)
and actionability (make re-derives
results from source).
#!/bin/sh
# Grocery receipt as a tracked, actionable git repository
set -eux
PS4='> '
cd "$(mktemp -d "${TMPDIR:-/tmp}/grocery-XXXXXXX")"
git init grocery-analysis
cd grocery-analysis
git config user.email "demo@example.com"
git config user.name "Demo User"
# --- data ---
cat > prices.csv <<'EOF'
item,price
apples,1.50
bread,2.30
milk,3.20
EOF
# --- analysis script ---
cat > sum-prices.sh <<'SCRIPT'
#!/bin/sh
set -eu
export LC_ALL=C
awk -F, 'NR>1 {sum+=$2} END {printf "%.2f\n", sum}' prices.csv > total.txt
SCRIPT
chmod +x sum-prices.sh
# --- Makefile: the actionable specification ---
cat > Makefile <<'MF'
.POSIX:
all: total.txt
total.txt: prices.csv sum-prices.sh
	./sum-prices.sh
clean:
	rm -f total.txt
.PHONY: all clean
MF
# --- README ---
cat > README.md <<'README'
# Grocery Receipt Analysis
Run `make` to produce `total.txt` from `prices.csv`.
Requires: POSIX sh, awk, make.
README
# --- .gitignore: outputs are derived, not tracked ---
cat > .gitignore <<'GI'
total.txt
GI
git add -A
git commit -m "Initial commit: grocery receipt analysis"
# --- run it ---
make
echo "=== Total ==="
cat total.txt
echo ""
echo "=== Provenance: the Makefile + git log ==="
git log --oneline
The Makefile is the actionable specification: it declares that total.txt
depends on prices.csv and sum-prices.sh, and make will only re-run
the analysis when an input changes. Git tracks the full history.
This is a substantial improvement over a loose script: git clone + make
is all anyone needs to reproduce the result. But make records what
to run, not what environment to run it in — the host’s awk version is
still implicit.
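Make's change detection is easy to observe in miniature. A throwaway sketch of the same layout (independent of the repository above; note the literal tab before each recipe line, which make requires):

```shell
#!/bin/sh
set -eu
cd "$(mktemp -d)"
printf 'item,price\napples,1.50\nbread,2.30\nmilk,3.20\n' > prices.csv
cat > sum-prices.sh <<'EOF'
#!/bin/sh
set -eu
export LC_ALL=C
awk -F, 'NR>1 {sum+=$2} END {printf "%.2f\n", sum}' prices.csv > total.txt
EOF
chmod +x sum-prices.sh
# \t below is a real tab character in the Makefile
printf 'total.txt: prices.csv sum-prices.sh\n\t./sum-prices.sh\n' > Makefile
make            # first run: builds total.txt
make            # second run: target is up to date, the recipe does not re-run
touch prices.csv
make            # an input is now newer than the output, so the recipe runs again
cat total.txt
```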
Scenario 3: Containerized execution with Alpine (+ P) #
To pin the computational environment, we run the analysis inside a minimal
Alpine Linux container (~3 MB as a .sif
image). Alpine includes BusyBox awk — exactly what our script needs,
nothing more.
The examples below use Singularity to pull
and execute the container. The same approach works with
Apptainer (the open-source fork — just replace
singularity with apptainer), or with Docker/Podman if you prefer an
OCI-native workflow (docker run --rm -v "$PWD:$PWD" -w "$PWD" alpine:3.21 ./sum-prices.sh).
#!/bin/sh
# Grocery receipt with containerized execution via Alpine
set -eux
PS4='> '
cd "$(mktemp -d "${TMPDIR:-/tmp}/grocery-XXXXXXX")"
# --- pull a minimal container image ---
singularity pull docker://alpine:3.21
git init grocery-analysis
cd grocery-analysis
git config user.email "demo@example.com"
git config user.name "Demo User"
# --- same data and script as Scenario 2 ---
cat > prices.csv <<'EOF'
item,price
apples,1.50
bread,2.30
milk,3.20
EOF
cat > sum-prices.sh <<'SCRIPT'
#!/bin/sh
set -eu
export LC_ALL=C
awk -F, 'NR>1 {sum+=$2} END {printf "%.2f\n", sum}' prices.csv > total.txt
SCRIPT
chmod +x sum-prices.sh
# --- Makefile: run inside the container ---
cat > Makefile <<'MF'
.POSIX:
SIF = ../alpine_3.21.sif
all: total.txt
total.txt: prices.csv sum-prices.sh $(SIF)
	singularity exec --cleanenv $(SIF) ./sum-prices.sh
clean:
	rm -f total.txt
.PHONY: all clean
MF
cat > .gitignore <<'GI'
total.txt
GI
cat > README.md <<'README'
# Grocery Receipt Analysis (containerized)
Run `make` to produce `total.txt` from `prices.csv`.
The analysis runs inside an Alpine Linux container to guarantee
identical results regardless of the host system's awk version.
Requires: POSIX sh, make, singularity (or apptainer).
The container image (`alpine_3.21.sif`) must be present in the
parent directory — see Makefile for details.
README
git add -A
git commit -m "Initial commit: containerized grocery receipt analysis"
# --- run it ---
make
echo "=== Total ==="
cat total.txt
Now every collaborator gets the same BusyBox awk regardless of whether their
host has gawk, mawk, or something else. This demonstrates
portability: the script no longer
depends on whatever happens to be installed on the host.
But the container reference docker://alpine:3.21 is not pinned —
the 3.21 tag is mutable (Alpine publishes point releases under the same
tag). And the script depends on Docker Hub being available: if the
network is down or the registry is unavailable, the pull fails.
- S is weakened — the container lives on Docker Hub, not in our repository.
- T is weak — we know “Alpine 3.21” but not which exact build.
Scenario 3b: Pinning the container by digest (recovering T) #
A simple fix for the tracking problem: reference the image by its content-addressed digest rather than a mutable tag.
The only line that changes from Scenario 3:
# Before (mutable tag — could change between builds):
singularity pull docker://alpine:3.21
# After (pinned digest — immutable):
singularity pull docker://alpine@sha256:a8560b36e8b8210634f77d9f7f9efd7ffa463e380b75e2e74aff4511df3ef88c
With a digest, two people running the script a year apart will pull
byte-identical image content — the registry cannot serve different bits for the
same sha256, because the digest is a cryptographic hash of the content itself. This recovers
tracking: the provenance now records
exactly which environment was used, down to every library version.
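Content addressing can be illustrated in miniature with an ordinary hash tool (a sketch using `sha256sum` from GNU coreutils; on macOS, `shasum -a 256` is the equivalent):

```shell
#!/bin/sh
# The digest is a pure function of the bytes, so the same
# reference can never resolve to different content.
set -eu
cd "$(mktemp -d)"
printf 'layer-contents\n' > a
printf 'layer-contents\n' > b
sha256sum a b          # identical bytes, identical digest
printf 'layer-contents!\n' > b
sha256sum a b          # one changed byte, a completely different digest
```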
But self-containment is still missing. The image lives on Docker Hub, not inside our project. If the registry imposes pull rate limits, or the network is simply unavailable (an air-gapped HPC cluster), the script cannot obtain its dependency. The digest is a precise reference, not a local copy.
This is the gap that Scenario 4 closes.
Scenario 4: Container committed to git (recovering S, + M, D) #
The Alpine .sif image is only ~3 MB — small enough to commit directly
to the git repository. Now the container travels with the code and data.
No network access needed to reproduce.
#!/bin/sh
# Grocery receipt: fully self-contained with container in git
set -eux
PS4='> '
cd "$(mktemp -d "${TMPDIR:-/tmp}/grocery-XXXXXXX")"
# --- build the container image from a pinned digest ---
singularity pull env.sif docker://alpine@sha256:a8560b36e8b8210634f77d9f7f9efd7ffa463e380b75e2e74aff4511df3ef88c
git init grocery-analysis
cd grocery-analysis
git config user.email "demo@example.com"
git config user.name "Demo User"
# --- commit the container image into the repository ---
cp ../env.sif .
git add env.sif
git commit -m "Add Alpine container image (3 MB, pinned by digest)"
# --- raw data as a git submodule (modularity) ---
(
cd ..
git init --bare raw-data.git
git clone raw-data.git raw-data-work
cd raw-data-work
git config user.email "demo@example.com"
git config user.name "Demo User"
cat > prices.csv <<'EOF'
item,price
apples,1.50
bread,2.30
milk,3.20
EOF
git add prices.csv
git commit -m "Add grocery prices"
git push
)
# In a real project, use a proper URL (https://... or git@...:...).
# For this local demo, we must allow the file:// transport
# (restricted by default since Git 2.38.1, CVE-2022-39253).
git -c protocol.file.allow=always submodule add ../raw-data.git raw-data
# --- analysis script ---
cat > sum-prices.sh <<'SCRIPT'
#!/bin/sh
set -eu
export LC_ALL=C
awk -F, 'NR>1 {sum+=$2} END {printf "%.2f\n", sum}' raw-data/prices.csv > total.txt
SCRIPT
chmod +x sum-prices.sh
# --- Makefile: run inside the local container ---
cat > Makefile <<'MF'
.POSIX:
SIF = env.sif
all: total.txt
total.txt: raw-data/prices.csv sum-prices.sh $(SIF)
	singularity exec --cleanenv $(SIF) ./sum-prices.sh
clean:
	rm -f total.txt
.PHONY: all clean
MF
cat > .gitignore <<'GI'
total.txt
GI
cat > README.md <<'README'
# Grocery Receipt Analysis
Run `make` to produce `total.txt` from raw price data.
The analysis runs inside an Alpine Linux container (`env.sif`)
that is committed to this repository — no network access needed.
Raw data lives in the `raw-data/` git submodule.
git clone --recurse-submodules <url>
make
Requires: POSIX sh, make, singularity (or apptainer).
README
git add -A
git commit -m "Add analysis script, Makefile, and README"
# --- run it ---
make
echo "=== Total ==="
cat total.txt
echo ""
echo "=== Repository structure ==="
git submodule status
git log --oneline
This recovers the full STAMPED stack using only git, make, and
singularity — no specialized research data management tools required:
| Property | How it is realized |
|---|---|
| S — Self-contained | Container image (env.sif), analysis script, and Makefile are all committed to git. Raw data is pinned via a git submodule at a specific commit. git clone --recurse-submodules + make is all anyone needs. |
| T — Tracked | Git records every change to code, data (in the submodule), and even the container image. The Makefile declares the exact dependency graph. |
| A — Actionable | make re-derives results from source. The README.md tells a collaborator exactly what to run. |
| M — Modular | Raw data is a separate git repository included as a submodule — reusable in other projects, versioned independently. |
| P — Portable | The container pins the awk implementation; POSIX shell + LC_ALL=C pins the script behavior. |
| E — Ephemeral | The entire analysis runs in a temp directory built from scratch. |
| D — Distributable | Standard git push to any remote. The repository can be pushed to multiple hosts (GitHub, GitLab, institutional server) simultaneously. For archival, git bundle creates a single-file snapshot of the entire history. |
The progression across all four scenarios illustrates a general pattern: each STAMPED property you add removes a class of failure, but introducing an external dependency (the container) can remove properties you already had (self-containment) unless you provision for it explicitly.
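The `git bundle` archival route mentioned for D can be sketched on a throwaway repository (not the grocery project; names here are illustrative):

```shell
#!/bin/sh
# Sketch: archive a repository's history as one file, then restore it offline.
set -eu
cd "$(mktemp -d)"
git init demo
cd demo
git config user.email "demo@example.com"
git config user.name "Demo User"
echo 'hello from the bundle' > README.md
git add README.md
git commit -m "snapshot"
branch=$(git symbolic-ref --short HEAD)   # default branch name varies (main/master)
git bundle create ../demo.bundle --all    # single file, full history
git bundle verify ../demo.bundle          # self-check: is the bundle complete?
cd ..
git clone -b "$branch" demo.bundle restored   # clones from the file, no network
cat restored/README.md
```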
Beyond git: scaling with git-annex and DataLad #
For projects where the data or container images outgrow what is practical to commit to git directly, tools like git-annex or DataLad extend this pattern with content-addressed storage and multi-remote availability tracking. The same dataset can be distributed to GitHub, Figshare (with a DOI), S3, or institutional archives, and the availability information (which remotes hold which files) travels with the dataset, so a fresh clone can assemble itself from whichever sources are reachable.
In particular,
datalad-container simplifies
container management within DataLad datasets: it maintains a local catalog of
container images (tracked by git-annex), and its datalad containers-run
command records which container was used for each computation — adding
container identity to the provenance chain automatically.
For neuroimaging and other scientific domains, ReproNim/containers provides a ready-made DataLad dataset of popular containerized tools (FreeSurfer, fMRIPrep, BIDS Apps, etc.). It is itself a STAMPED research object: a modular collection that can be included as a git submodule or DataLad sub-dataset, providing portable access to pinned container versions without each project having to manage its own images.