Git Submodules for Modular Dataset Composition
The problem: monolithic datasets resist reuse
Research projects rarely work with a single, self-contained blob of data. A typical neuroimaging study might use a standard brain atlas maintained by one group, a set of stimuli curated by another, and raw scanner output that is unique to the study. When all of this is dumped into a single repository, several problems emerge:
- No independent versioning. The atlas is at version 2.3, but there is no record of that inside the monolithic repo – just a snapshot of files. When the atlas releases version 2.4 you cannot cleanly upgrade.
- No reuse across projects. A colleague running a different study that needs the same atlas cannot pull it from your project without manually copying files. Two copies now drift independently.
- Bloated history. Every project that embeds the atlas carries a full copy of its history (or, worse, no history at all). Cloning becomes slow and storage costs multiply.
The root cause is that a flat directory tree conflates composition (assembling components into a project) with ownership (maintaining each component).
The solution: git submodules separate composition from ownership
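Git submodules let you nest one Git repository inside another. The parent
repository records which child repository to include and at which
commit, but the child retains its own history and its own remote. (In
current Git versions the child's working tree holds a .git file that points
to its repository data under the parent's .git/modules/ directory, rather
than a full .git directory.) This is exactly the separation we need: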
- The parent (your research project) controls the composition.
- Each child (atlas, stimuli, raw data) controls its own content and version history.
This maps directly to the YODA principle that the inputs/ directory of a
dataset should contain independently versioned subdatasets rather than loose
copies of external data.
Step-by-step walkthrough
1. Create the parent project
Start with a fresh repository that will serve as the top-level research project:
mkdir my-study && cd my-study
git init
Create the YODA-style directory skeleton:
mkdir -p code inputs outputs
Add a minimal README and commit:
cat > README.md << 'EOF'
# My Study
Research project following YODA conventions.
- `code/` -- analysis scripts (tracked directly)
- `inputs/` -- input datasets (git submodules)
- `outputs/` -- results (ephemeral, regenerable)
EOF
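# Git does not track empty directories, so give each one a placeholder file.
touch code/.gitkeep inputs/.gitkeep outputs/.gitkeep
git add README.md code inputs outputs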
git commit -m "Initialize project skeleton"
2. Add an external dataset as a submodule
Suppose the brain atlas lives in its own repository on GitHub. Add it
as a submodule under inputs/:
git submodule add https://github.com/example-org/brain-atlas.git inputs/brain-atlas
Git does three things here:
- Clones brain-atlas into inputs/brain-atlas/.
- Creates (or updates) a .gitmodules file at the project root recording the URL and local path.
- Stages a special “gitlink” entry that records the exact commit SHA of the submodule.
Inspect what changed:
git status
# On branch main
# Changes to be committed:
# new file: .gitmodules
# new file: inputs/brain-atlas
The .gitmodules file looks like this:
[submodule "inputs/brain-atlas"]
path = inputs/brain-atlas
url = https://github.com/example-org/brain-atlas.git
Commit the addition:
git commit -m "Add brain-atlas v2.3 as input submodule"
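To see what the parent actually stores for a submodule, you can inspect the gitlink entry directly. The following is a minimal sketch under stated assumptions: it replaces the GitHub repository with a throwaway local repository in a temporary directory, and the file name labels.tsv and the commit identity are placeholders chosen only for illustration.

```shell
set -eu
# Identity is set via the environment only so the sketch runs without a global Git config.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

work=$(mktemp -d)
cd "$work"

# Hypothetical local stand-in for the remote brain-atlas repository.
git init -q brain-atlas
echo "region labels" > brain-atlas/labels.tsv
git -C brain-atlas add labels.tsv
git -C brain-atlas commit -q -m "Add labels"

# Parent project with the atlas added as a submodule.
git init -q my-study
cd my-study
# protocol.file.allow=always is needed only because this sketch clones from a local path.
git -c protocol.file.allow=always submodule --quiet add "$work/brain-atlas" inputs/brain-atlas
git commit -q -m "Add brain-atlas as input submodule"

# The parent stores the submodule as one tree entry of mode 160000 (a gitlink),
# not as the submodule's files.
entry=$(git ls-tree HEAD inputs/brain-atlas)
echo "$entry"

# The SHA recorded by the parent matches the commit checked out in the submodule.
recorded=$(git rev-parse HEAD:inputs/brain-atlas)
checked_out=$(git -C inputs/brain-atlas rev-parse HEAD)
echo "recorded:    $recorded"
echo "checked out: $checked_out"
```

The parent's history grows by one small entry per submodule, regardless of how large the submodule's own content is.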
3. Add a second submodule
Add a stimulus set the same way:
git submodule add https://github.com/example-org/visual-stimuli.git inputs/visual-stimuli
git commit -m "Add visual-stimuli as input submodule"
4. Resulting directory structure
After these steps, the project looks like this:
my-study/
.git/
.gitmodules
README.md
code/
analyze.py # tracked directly in the parent
inputs/
brain-atlas/ # submodule -> github.com/example-org/brain-atlas @ abc1234
atlas.nii.gz
labels.tsv
README.md
visual-stimuli/ # submodule -> github.com/example-org/visual-stimuli @ def5678
stim_001.png
stim_002.png
metadata.json
outputs/
(empty, will hold results)
The key insight: inputs/brain-atlas/ is a complete Git repository with
its own history. You can cd inputs/brain-atlas && git log to see the
atlas’s full commit history, completely independent of the parent project.
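The independence of the two histories can be checked by counting commits on each side. This is a small sketch, assuming a throwaway local repository stands in for the atlas; the file name, commit messages, and identity are illustrative only.

```shell
set -eu
# Identity is set via the environment only so the sketch runs without a global Git config.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

work=$(mktemp -d)
cd "$work"

# Hypothetical atlas repository with two commits of its own.
git init -q brain-atlas
echo "region A" > brain-atlas/labels.tsv
git -C brain-atlas add labels.tsv
git -C brain-atlas commit -q -m "Add region A"
echo "region B" >> brain-atlas/labels.tsv
git -C brain-atlas commit -q -am "Add region B"

# Parent project with a single commit that adds the submodule.
git init -q my-study
cd my-study
# protocol.file.allow=always is needed only because this sketch clones from a local path.
git -c protocol.file.allow=always submodule --quiet add "$work/brain-atlas" inputs/brain-atlas
git commit -q -m "Add brain-atlas as input submodule"

# Each repository reports only its own commits.
parent_commits=$(git rev-list --count HEAD)
atlas_commits=$(git -C inputs/brain-atlas rev-list --count HEAD)
echo "parent commits: $parent_commits"
echo "atlas commits:  $atlas_commits"
git -C inputs/brain-atlas log --oneline
```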
5. Cloning a project that uses submodules
When a collaborator clones your project, submodule directories will exist but will be empty by default. They need one extra step:
git clone https://github.com/you/my-study.git
cd my-study
git submodule update --init
Or, to do both in one command:
git clone --recurse-submodules https://github.com/you/my-study.git
This fetches the parent and then checks out each submodule at the exact commit recorded by the parent.
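The difference between a plain clone and an initialized one is visible in git submodule status. The sketch below assumes local throwaway repositories in place of the GitHub URLs; the directory name collaborator-copy and the file labels.tsv are hypothetical.

```shell
set -eu
# Identity is set via the environment only so the sketch runs without a global Git config.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

work=$(mktemp -d)
cd "$work"

# Hypothetical atlas repository and a parent project that includes it.
git init -q brain-atlas
echo "region labels" > brain-atlas/labels.tsv
git -C brain-atlas add labels.tsv
git -C brain-atlas commit -q -m "Add labels"

git init -q my-study
# protocol.file.allow=always is needed only because this sketch clones from a local path.
git -C my-study -c protocol.file.allow=always submodule --quiet add "$work/brain-atlas" inputs/brain-atlas
git -C my-study commit -q -m "Add brain-atlas as input submodule"

# A plain clone creates the submodule directory but leaves it empty.
git clone -q my-study collaborator-copy
cd collaborator-copy
status_before=$(git submodule status)   # a leading "-" marks an uninitialized submodule
before=$(ls -A inputs/brain-atlas | wc -l)

# Initialize and check out the submodule at the commit recorded by the parent.
git -c protocol.file.allow=always submodule --quiet update --init
status_after=$(git submodule status)
after=$(ls -A inputs/brain-atlas | wc -l)

echo "before init: $status_before ($before entries)"
echo "after init: $status_after ($after entries)"
```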
6. Updating a submodule to a newer version
When the atlas releases version 2.4, you can update the submodule pointer:
cd inputs/brain-atlas
git fetch
git checkout v2.4 # or: git pull origin main
cd ../..
git add inputs/brain-atlas
git commit -m "Update brain-atlas to v2.4"
The parent now records the new commit SHA. Anyone who pulls this parent
commit and then runs git submodule update will get the updated atlas. The
old version is still accessible via the parent’s history – just check out
the previous parent commit and run git submodule update again.
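The update and the rollback can be tried end to end without a network connection. This is a sketch under assumptions: a local repository stands in for the atlas, the tags v2.3 and v2.4 are created by the script itself to mirror the versions named above, and the file contents are placeholders.

```shell
set -eu
# Identity is set via the environment only so the sketch runs without a global Git config.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

work=$(mktemp -d)
cd "$work"

# Hypothetical atlas repository, released as v2.3.
git init -q brain-atlas
echo "labels v2.3" > brain-atlas/labels.tsv
git -C brain-atlas add labels.tsv
git -C brain-atlas commit -q -m "Release 2.3"
git -C brain-atlas tag v2.3

# Parent pins the atlas at v2.3.
git init -q my-study
cd my-study
# protocol.file.allow=always is needed only because this sketch clones from a local path.
git -c protocol.file.allow=always submodule --quiet add "$work/brain-atlas" inputs/brain-atlas
git commit -q -m "Add brain-atlas v2.3 as input submodule"

# Upstream publishes v2.4.
echo "labels v2.4" > "$work/brain-atlas/labels.tsv"
git -C "$work/brain-atlas" commit -q -am "Release 2.4"
git -C "$work/brain-atlas" tag v2.4

# Move the submodule to v2.4 and record the new pointer in the parent.
git -C inputs/brain-atlas fetch -q --tags
git -C inputs/brain-atlas checkout -q v2.4
git add inputs/brain-atlas
git commit -q -m "Update brain-atlas to v2.4"
new=$(cat inputs/brain-atlas/labels.tsv)

# Roll back: check out the previous parent commit, then resync the submodule.
git checkout -q HEAD~1
git submodule --quiet update
old=$(cat inputs/brain-atlas/labels.tsv)

echo "after update:   $new"
echo "after rollback: $old"
```

Checking out an older parent commit does not move the submodule by itself; the git submodule update step is what brings the submodule's working tree back in line with the recorded pointer.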
Connection to YODA principles
The YODA layout convention places input data under inputs/ and analysis
code under code/. Git submodules implement the Modularity principle
for the inputs/ directory:
| YODA directory | Tracked how? | Why? |
|---|---|---|
| code/ | Directly in the parent repo | Code is authored by the project team |
| inputs/ | As submodules (or subdatasets) | Input data is maintained by external parties |
| outputs/ | Ignored or ephemeral | Results are regenerable from code + inputs |
This separation means you can:
- Pin your analysis to a specific version of each input.
- Upgrade an input independently without touching code or other inputs.
- Share an input dataset across projects without copying it.
- Credit the maintainers of each input by pointing to their repository.
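One way to make the pinning concrete is to list the commit the parent records for each input. The sketch below is illustrative only: both input repositories are local placeholders created by the script, and their contents are invented for the example.

```shell
set -eu
# Identity is set via the environment only so the sketch runs without a global Git config.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

work=$(mktemp -d)
cd "$work"

# Hypothetical stand-ins for the two external input repositories.
for name in brain-atlas visual-stimuli; do
  git init -q "$name"
  echo "placeholder content for $name" > "$name/README.md"
  git -C "$name" add README.md
  git -C "$name" commit -q -m "Initial commit"
done

git init -q my-study
cd my-study
for name in brain-atlas visual-stimuli; do
  # protocol.file.allow=always is needed only because this sketch clones from a local path.
  git -c protocol.file.allow=always submodule --quiet add "$work/$name" "inputs/$name"
done
git commit -q -m "Add input submodules"

# For each submodule, print its path and the commit SHA recorded by the parent.
# $sm_path and $sha1 are variables that git submodule foreach provides.
pins=$(git submodule --quiet foreach 'echo "$sm_path $sha1"')
echo "$pins"
```

Each line of the output identifies one input and the exact commit the analysis depends on, which is the information needed to reproduce or cite that input.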
Limitations and when to prefer DataLad subdatasets
Git submodules are a built-in Git feature and require no additional tools, which makes them a good starting point. However, they have limitations that become significant at scale:
- No large-file support. Git submodules do not change how Git handles file content. If your atlas contains large binary files, each clone downloads the full history of those files. Git-annex or Git LFS is needed to manage large data efficiently.
- Manual management. Adding, updating, and removing submodules requires several commands and careful attention to .gitmodules and .git/config. It is easy to leave a project in an inconsistent state.
- No partial fetch. You cannot easily fetch only a subset of a submodule’s files. For large datasets where you only need a slice, this is wasteful.
- No recursive save. Each submodule must be committed independently, bottom-up, before the parent can record the new state. Pushing can recurse with git push --recurse-submodules=on-demand, but committing has no such option. In a deeply nested hierarchy this becomes tedious.
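The bottom-up order can be sketched as two separate commits, first in the submodule and then in the parent. The example assumes a local throwaway repository in place of the real atlas; file names and commit messages are placeholders.

```shell
set -eu
# Identity is set via the environment only so the sketch runs without a global Git config.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

work=$(mktemp -d)
cd "$work"

# Hypothetical atlas repository and a parent project that includes it.
git init -q brain-atlas
echo "region A" > brain-atlas/labels.tsv
git -C brain-atlas add labels.tsv
git -C brain-atlas commit -q -m "Add region A"

git init -q my-study
cd my-study
# protocol.file.allow=always is needed only because this sketch clones from a local path.
git -c protocol.file.allow=always submodule --quiet add "$work/brain-atlas" inputs/brain-atlas
git commit -q -m "Add brain-atlas as input submodule"

# Change a file inside the submodule. The parent reports the submodule as
# modified, but a commit in the parent cannot capture the file change itself.
echo "region B" >> inputs/brain-atlas/labels.tsv
git status --short

# Step 1 (bottom): commit inside the submodule.
git -C inputs/brain-atlas commit -q -am "Add region B"

# Step 2 (top): record the submodule's new commit in the parent.
git add inputs/brain-atlas
git commit -q -m "Update brain-atlas pointer"

recorded=$(git rev-parse HEAD:inputs/brain-atlas)
sub_head=$(git -C inputs/brain-atlas rev-parse HEAD)
echo "parent records $recorded"
```

With one level of nesting this is two commits; every additional level adds another repository that has to be committed before its parent.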
DataLad subdatasets build on Git submodules but solve these problems by
integrating git-annex for large-file management and providing commands
like datalad save (recursive commit across all nesting levels),
datalad get (on-demand file retrieval), and datalad push (recursive
push). If your project involves large files or deep nesting, DataLad
subdatasets are the natural next step from plain Git submodules.
Summary
| Aspect | Plain Git submodule | DataLad subdataset |
|---|---|---|
| Tooling required | Git only | Git + DataLad (+ git-annex) |
| Large file handling | None (full clone) | git-annex (on-demand fetch) |
| Recursive operations | Manual per submodule | datalad save, datalad push |
| Metadata integration | .gitmodules only | .datalad/config, structured metadata |
| Best for | Small/medium text-heavy repos | Any size, especially large data |
Start with Git submodules if your data is small and your nesting is shallow. Graduate to DataLad subdatasets when scale or convenience demands it. Either way, the underlying principle is the same: compose your project from independently versioned, reusable components.