Best practices for git submodules

Git submodules seem to be a contentious topic among expert git users, while beginner git users either desperately cope with them or are typically glad that they can avoid them.

I’ve been using git submodules for many years in a large software development environment that contained embedded C code, Java-based desktop tools, Python code, various scripts and the like, all with intricate dependencies between them. I’m firmly convinced that submodules are a great tool and that, when used correctly, they can contribute greatly to the dependency management of your software project.

However, they don’t come without hiccups. The interface to git submodules is a bit quirky. They often end up in a “dirty” state from the super-module’s point of view for seemingly unknown reasons. Merging branches with different submodule pointers can turn into a mess, and merging branches that have different submodules altogether definitely will. Junior developers will inevitably start running around with their HEADs detached.

So basically: with great power comes great responsibility.

This post assumes that you are very familiar with submodules. It’s not intended as any kind of tutorial; it merely discusses best practices as I’ve adopted them.

I try to follow very strict rules when it comes to the use of submodules:

1. When a repo A is declared a submodule of super-repo B, that should mean ONE and ONLY ONE thing: B depends on A.
   Submodules should not be used to break up repos just because they have grown too large, nor to create tidy directory taxonomies or for similar reasons. The submodule relationship has a very strict meaning of dependency, and nothing else.
   Therefore, whenever a developer updates any submodule pointer at any level, this action has a well-defined meaning: “I, the committer, have tested that the stated version of the submodule works with the current version of the containing module.”
   There is no implied meaning for levels further up, since those are out of scope and out of the committer’s control at the layer where this statement is made.
2. When super-repo appS contains submodules modA and modB, and both modA and modB contain a submodule libX, super-repo appS will contain two copies of submodule libX, two levels down. That is not a problem, that is simply a fact, and an opportunity to do some sanity checks!
   At the level of appS, you typically need a single version of libX. Therefore you should add libX as a direct submodule of appS as well. This one version should be used to build both modA and modB, and the libX submodules under modA and modB should be ignored. In case of a version mismatch, you can detect it early in your continuous integration build system, and simply stop the build with a clear error message saying that modA and modB require two different versions of libX. A developer should then look at this and make sure that modA and modB, on the branches that are being used together, both point to the same libX commit.
   Note that this way, we have turned a possible problem with having multiple copies of the same submodule into an opportunity to do some more sanity checks before the build!
   It allows a build to verify that modA developers AND modB developers are in fact using the same version of libX. If not, then it’s good to know early, rather than wait for the build to explode later or, even worse, succeed in a subtly broken state.
3. A consequence of 2: you shouldn’t use --recursive when dealing with submodules unless you really know why you are doing it. In the above example, appS has everything it needs as direct submodules: modA, modB and libX. There is no need for recursive actions on submodules at that level. If you are working on modA independently, without any awareness of appS, then you may want to clone modA separately somewhere else anyway, since you’re operating within a different context. Or maybe in this case you do want to init the submodules of modA. They don’t hurt; they simply give you the ability to operate on modA independently. You make it work on its own, then go back to appS and update the submodule to test the integration.
4. The build system should ALWAYS allow configurable locations for the individual modules. In the above example, when you’re building modA independently, as a standalone repo, you have to point the libX location to wherever you mounted the submodule inside modA.
   However, when you’re building modA within the context of appS, the build system needs to use the version of libX that is present directly under appS, ignoring the one that is mounted under modA. The one that is mounted under modA might not even be initialized if you have never done a recursive submodule init.
   So when you’re writing the Makefile, Gradle file, Ant file or whatever your build system uses for modA, do NOT lock the system into libX being present at the location where it’s mounted as a submodule. You can use that as a default, but you have to be able to specify an alternative location. Ideally your build system will follow the modern approach of using an artifact depot somewhere, where everything is built by taking dependencies from the depot, building a unit, and then publishing it back to the depot.
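The libX consistency check described above (modA and modB must pin the same libX commit that appS pins) could be sketched as a small pre-build script. This is only a sketch under assumptions: the function names are hypothetical, the submodules are assumed to be mounted at the paths modA, modB and libX directly under appS, and modA and modB are assumed to be initialized. It reads the pinned commits with git ls-tree, so the nested libX copies do not have to be initialized.

```python
import subprocess
import sys
from pathlib import Path


def parse_gitlink(ls_tree_output: str) -> str:
    """Extract the pinned commit from `git ls-tree` output for one path.

    A submodule entry looks like: "160000 commit <sha>\t<path>".
    """
    line = ls_tree_output.strip()
    if not line:
        raise ValueError("path is not tracked in this commit")
    mode, obj_type, sha = line.split(None, 3)[:3]
    if mode != "160000" or obj_type != "commit":
        raise ValueError(f"not a submodule entry: {line!r}")
    return sha


def pinned_commit(repo: Path, submodule_path: str) -> str:
    """Return the commit that `repo` pins for `submodule_path` at HEAD."""
    result = subprocess.run(
        ["git", "-C", str(repo), "ls-tree", "HEAD", "--", submodule_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gitlink(result.stdout)


def find_mismatches(pins: dict[str, str]) -> dict[str, str]:
    """Return every entry whose pin differs from the one under "appS"."""
    expected = pins["appS"]
    return {name: sha for name, sha in pins.items() if sha != expected}


def check_libx(super_repo: Path) -> int:
    """Return 0 if all libX pins agree, 1 otherwise (usable as exit code)."""
    # Assumed layout: appS/libX, appS/modA/libX, appS/modB/libX.
    pins = {
        "appS": pinned_commit(super_repo, "libX"),
        "modA": pinned_commit(super_repo / "modA", "libX"),
        "modB": pinned_commit(super_repo / "modB", "libX"),
    }
    mismatches = find_mismatches(pins)
    for name, sha in mismatches.items():
        print(
            f"{name} pins libX at {sha}, but appS pins {pins['appS']}",
            file=sys.stderr,
        )
    return 1 if mismatches else 0
```

A CI step could call sys.exit(check_libx(Path("."))) from the root of appS before the build starts, so a mismatch stops the build with a readable message instead of surfacing later.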

I believe that if you set out such best practices, submodules can be used as a great tool, and not as the nightmare that some projects turn them into.