Philipp Stephan

Quitting pip: How we use git submodules to manage internal dependencies that require fast iteration

This post is a write-up of my talk at PyCon ’22. I will explain how we distribute our internal libraries, what we tried in the past, and what we ended up using now. At my company medaire, our main product is set up as a dockerized microservice architecture. Keep that in mind, as a lot of these experiences are specific to our kind of setup; hopefully they will be helpful for other kinds of setups as well.

At first we started out with just a handful of components, say: retrieve data, process (ML), generate report, and finally send that out. These would have some external dependencies like numpy, matplotlib, or sqlalchemy, which we store in a requirements.txt and pull from PyPI, the official Python Package Index.
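
As a concrete (if simplified) sketch, such a requirements.txt might look like this; the version pins are placeholders, not our actual ones:

    # requirements.txt -- external dependencies, pulled from the public PyPI
    numpy==1.22.3
    matplotlib==3.5.1
    sqlalchemy==1.4.36

In a dockerized setup like ours, each service typically installs these during the image build with a plain pip install -r requirements.txt.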

Over time, we added more and more components: a work queue manager, telemetry, and so on. These started to share more and more code, so it made sense to extract that into a common library. But where do we store that?

A short detour/rant: Python gives us a beautiful easter egg explaining some of its core design philosophy. Just type python -m this. And when it comes to packaging and package distribution, Python falls short of line 13 of that philosophy:

There should be one—and preferably only one—obvious way to do it.

While languages like Rust or Go have their canonical package manager built-in, in the Python ecosystem there are many options to choose from:

  • pip, the successor to easy_install, is almost official, but needs to be installed separately
  • Conda is very popular in data science and also handles many other kinds of packages; it was designed in part because pip can be pretty hard to set up for beginners
  • Poetry is the new kid on the block
  • and there are even more (a short comparison follows after this list)
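
To make the fragmentation a bit more tangible, here is the same third-party package added with three of these tools; requests is just an arbitrary example here:

    pip install requests      # installs into the active environment; pinning is up to you
    conda install requests    # resolves against conda channels rather than PyPI
    poetry add requests       # records the dependency in pyproject.toml and poetry.lock

Each tool has its own resolver, its own metadata files, and its own conventions, which is exactly why newcomers struggle to find the one obvious way.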

And, at least with pip, you can install packages in various formats from many different kinds of sources in a requirements.txt, for example (a sketch follows after this list):

  • local path like ../../mylib
  • a wheel, which can contain compiled extensions
  • an egg, the older format that wheels replaced
  • a git URL like git+ssh://git@github.com/user/repo.git@4398f09345a3004
  • a URL to a zip file of a package directory
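
Here is a hedged sketch of a requirements.txt mixing these source types; the names, paths, and URLs are made up, and in practice you would of course pick one source per package:

    # editable install from a local checkout
    -e ../../mylib
    # a pre-built wheel on disk
    ./wheels/mylib-1.2.0-py3-none-any.whl
    # a specific commit from a git repository
    mylib @ git+ssh://git@github.com/user/repo.git@4398f09345a3004
    # a zip archive of a package directory served over HTTP
    https://example.com/packages/mylib-1.2.0.zip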

And because you need to separate the dependencies of different projects, so that each project can install its own versions, you also need to use some sort of environment separation tool like

  • venv, which is in the standard library (a minimal example follows below)
  • but the Python Packaging Authority actually recommends using the virtualenv wrapper
  • but also recommends using pipenv
  • and Conda has its own mechanism, conda environments

So as you can see, this ecosystem can be quite confusing, especially for beginners.
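
Taking just the standard-library route as an example, a typical per-project setup looks roughly like this:

    # create an isolated environment inside the project directory
    python -m venv .venv
    # activate it (Linux/macOS; on Windows it is .venv\Scripts\activate)
    source .venv/bin/activate
    # install the project's pinned dependencies into that environment
    pip install -r requirements.txt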

So in the end, we decided that most of this logic was pretty generic and could very well be open source. We just put it in a public GitHub repository, used tags for versioning, and pointed our requirements.txt to the git URL as shown above. We did not need to set up any kind of authentication to GitHub on our workstations or CI, because the code was out in the open anyway.
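
In practice that meant requirements.txt entries along these lines; the library name and tag are placeholders, but the pattern of pinning a public HTTPS git URL to a release tag is the point:

    # shared, non-confidential helper library, pinned to a release tag
    ourlib @ git+https://github.com/our-org/ourlib.git@v1.3.0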

That worked pretty well for a while, but then we added more classification components that contained confidential business logic. These components also contained shared code that would be worth extracting into a library, but we did not want to make that one open source. Our old solution was just not cutting it anymore.

We decided to stick with what we know: PyPI works great for external dependencies, so why not use it for internal dependencies? We set up our own private PyPI instance, built wheels for our internal libraries and pushed them there. Now we can add those to our requirements.txt just like the external ones. But over time, we found more and more problems with this approach.
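
The workflow looked roughly like the following sketch; the index URL is made up, and the exact build and upload tooling may differ from what we actually ran:

    # in the library repository: build a wheel and push it to the private index
    python -m build
    twine upload --repository-url https://pypi.internal.example/ dist/*

    # in the consuming project's requirements.txt: add the private index
    # as an extra source and pin the internal package like any other
    --extra-index-url https://pypi.internal.example/simple/
    ourlib==1.3.0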

  1. Authentication: We do not want our private PyPI to be accessible from the open internet; after all, that was the whole point of setting it up. So we put it behind a VPN. But now we need to set up the VPN on every workstation that needs to build our Docker images, and on top of that we have to set up the VPN connection on our CI. That is a lot of work, and you have to take care not to accidentally expose those credentials.
  2. Speed of iteration: We found that a private PyPI is well suited to serving stable versions of libraries; think of it as an archive of packages that you can go back to if you ever need to rebuild old image versions. But iterating on the library while working on a feature in another project can be a hassle: make some changes to the library, build a new WIP wheel, rebuild the project to pull the new version, find the typo, rinse and repeat. Afterwards, clean up all those WIP wheels... oh wait, I don’t have permission to delete here (which is a good thing). You see where I am going.
  3. Single point of failure: Now we have a critical service: if the private PyPI is down, no one can build images and we are all blocked. We had problems with the availability and performance of the instance, but we actually just want to write code, not host infrastructure.
  4. Version compatibility: In a lot of cases you need to rebuild your wheels when you upgrade your Python version, which can be a hassle, especially if you have to rebuild old legacy versions that some old project still depends on.
  5. Security: Dependency confusion attacks are very real, as demonstrated in a widely publicized blog post and subsequent news coverage. If someone uploads a package with the same name as one of your internal dependencies to the public PyPI, pip will silently favor that one (with default settings). That can be a route to inject malicious code into your codebase! A short illustration follows after this list.
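
To illustrate that last point: with an extra index configured as in the sketch above, pip considers both indexes and picks whichever matching version it deems best, so a public package sharing a name with an internal one can win the resolution. The hostnames below are made up:

    # risky: a malicious "ourlib" on the public PyPI with a higher version
    # number can be picked over the internal one
    pip install --extra-index-url https://pypi.internal.example/simple/ ourlib

    # safer: make the private index the only index (and let it proxy PyPI
    # for external packages), plus pin exact versions
    pip install --index-url https://pypi.internal.example/simple/ ourlib==1.3.0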

After trying out different approaches to handle this problem, we settled on not using a package manager for our internal dependencies at all, but instead including them explicitly, directly in the project’s repository. And git submodules give us a way to do exactly that.
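
In broad strokes, a setup like this looks roughly as follows; the repository names and paths are hypothetical, not our exact layout:

    # add the internal library as a submodule inside the project repository
    git submodule add git@github.com:our-org/ourlib.git libs/ourlib

    # when cloning the project (e.g. in CI or on a fresh workstation)
    git clone --recurse-submodules git@github.com:our-org/project.git
    # or, in an existing clone:
    git submodule update --init --recursive

    # requirements.txt then simply points at the local path
    -e ./libs/ourlib

A submodule pins an exact commit of the library in the parent repository, which takes over the role of a version pin, and because the library’s code sits right in the working tree, you can edit and test it immediately without building or uploading anything.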

Please check back in a couple of days for the complete version.