How to do Python development without tears
Many machine learning projects are written in Python. Many of these projects are written in a way that makes it difficult for individuals to recreate a known working development environment. This is a pity, because it slows down scientific progress; peers find it difficult to confirm results and collaborators struggle to extend ideas in novel ways. Following some simple guidelines can dramatically help the adoption of your project and ideas. Your results will be verified and extended by others, your projects will gain adoption, and your set of potential collaborators will, in turn, grow.
Where we have previously focused on writing extensible software, here we focus on the ease of “getting started” and being able to quickly create a known-to-be-working development environment. (Notably, we will not be focusing on exact reproducibility. Reproducibility is larger in scope, harder, and deserves its own blog post.) I.e., is there a fairly easy way to capture direct and transitive dependencies for a Python project that allows one to (re-)create a development environment with little effort?
By the end of the post you will understand how to effectively leverage tools such as pdm and pdm-conda to make your life (and those of your collaborators and users) easier.
Let’s dig in!
The scene
Our story, like all stories, begins at the beginning. We have the following characters:
- Alice: A data scientist who heavily uses Python. She has developed a number of projects (not all of which use the same Python version) to help other data scientists train and use machine learning models. At any given moment she usually has several projects that she is actively maintaining. Alice will also routinely go back to her projects from a few months or years ago to extend them. As such, it is necessary for Alice to be able to work on multiple projects (each with their own set of dependencies) simultaneously. It is also necessary for her to quickly create a previously known-to-be-working development environment. (Alice relies on the successful evaluation of the test suites in her projects to confirm that for her. I.e., what she ultimately cares about is an environment that successfully runs code, which doesn’t necessarily require creating a known-to-be-working environment. However, the latter can often be a cost-effective way of accomplishing the former.)
- Alice’s latest project: A machine learning project based on PyTorch that also uses PostgreSQL. It comes with a test suite, the successful evaluation of which is necessary to confirm that the project is working as expected and is sufficient to confirm that the project is being run in an environment that is correctly configured. The project also comes with a Jupyter notebook with some performance benchmarks and detailed profiling results.
- Bob: Alice’s colleague and fellow data scientist. Alice and Bob frequently collaborate on projects. Bob is especially skilled at performance tuning and has helped Alice optimize CUDA kernels in her projects in the past. As such, it is necessary for Bob to be able to evaluate the Jupyter notebooks (in addition to the test suite) for the project.
- Charlie: A user (or, perhaps, reviewer) of Alice’s latest machine learning project. Charlie doesn’t intend to collaborate on the project with Alice and Bob, but is grateful to them for their developmental efforts. Charlie wants to evaluate Alice’s project on his own “private” data. As such, it’s necessary for him to be able to successfully evaluate the test suite first and foremost. (After all, if he’s unable to even do that, is there much point in him trying to go further?)
Alice is about to start development on a Python 3.12 project (3.12 being the release with support for a per-interpreter GIL) that has both PyPI and conda dependencies.
She wants to be able to configure the project to her liking, without constraining other developers (such as Bob), while also helping users (such as Charlie) quickly get started.
We will assume that everyone involved already has appropriate versions of pdm and pdm-conda installed. (I.e., Alice has taken the time to note this in the development instructions of her project.)
If you have mamba installed, you may run the commands below to get a similar environment. (In a telling turn of events, some time between when the first draft of this post was written and when it was published, updated versions of pdm and pdm-conda were released that were mutually incompatible.)
mamba install -n base --force-reinstall "conda-forge::pdm==2.15.4"
mamba install -n base --force-reinstall --update-deps "conda-forge::pdm==2.15.4"
pdm self add pdm-conda==0.18.1
One may confirm that the above commands have worked as intended using the below:
which pdm
pdm --version
pdm self list 2>/dev/null | grep pdm-conda
/home/user/miniconda3/bin/pdm
PDM, version 2.15.4
pdm-conda 0.18.1 A PDM plugin to resolve/install/uninstall
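The same confirmation can be scripted. The following is a minimal sketch, assuming the version line has the form shown in the output above; parse_pdm_version and EXPECTED are hypothetical names used for illustration and are not part of pdm.

```python
import re
import shutil
import subprocess

# Pinned version taken from the install commands in this post.
EXPECTED = (2, 15, 4)


def parse_pdm_version(output: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from text such as 'PDM, version 2.15.4'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    if shutil.which("pdm") is None:
        print("pdm is not on PATH")
    else:
        result = subprocess.run(["pdm", "--version"], capture_output=True, text=True)
        found = parse_pdm_version(result.stdout)
        print("OK" if found == EXPECTED else f"expected {EXPECTED}, found {found}")
```

Comparing tuples of integers rather than raw strings avoids surprises such as "2.9" sorting after "2.15".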
Alice’s perspective
Alice initializes her project.
cd /tmp
rm -rf /tmp/alice
mkdir -p alice && cd alice
git init
git commit --allow-empty -m "alice: init"
pdm init --runner mamba # using python 3.12 and defaults
git add .
git commit -m "alice: pdm init using python 3.12 and mamba"
Since this project will have both conda and PyPI dependencies, Alice also creates a conda environment that she’ll be using and makes pdm aware of it.
pdmcondaprofile="$(basename $(pdm venv create -cn alice | cut -d ' ' -f 2))"
pdm use --venv "${pdmcondaprofile}" # populate the .pdm-python file with the name of the environment
echo "${pdmcondaprofile}"
STATUS: Creating virtualenv using mamba...
Using Python interpreter: /home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/python (3.12)
alice-NLGZxbVC-3.12
Alice then installs dependencies for her project.
However, instead of using pip or conda (or mamba) directly, she installs them via pdm and the pdm-conda plugin.
Doing so ensures that the dependencies and their specific versions are noted in the project “lock” file. (For the benefit of her future self and others.)
git status # confirm it's clean
pdm add --conda "postgresql>12.0" # may need multiple invocations for success
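Since the command may need multiple invocations to succeed, one option is to wrap it in a small retry loop. What follows is only a sketch: run_with_retries is a hypothetical helper, not something pdm provides, and the limit of three attempts is an arbitrary assumption.

```python
import subprocess
import sys


def run_with_retries(command: list[str], attempts: int = 3) -> int:
    """Run command until it exits with status 0; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        if subprocess.run(command).returncode == 0:
            return attempt
    raise RuntimeError(f"{command!r} still failing after {attempts} attempts")


if __name__ == "__main__":
    # A stand-in command that always succeeds. For the step above one might
    # pass ["pdm", "add", "--conda", "postgresql>12.0"] instead.
    print(run_with_retries([sys.executable, "-c", "pass"]))
```

Raising after the final attempt, rather than returning silently, keeps a persistent failure visible.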
Having installed the dependencies, she confirms that they have been captured in the project “lock” file.
git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   pyproject.toml

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        pdm.lock

no changes added to commit (use "git add" and/or "git commit -a")
She adds the “lock” file to the git repository index and commits the changes.
git add pdm.lock pyproject.toml
git commit -m "alice: install: postgresql>12"
git status
Alice will be using PyTorch in this project. She installs this from PyPI and commits the changes to the repository.
pdm add "torch>=2.3.0" # ✓
git add pdm.lock pyproject.toml
git status
git commit -m "alice: install: torch>=2.3.0"
During development, Alice also prefers some additional utilities to be available. Since these are not central to the project, but may be helpful to other collaborators, she installs these as development-only dependencies.
pdm add --dev --conda "ipython"
pdm add --dev --conda "jupyterlab"
And then commits the changes.
git add pdm.lock pyproject.toml
git commit -m "alice: install: dev: ipython, jupyterlab"
git status
Thus far, Alice has all her dependencies installed and she has the ability to invoke the Python interpreter (and other utilities) in the correct environment via pdm run:
pdm run which python
pdm run which postgres
pdm run python -c "import torch; print(torch.__file__)"
pdm run which ipython
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/python
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/postgres
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/lib/python3.12/site-packages/torch/__init__.py
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/ipython
Note, however, that directly invoking python doesn’t do the “right” thing (i.e., “right” here means invoking the version installed in the conda environment with access to the installed dependencies); she has to prefix the command invocation with pdm run.
(Even if Alice were using Rye and had the associated shim installed, directly invoking python would not have worked as intended.
This is because she’s not using the project-local .venv, but is instead using an external conda environment.
As such, the Rye shim wouldn’t have automatically been able to correctly resolve things.)
which python
python -c "import torch; print(torch.__file__)"
/usr/bin/python
Traceback (most recent call last):
ModuleNotFoundError: No module named 'torch'
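The difference between the two invocations can be made explicit from within Python itself. This is a small sketch using only the standard library; describe_environment is a hypothetical helper name. Running it once with plain python and once via pdm run python should report different interpreters.

```python
import importlib.util
import sys


def describe_environment(module_name: str) -> dict[str, object]:
    """Report the running interpreter and whether module_name can be imported."""
    return {
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        # find_spec returns None for a top-level module that cannot be found.
        "importable": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    for key, value in describe_environment("torch").items():
        print(f"{key}: {value}")
```

Using find_spec checks availability without actually importing the module, which matters for something as heavy as torch.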
If Alice doesn’t want every command invocation to be prefixed with pdm run, she can accomplish that using direnv.
In her case she has direnv installed via Guix.
which direnv
/home/user/.guix-profile/bin/direnv
She creates a .envrc file in her project with the following contents:
if [ -f ".pdm-python" ]; then
    cmd="$(pdm venv activate 2>/dev/null)"
    if [ conda = "$(echo $cmd | cut -d ' ' -f 1)" ]; then
        # <https://github.com/direnv/direnv/issues/326#issuecomment-574779485>
        eval "$(conda shell.bash hook)"
    fi
    eval "$cmd"
else
    echo "Missing .pdm-python; ensure pdm has been initialized"
fi
She then “allows” the .envrc file so that it may be automatically evaluated.
direnv allow .
Now, running python directly, anywhere within the project directory (including a sub-directory of it, and only when so), automatically invokes the correct Python interpreter.
Prefixing the command invocation with pdm run is no longer necessary.
which python
python -c "import torch; print(torch.__file__)"
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/python
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/lib/python3.12/site-packages/torch/__init__.py
Since Alice’s choice to use direnv is a personal one, she doesn’t subject others to it.
She adds the .envrc file to .gitignore.
echo -e "\n.envrc" >> .gitignore
git add .gitignore
git commit -m "alice: gitignore: ignore .envrc"
Having done the initial setup to her liking, Alice continues developing, testing, and extending her project. She takes care to ensure that every time she needs to add or alter a dependency:
- she does so via pdm, and
- subsequently commits the changes to the pyproject.toml and pdm.lock files to git.
The above allows her to track her project dependencies with the project itself. It also allows her to track the dependencies in sufficient detail to allow her future self to recreate a previously known-to-be-working environment. As we’ll shortly see, doing the above also allows her to serve the needs of collaborators like Bob and users like Charlie at no extra cost.
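This discipline is easy to check mechanically. The sketch below assumes the short format printed by git status --porcelain (two status characters, a space, then the path) and ignores renames; uncommitted_dependency_files is a hypothetical helper name.

```python
import shutil
import subprocess

DEPENDENCY_FILES = ("pyproject.toml", "pdm.lock")


def uncommitted_dependency_files(porcelain: str) -> list[str]:
    """Return the dependency files that appear in `git status --porcelain` output."""
    changed = set()
    for line in porcelain.splitlines():
        # Each entry is two status characters, a space, then the path.
        path = line[3:]
        if path in DEPENDENCY_FILES:
            changed.add(path)
    return [name for name in DEPENDENCY_FILES if name in changed]


if __name__ == "__main__":
    if shutil.which("git") is None:
        print("git is not on PATH")
    else:
        status = subprocess.run(
            ["git", "status", "--porcelain"], capture_output=True, text=True
        )
        pending = uncommitted_dependency_files(status.stdout)
        print("uncommitted:", ", ".join(pending) if pending else "none")
```

A check along these lines could be run before pushing, so that a changed pyproject.toml never travels without its matching pdm.lock.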
Bob’s perspective
Bob is Alice’s colleague, and he wants to be able to collaborate with Alice. He clones Alice’s repository so he can start working on it locally.
git clone /tmp/alice bob
cd bob
With a fresh checkout, this is what Bob’s view looks like. (Note, in particular, the absence of the .pdm-python file.)
. .. .git .gitignore pdm.lock pdm.toml pyproject.toml README.md src tests
Bob then proceeds to create a conda environment. (Alice, graciously, has noted the instructions in the README.)
pdmcondaprofile="$(basename $(pdm venv create -cn bob | cut -d ' ' -f 2))"
pdm use --venv "${pdmcondaprofile}"
echo "${pdmcondaprofile}"
STATUS: Creating virtualenv using mamba...
Using Python interpreter: /home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/python (3.12)
bob-i2DthxFJ-3.12
Doing so creates a .pdm-python file with the correct entry.
/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/python
Finally, he installs the dependencies from the lock file.
Since Bob will be contributing to the project, he installs both the development and the production dependencies. (This is the default behaviour of pdm install.)
pdm install # may need multiple invocations for success
And now he has access to both Conda and PyPI dependencies.
pdm run which postgres
pdm run which python
pdm run python -c "import torch; print(torch.__file__)"
pdm run which ipython
/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/postgres
/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/python
/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/lib/python3.12/site-packages/torch/__init__.py
/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/ipython
Not only can Bob now view Alice’s benchmarking and profiling results, but he can also run the Jupyter notebooks locally. He has been able to quickly and easily get to a point where he can lend his performance tuning expertise to Alice and collaborate with her.
Charlie’s perspective
Charlie is a user for whom Alice’s project is, simply, a means to an end. He would like to use Alice’s model on his own data. However, before he does so he would like to ensure that Alice’s project is correctly configured and he has all the requisite dependencies. (I.e., he would like to quickly verify that the test suite in Alice’s project successfully runs.) He will, at present, not be contributing any changes to it. (In other words, Charlie is not interested in evaluating the benchmarks contained in the Jupyter notebooks.) As such, he’s only interested in the production dependencies (as opposed to Bob).
git clone /tmp/alice charlie
cd charlie
Since the project has some “production” dependencies from conda, Charlie proceeds to create a conda environment following Alice’s instructions.
pdmcondaprofile="$(basename $(pdm venv create -cn charlie | cut -d ' ' -f 2))"
pdm use --venv "${pdmcondaprofile}"
echo "${pdmcondaprofile}"
STATUS: Creating virtualenv using mamba...
Using Python interpreter: /home/user/miniconda3/envs/charlie-Z9GinPUb-3.12/bin/python (3.12)
charlie-Z9GinPUb-3.12
Finally, he installs the production dependencies from the lock file.
pdm install --prod # may need multiple invocations for success
As expected, he now has access to the production, but not the development dependencies.
pdm run which postgres
pdm run which python
pdm run python -c "import torch; print(torch.__file__)"
# grep -q exits with status 0 only when ipython appears in the listing
if pdm list | grep -q ipython; then
    echo "STATUS ERROR: development dependencies were installed"
else
    echo "STATUS OK: No development dependencies were installed"
fi
/home/user/miniconda3/envs/charlie-Z9GinPUb-3.12/bin/postgres
/home/user/miniconda3/envs/charlie-Z9GinPUb-3.12/bin/python
/home/user/miniconda3/envs/charlie-Z9GinPUb-3.12/lib/python3.12/site-packages/torch/__init__.py
STATUS OK: No development dependencies were installed
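The absence of development dependencies can also be checked from inside the environment’s interpreter, for example via pdm run python. The following is a sketch under the assumption that the packages in question expose standard distribution metadata; installed_subset and DEV_ONLY are hypothetical names.

```python
from importlib import metadata

# Development-only dependencies from Alice's setup.
DEV_ONLY = ("ipython", "jupyterlab")


def installed_subset(names: tuple[str, ...]) -> list[str]:
    """Return those distributions in names that are installed in this environment."""
    present = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
        present.append(name)
    return present


if __name__ == "__main__":
    found = installed_subset(DEV_ONLY)
    if found:
        print("STATUS ERROR: development dependencies were installed:", found)
    else:
        print("STATUS OK: No development dependencies were installed")
```

Looking up distribution metadata avoids importing the packages, so the check stays cheap and has no side effects.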
Charlie can now proceed to run the test suite. Note that the test suite may still fail for other reasons (for instance, if Alice has forgotten to pin some random seed). But Charlie is one step closer to understanding whether or not Alice’s project in its current state could meet his needs.
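As an aside on seeds, here is a minimal illustration of what pinning buys, using only the standard library. It is a sketch, not a description of Alice’s test suite; sample_batch is a hypothetical function.

```python
import random


def sample_batch(seed: int, size: int = 5) -> list[float]:
    """Draw size pseudo-random numbers from a generator seeded with seed."""
    rng = random.Random(seed)  # a local generator avoids touching global state
    return [rng.random() for _ in range(size)]


if __name__ == "__main__":
    # The same seed yields the same draws, which is what makes a test repeatable.
    print(sample_batch(1234) == sample_batch(1234))
    # In a PyTorch project the analogous call is torch.manual_seed(seed).
```

Without a fixed seed, a test that depends on sampled values can pass on one machine and fail on another even when the dependencies are identical.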
What did we learn?
By ensuring that all project dependencies are installed via pdm (either directly, or indirectly via the pdm-conda plugin), Alice has ensured that pyproject.toml captures the direct dependencies of the project.
Additionally, she has ensured that the pdm.lock file captures a snapshot (i.e., for each package, the specific version installed by Alice) of all dependencies (i.e., both direct and transitive).
By tracking both these files in git, she has ensured that, if nothing else, she can recreate a previously known-to-be-working environment without too much effort.
This is significant.
Why pdm and not …
Conda
Conda, like pdm, supports Python version management. (I.e., one could have two different projects, each using a different version of Python.)
However, unlike pdm, with Conda you are unable to track dependencies that are not available in a Conda channel.
This isn’t just a theoretical limitation, but a practical one.
Conda, unlike pdm, also doesn’t generate a “lock” file. (Though there seems to be conda-lock, which claims to do that.)
Without a “lock” file there is little hope of recreating a previously known-to-be-working environment. (Note that a “lock” file isn’t sufficient. What happens when a previously available version stops being available? Thankfully, there are projects such as Software Heritage.)
Poetry
Poetry, like pdm, supports a “lock” file.
However, unlike pdm, Poetry does not support Python version management.
Rye
Rye, like pdm, supports Python version management.
However, unlike pdm (via pdm-conda), it is unable to track dependencies from Conda.
Also, unlike pdm, its “lock” file isn’t cross-platform.
Guix (or Nix, if so inclined)
Guix is a cross-platform functional package manager.
It is the thing one should use if one cares about reproducibility. (Reproducibility and/or repeatability is an important topic on which I have a few thoughts. However, what I have to say is too large to fit in the margin and must wait for a future blog post.)
However, using Guix would require strictly more effort than pdm. (Whether or not said effort is justified depends on the specific needs of the project.)
Limitations
A non-exhaustive list:
- While the “lock” file generated by pdm is cross-platform by default, there is no guarantee that Alice hasn’t opted for the lock strategy to be platform specific.
- Additionally, even if the “lock” file is cross-platform, it doesn’t mean that the test suite has been confirmed to be successfully working on multiple platforms.
- The “lock” file doesn’t capture some environment specifics such as the OS kernel version, version of libc, etc. (These gaps may or may not prove to be pertinent for the code to fulfill the developer’s intent.)
- As noted previously, a “lock” file also doesn’t offer any protection against the specific versions of dependencies simply becoming unavailable at a future point in time.
- The approach followed by Alice in this post, while fulfilling some necessary criteria for reproducibility, is insufficient for reproducibility. Correctly capturing project dependencies doesn’t offer any protection from, say, the contents of a remote dataset changing over time. (Or a host of other possible issues that can affect reproducibility.)
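Some of the environment specifics that a “lock” file omits can at least be recorded alongside test results. The sketch below uses Python’s platform module; environment_fingerprint is a hypothetical helper, and the choice of fields is an assumption about what might matter for a given project.

```python
import json
import platform


def environment_fingerprint() -> dict[str, str]:
    """Collect a few platform details that a lock file does not record."""
    libc_name, libc_version = platform.libc_ver()  # may be empty on some systems
    return {
        "system": platform.system(),
        "kernel_release": platform.release(),
        "machine": platform.machine(),
        "libc": f"{libc_name} {libc_version}".strip(),
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    print(json.dumps(environment_fingerprint(), indent=2))
```

Recording such a fingerprint does not make the environment reproducible, but it helps explain why a test suite passes on one machine and fails on another.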
And yet, despite the above, if the approach outlined in this post is followed, the situation is already considerably better than what one routinely encounters in Python projects today.