The Weary Travelers

A blog for computer scientists


Date: 2024-07-16
Author: Suhail

How to do Python development without tears

Many machine learning projects are written in Python. Many of these projects are written in a way that makes it difficult for individuals to recreate a known working development environment. This is a pity, because it slows down scientific progress; peers find it difficult to confirm results and collaborators struggle to extend ideas in novel ways. Following some simple guidelines can dramatically help the adoption of your project and ideas. Your results will be verified and extended by others, your projects will gain adoption, and your set of potential collaborators will, in turn, grow.

Where we have previously focused on writing extensible software, here we focus on the ease of “getting started”: being able to quickly create a known-to-be-working development environment.[1] That is, is there a fairly easy way to capture the direct and transitive dependencies of a Python project so that one can (re-)create a development environment with little effort?

By the end of the post you will understand how to effectively leverage tools such as pdm and pdm-conda to make your life[2] easier.

Let’s dig in!

The scene

Our story, like all stories, begins at the beginning. We have the following characters:

Alice
A data scientist who heavily uses Python. She has developed a number of projects[3] to help other data scientists train and use machine learning models. At any given moment she usually has several projects that she is actively maintaining. Alice will also routinely go back to projects from a few months or years ago to extend them. As such, it is necessary for Alice to be able to work on multiple projects[4] simultaneously. It is also necessary for her to quickly create a previously known-to-be-working development environment.[5]
Alice’s latest project
Alice’s latest machine learning project. It is based on PyTorch and uses PostgreSQL as well. It comes with a test suite, the successful evaluation of which is necessary to confirm that the project is working as expected and is sufficient to confirm that the project is being run in an environment that is correctly configured. The project also comes with a Jupyter notebook with some performance benchmarks and detailed profiling results.
Bob
Alice’s colleague and fellow data scientist. Alice and Bob frequently collaborate on projects. Bob is especially skilled at performance tuning and has helped Alice optimize CUDA kernels in her projects in the past. As such, it is necessary for Bob to be able to evaluate the Jupyter notebooks[6] for the project.
Charlie
A user[7] of Alice’s latest machine learning project. Charlie doesn’t intend to collaborate on the project with Alice and Bob, but is grateful to them for their developmental efforts. Charlie wants to evaluate Alice’s project on his own “private” data. As such, it’s necessary for him to be able to successfully evaluate the test suite first and foremost.[8]

Alice is about to start development on a Python 3.12[9] project that has both PyPI and conda dependencies. She wants to be able to configure the project to her liking, without constraining other developers[10], while also helping users[11] quickly get started.

We will assume that everyone involved already has appropriate versions of pdm and pdm-conda installed.[12] If you have mamba installed, you may run the commands below to get a similar environment:[13]

mamba install -n base --force-reinstall  "conda-forge::pdm==2.15.4"
mamba install -n base --force-reinstall --update-deps "conda-forge::pdm==2.15.4"
pdm self add pdm-conda==0.18.1

One may confirm that the above commands have worked as intended using the below:

which pdm
pdm --version
pdm self list 2>/dev/null | grep pdm-conda
/home/user/miniconda3/bin/pdm
PDM, version 2.15.4
pdm-conda               0.18.1     A PDM plugin to resolve/install/uninstall    

Alice’s perspective

Alice initializes her project.

cd /tmp
rm -rf /tmp/alice
mkdir -p alice && cd alice
git init
git commit --allow-empty -m "alice: init"
pdm init --runner mamba # using python 3.12 and defaults
git add .
git commit -m "alice: pdm init using python 3.12 and mamba"

Since this project will have both conda and PyPI dependencies, Alice also creates a conda environment that she’ll be using and makes pdm aware of it.

pdmcondaprofile="$(basename $(pdm venv create -cn alice | cut -d ' ' -f 2))"
pdm use --venv  "${pdmcondaprofile}" # populate the .pdm-python file with the name of the environment
echo "${pdmcondaprofile}"
STATUS: Creating virtualenv using mamba...
Using Python interpreter: /home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/python (3.12)
alice-NLGZxbVC-3.12

Alice then installs dependencies for her project. However, instead of using pip or conda (or mamba) directly, she installs them via pdm and the pdm-conda plugin. Doing so ensures that the dependencies and their specific versions are noted in the project “lock” file.[14]

git status # confirm it's clean
pdm add --conda "postgresql>12.0" # may need multiple invocations for success

Having installed the dependencies, she confirms that they have been captured in the project “lock” file.

git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   pyproject.toml

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	pdm.lock

no changes added to commit (use "git add" and/or "git commit -a")

She adds both the “lock” file and the modified pyproject.toml to the git index and commits the changes.

git add pdm.lock pyproject.toml
git commit -m "alice: install: postgresql>12"
git status

Alice will be using PyTorch in this project. She installs this from PyPI and commits the changes to the repository.

pdm add "torch>=2.3.0"
git add pdm.lock pyproject.toml
git status
git commit -m "alice: install: torch>=2.3.0"

During development, Alice also prefers some additional utilities to be available. Since these are not central to the project, but may be helpful to other collaborators, she installs them as development-only dependencies.

pdm add --dev --conda "ipython"
pdm add --dev --conda "jupyterlab"

And then commits the changes.

git add pdm.lock pyproject.toml
git commit -m "alice: install: dev: ipython, jupyterlab"
git status

Thus far, Alice has all her dependencies installed and she has the ability to invoke the Python interpreter[15] in the correct environment via pdm run:

pdm run which python
pdm run which postgres
pdm run python -c "import torch; print(torch.__file__)"
pdm run which ipython
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/python
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/postgres
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/lib/python3.12/site-packages/torch/__init__.py
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/ipython
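
The same checks can also be scripted, which is handy as a first step in a test suite. The following is a minimal sketch using only the standard library; the function name is made up for illustration, and the module and executable names are simply the ones used in this post.

```python
import importlib.util
import shutil


def missing_requirements(modules, executables):
    """Return (modules that cannot be found, executables not on PATH)."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables


if __name__ == "__main__":
    # torch comes from PyPI; postgres and ipython come from conda (per the post).
    mods, exes = missing_requirements(["torch"], ["postgres", "ipython"])
    print("missing modules:", mods)
    print("missing executables:", exes)
```

Saved under a name of your choosing (say, check_env.py) and run as pdm run python check_env.py, both lists should come back empty in Alice's environment. The result depends on the interpreter and PATH of whichever process runs the script.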

Note, however, that directly invoking python doesn’t do the “right” thing[16]; she has to prefix the command invocation with pdm run.[17]

which python
python -c "import torch; print(torch.__file__)"
/usr/bin/python
Traceback (most recent call last):
ModuleNotFoundError: No module named 'torch'

If Alice doesn’t want every command invocation to be prefixed with pdm run, she can accomplish that using direnv. In her case she has direnv installed via Guix.

which direnv
/home/user/.guix-profile/bin/direnv

She creates a .envrc file in her project with the following contents:

if [ -f ".pdm-python"  ]; then
    cmd="$(pdm venv activate 2>/dev/null)"
    if [ conda = "$(echo $cmd | cut -d ' ' -f 1)" ]; then
        # <https://github.com/direnv/direnv/issues/326#issuecomment-574779485>
        eval "$(conda shell.bash hook)"
    fi
    eval "$cmd"
else
    echo "Missing .pdm-python; ensure pdm has been initialized"
fi

She then “allows” the .envrc file so that it may be automatically evaluated.

direnv allow .

Now, running python directly, anywhere within the project directory,[18] automatically invokes the correct Python interpreter. Prefixing the command invocation with pdm run is no longer necessary.

which python
python -c "import torch; print(torch.__file__)"
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/bin/python
/home/user/miniconda3/envs/alice-NLGZxbVC-3.12/lib/python3.12/site-packages/torch/__init__.py

Since Alice’s choice to use direnv is a personal one, she doesn’t subject others to it. She adds the .envrc file to .gitignore.

echo -e "\n.envrc" >> .gitignore
git add .gitignore
git commit -m "alice: gitignore: ignore .envrc"

Having done the initial setup to her liking, Alice continues developing, testing, and extending her project. She takes care to ensure that every time she needs to add or alter a dependency:

  • she does so via pdm, and
  • subsequently commits the changes to the pyproject.toml and pdm.lock files to git.

The above allows her to track her project dependencies with the project itself. It also allows her to track the dependencies in sufficient detail to allow her future self to recreate a previously known-to-be-working environment. As we’ll shortly see, doing the above also allows her to serve the needs of collaborators like Bob and users like Charlie at no extra cost.

Bob’s perspective

Bob is Alice’s colleague, and he wants to be able to collaborate with Alice. He clones Alice’s repository so he can start working on it locally.

git clone /tmp/alice bob
cd bob

With a fresh checkout, this is what Bob’s view looks like.[19]

.  ..  .git  .gitignore  pdm.lock  pdm.toml  pyproject.toml  README.md  src  tests

Bob then proceeds to create a conda environment.[20]

pdmcondaprofile="$(basename $(pdm venv create -cn bob | cut -d ' ' -f 2))"
pdm use --venv  "${pdmcondaprofile}"
echo "${pdmcondaprofile}"
STATUS: Creating virtualenv using mamba...
Using Python interpreter: /home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/python (3.12)
bob-i2DthxFJ-3.12

Doing so creates a .pdm-python file with the correct entry.

/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/python
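
Since .pdm-python is a plain-text file holding the interpreter path, a script can check whether it is running under the interpreter recorded for the project. The following is a rough sketch; the helper names are invented for illustration, and it assumes the file contains nothing but the path, as in the output above.

```python
import os
import sys
import tempfile
from pathlib import Path


def recorded_interpreter(project_dir):
    """Return the interpreter path stored in .pdm-python, or None if absent."""
    marker = Path(project_dir) / ".pdm-python"
    if not marker.is_file():
        return None
    return marker.read_text().strip()


def uses_recorded_interpreter(project_dir, executable=sys.executable):
    """True when `executable` is the interpreter recorded for the project."""
    recorded = recorded_interpreter(project_dir)
    if recorded is None:
        return False
    # Resolve symlinks so that bin/python and bin/python3.12 compare equal.
    return os.path.realpath(recorded) == os.path.realpath(executable)


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real project.
    with tempfile.TemporaryDirectory() as tmp:
        print(uses_recorded_interpreter(tmp))  # no .pdm-python yet
        Path(tmp, ".pdm-python").write_text(sys.executable + "\n")
        print(uses_recorded_interpreter(tmp))
```

In a real checkout one would pass the project root instead of a temporary directory.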

Finally, he installs the dependencies from the lock file. Since Bob will be contributing to the project, he installs both the development and the production dependencies.[21]

pdm install # may need multiple invocations for success

And now he has access to both Conda and PyPI dependencies.

pdm run which postgres
pdm run which python
pdm run python -c "import torch; print(torch.__file__)"
pdm run which ipython
/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/postgres
/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/python
/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/lib/python3.12/site-packages/torch/__init__.py
/home/user/miniconda3/envs/bob-i2DthxFJ-3.12/bin/ipython

Not only can Bob now view Alice’s benchmarking and profiling results, but he can also run the Jupyter notebooks locally. He has been able to quickly and easily get to a point where he can lend his performance tuning expertise to Alice and collaborate with her.

Charlie’s perspective

Charlie is a user for whom Alice’s project is, simply, a means to an end. He would like to use Alice’s model on his own data. However, before he does so he would like to ensure that Alice’s project is correctly configured and he has all the requisite dependencies.[22] He will, at present, not be contributing any changes to it.[23] As such, he’s only interested in the production dependencies.[24]

git clone /tmp/alice charlie
cd charlie

Since the project has some “production” dependencies from conda, Charlie proceeds to create a conda environment following Alice’s instructions.

pdmcondaprofile="$(basename $(pdm venv create -cn charlie | cut -d ' ' -f 2))"
pdm use --venv  "${pdmcondaprofile}"
echo "${pdmcondaprofile}"
STATUS: Creating virtualenv using mamba...
Using Python interpreter: /home/user/miniconda3/envs/charlie-Z9GinPUb-3.12/bin/python (3.12)
charlie-Z9GinPUb-3.12

Finally, he installs the production dependencies from the lock file.

pdm install --prod # may need multiple invocations for success

As expected, he now has access to the production, but not the development dependencies.

pdm run which postgres
pdm run which python
pdm run python -c "import torch; print(torch.__file__)"
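# Test grep's exit status directly; wrapping it in [ $(...) ] is always false once its output is discarded
if pdm list | grep ipython >/dev/null; then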
    echo "STATUS ERROR: development dependencies were installed"
else
    echo "STATUS OK: No development dependencies were installed"
fi
/home/user/miniconda3/envs/charlie-Z9GinPUb-3.12/bin/postgres
/home/user/miniconda3/envs/charlie-Z9GinPUb-3.12/bin/python
/home/user/miniconda3/envs/charlie-Z9GinPUb-3.12/lib/python3.12/site-packages/torch/__init__.py
STATUS OK: No development dependencies were installed
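
The development-dependency check can also be done from inside the interpreter. Below is a minimal sketch built on the standard importlib.metadata module; the function names are invented for illustration, and it assumes that the conda-installed packages expose the usual Python package metadata, which may not hold for every conda package.

```python
import re
from importlib.metadata import distributions


def normalize(name):
    """Lower-case a distribution name and unify -, _ and . (as in PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def installed_distributions():
    """Names of all distributions visible to the running interpreter."""
    names = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken metadata
            names.add(normalize(name))
    return names


def unexpected_packages(dev_only, installed):
    """Return the dev-only packages that are present in `installed`."""
    present = {normalize(name) for name in installed}
    return sorted(normalize(name) for name in dev_only if normalize(name) in present)


if __name__ == "__main__":
    # ipython and jupyterlab are the dev-only packages Alice added in the post.
    found = unexpected_packages(["ipython", "jupyterlab"], installed_distributions())
    if found:
        print("STATUS ERROR: development dependencies were installed:", found)
    else:
        print("STATUS OK: No development dependencies were installed")
```

Run via pdm run python, this inspects the project environment rather than the system interpreter.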

Charlie can now proceed to run the test suite. Note that the test suite may still fail for other reasons.[25] But Charlie is one step closer to understanding whether or not Alice’s project in its current state could meet his needs.

What did we learn?

By ensuring that all project dependencies are installed via pdm[26], Alice has ensured that pyproject.toml captures the direct dependencies of the project. Additionally, she has ensured that the pdm.lock file captures a snapshot[27] of all dependencies.[28] By tracking both these files in git, she has ensured that, if nothing else, she can recreate a previously known-to-be-working environment without too much effort. This is significant.

Why pdm and not …

Conda

Conda, like pdm, supports Python version management.[29] However, unlike pdm, with Conda you are unable to track dependencies that are not available in a Conda channel. This isn’t just a theoretical limitation, but a practical one. Conda, unlike pdm, also doesn’t generate a “lock” file.[30] Without a “lock” file there is little hope of recreating a previously known-to-be-working environment.[31]

Poetry

Poetry, like pdm, supports a “lock” file. However, unlike pdm, Poetry does not support Python version management.

Rye

Rye, like pdm, supports Python version management. However, unlike pdm[32], it is unable to track dependencies from Conda. Also, unlike pdm’s, Rye’s “lock” file isn’t cross-platform.

Guix[33]

Guix is a cross-platform functional package manager. It is the thing one should use if one cares about reproducibility.[34] However, using Guix would require strictly more effort than pdm.[35]

Limitations

A non-exhaustive list:

  • While the “lock” file generated by pdm is cross-platform by default, there is no guarantee that Alice hasn’t opted for the lock strategy to be platform specific.
  • Additionally, even if the “lock” file is cross-platform, it doesn’t mean that the test suite has been confirmed to be successfully working on multiple platforms.
  • The “lock” file doesn’t capture some environment specifics such as the OS kernel version, the version of libc, etc.[36]
  • As noted previously, a “lock” file also doesn’t offer any protection against the specific versions of dependencies simply becoming unavailable at a future point in time.
  • The approach followed by Alice in this post, while fulfilling some necessary criteria for reproducibility, is insufficient for reproducibility. Correctly capturing project dependencies doesn’t offer any protection from, say, the contents of a remote dataset changing over time.[37]

And yet, despite the above, if the approach outlined in this post is followed, the situation is already considerably better than what one routinely encounters in Python projects today.
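
One of the gaps listed above, the missing OS kernel and libc versions, can be narrowed by recording them alongside the lock file. Here is a minimal sketch using only Python's standard platform module; the function name and the choice of fields are illustrative assumptions, not something pdm provides.

```python
import json
import platform


def environment_snapshot():
    """Collect a few host details that a lock file does not record."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "system": platform.system(),
        "kernel_release": platform.release(),
        "machine": platform.machine(),
        # Empty on platforms where the libc cannot be determined.
        "libc": f"{libc_name} {libc_version}".strip(),
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    print(json.dumps(environment_snapshot(), indent=2))
```

Committing the output next to pdm.lock would at least document the host on which the environment was last known to work. It does not make the environment reproducible.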

Footnotes:

1

Notably, we will not be focusing on exact reproducibility. Reproducibility is larger in scope, harder, and deserves its own blog post.

2

And those of your collaborators and users.

3

Not all of which use the same Python version.

4

Each with their own set of dependencies.

5

Alice relies on the successful evaluation of the test suites in her projects to confirm that for her. I.e., what she ultimately cares about is an environment that successfully runs her code, which doesn’t necessarily require recreating a known-to-be-working environment. However, the latter can often be a cost-effective way of accomplishing the former.

6

In addition to the test suite.

7

Or, perhaps, reviewer.

8

After all, if he’s unable to even do that, is there much point in him trying to go further?

9

With its support for a per-interpreter GIL.

10

Such as Bob.

11

Such as Charlie.

12

I.e., Alice has taken the time to note this in the development instructions of her project.

13

In a telling turn of events, some time between when the first draft of this post was written and when it was published, updated versions of pdm and pdm-conda were released that were mutually incompatible.

14

For the benefit of her future self and others.

15

And other utilities.

16

I.e., “right” here means invoking the version installed in the conda environment with access to the installed dependencies.

17

Even if Alice were using Rye and had the associated shim installed, directly invoking python would not have worked as intended. This is because she’s not using the project-local .venv, but is instead using an external conda environment. As such, the Rye shim wouldn’t have automatically been able to correctly resolve things.

18

Including a sub-directory of it, and only when so.

19

Note, in particular, the absence of the .pdm-python file.

20

Alice, graciously, has noted the instructions in the README.

21

This is the default behaviour of pdm install.

22

I.e., he would like to quickly verify that the test suite in Alice’s project successfully runs.

23

In other words, Charlie is not interested in evaluating the benchmarks contained in the Jupyter notebooks.

24

As opposed to Bob.

25

For instance, if Alice has forgotten to pin some random seed.

26

Either directly, or indirectly via the pdm-conda plugin.

27

I.e., for each package, the specific version installed by Alice.

28

I.e., both direct and transitive.

29

I.e., one could have two different projects, each using a different version of Python.

30

Though there seems to be conda-lock which claims to do that.

31

Note that a “lock” file isn’t sufficient. What happens when a previously available version stops being available? Thankfully, there are projects such as Software Heritage.

32

Via pdm-conda.

33

Or Nix, if so inclined.

34

Reproducibility and/or Repeatability is an important topic on which I have a few thoughts. However, what I have to say is too large to fit in the margin and must wait for a future blog post.

35

Whether or not said effort is justified depends on the specific needs of the project.

36

These gaps may or may not prove to be pertinent for the code to fulfill the developer’s intent.

37

Or a host of other possible issues that can affect reproducibility.