Jose
Data scientist from Alcalá la Real. Studied at BarcelonaTech, worked as a researcher at the UGR and UPC, was a machine learning engineer at El Ranchito and Nemeda, and now work at Koh Young Research Spain. Always wanting to explain my knowledge to the world.

How to build a Python library

When I was halfway through my bachelor’s thesis, I decided to turn everything into a Python package to make it easier for the next student to use. However, instead of just following a nice and simple tutorial, I decided to read and study the relevant PEPs and the official setuptools documentation. That was quite time consuming but also well worth it. I will always recommend going as deep as you can into every topic you are interested in. This post is a summary of the fundamental takeaways. Packaging in Python has changed over the last few years, so what you will find here is probably not what some of the packages you already know use. The post is structured as follows: first, the summary; then, some extra details; and finally, an anecdote.

Packaging your code

The process is very simple once you know how to do it. You just need to put every code file inside a folder, add a pyproject.toml file, run some commands and you are done. For example, my library was called tumourkit, so I put everything I had so far into a folder called tumourkit. The most basic pyproject.toml can look like this:

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "tumourkit"
authors = [
  { name = "Jose Pérez Cano", email = "joseperez2000@hotmail.es" },
]
description = "A SDK for tumour study"
requires-python = ">=3.8"
dynamic = ["version", "dependencies"]

[tool.setuptools.dynamic]
dependencies = { file = "requirements.txt" }

For this pyproject.toml, the file structure should be:

root-folder
├── tumourkit
│   └── ...
├── pyproject.toml
└── requirements.txt

Next, install the packaging tools with python -m pip install twine build wheel and run two commands: python -m build, which creates the distribution files under dist/, and python -m twine upload dist/*, which uploads them to PyPI. With that you are done.

You are probably thinking that this is very simple, and you are right. Packaging code can get quite complex, but for the most part it is just this. There are some extra additions you can consider, like adding a readme, exposing some command line programs, tagging the package with classifiers or including a license. Many of these additions only require a few more lines in the project section:

[project]
...
readme = "README.md"
license = { file = "LICENSE" }
classifiers = [
    "Development Status :: 2 - Pre-Alpha",
    "Environment :: Console",
    "Intended Audience :: Healthcare Industry",
    "Programming Language :: Python :: 3.8",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Typing :: Stubs Only",
    ...
]

There are many classifiers; the full list is published on PyPI. The file structure should now include the extra files:

root-folder
├── tumourkit
│   └── ...
├── pyproject.toml
├── requirements.txt
├── README.md
└── LICENSE

Exposing commands is also very simple. Suppose you have a function called main in the file tumourkit/example.py and you want to expose it as a command named my_example. Then you could add these lines to the toml:

[project.scripts]
my_example = "tumourkit.example:main"

So now, once the library is installed, running my_example in the terminal is equivalent to calling that function. You don’t need the classic if __name__=='__main__' block; in fact it is never executed, because the command imports the module instead of running it as a script. Choosing a license could be a topic for another day. For now, I will just mention that my library ended up being AGPL-3.0.
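
For the my_example command above, the function behind it could look like the following sketch. The --name option and the greeting are made up for illustration and are not part of the real tumourkit. The only real requirement is that main can be called without arguments, because the wrapper script generated at install time essentially runs sys.exit(main()).

```python
# Hypothetical contents of tumourkit/example.py
import argparse
from typing import List, Optional


def main(argv: Optional[List[str]] = None) -> int:
    """Function referenced as "tumourkit.example:main" in [project.scripts]."""
    parser = argparse.ArgumentParser(prog="my_example")
    # --name is an illustrative option, not part of the real tumourkit CLI.
    parser.add_argument("--name", default="world", help="who to greet")
    # With argv=None, argparse reads the arguments from sys.argv[1:].
    args = parser.parse_args(argv)
    print(f"Hello, {args.name}!")
    # The return value becomes the exit code of the command (0 means success).
    return 0
```

Accepting an optional argv list is a design choice that keeps the function easy to test: main(["--name", "Ada"]) can be called directly from Python without touching the terminal.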

Even though this is enough for a package to be distributed, many of the packages you know have far more files than just the four I mentioned. That is because packages also have documentation, tests, build configuration, rules for including or excluding non-code files and much more. For every extra thing you add to your package, you will usually need an extra file that describes it. In my case I only added one more file, .readthedocs.yaml. This file is read by the Read the Docs servers and contains the information on how to build the documentation. I just copied it from their page. This is something to be done incrementally: whenever something is required, you look up the relevant documentation and include it. Following the philosophy of doing things incrementally, it is also good practice to keep a CHANGELOG.md and follow semantic versioning.
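
As a rough illustration of semantic versioning, the sketch below bumps a MAJOR.MINOR.PATCH string. The function names are hypothetical, and it assumes plain numeric versions with no pre-release or build suffixes.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a plain "MAJOR.MINOR.PATCH" string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version: str, part: str) -> str:
    """Return the next version after a change of the given kind."""
    major, minor, patch = parse_version(version)
    if part == "major":  # incompatible API change: reset minor and patch
        return f"{major + 1}.0.0"
    if part == "minor":  # backwards-compatible feature: reset patch
        return f"{major}.{minor + 1}.0"
    if part == "patch":  # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

For instance, bump("1.4.2", "minor") gives "1.5.0", and each such bump would get its own entry in CHANGELOG.md.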

The rabbit hole

All of the above is now very well explained here. But when I decided to do it, it was a period of transition. There were some online tutorials that still referred to the old-fashioned setuptools way of doing things. Until quite recently, setuptools was the only way of doing things. You needed a setup.py file that was run to build the egg files that were then installed or uploaded. But the Python community decided it was time to democratize it. PEP 517 specified how to create new build backends so that others could emerge. We now have poetry, flit, hatchling and others. Also, to simplify the configuration file, PEP 518 introduced pyproject.toml. These PEPs were accepted in 2016 (PEP 518) and 2017 (PEP 517), but it took a few years until backends started to appear. Between 2018 and 2019, setuptools version 40 gained support for building through pyproject.toml, so setup.py became optional even if you use setuptools, although declaring the project metadata in pyproject.toml itself only arrived with version 61.0 in 2022, which is why the example above requires setuptools>=61.0. I did my thesis in 2023. It would seem as if 4 years were enough for the world to adapt to those changes. But it wasn’t. I think in most cases packages are built once and used forever, but never fully maintained. The cost of keeping up with those changes is quite high, with no reward at all. I did it this way because it was my first time, so I thought I had better do it the up-to-date way. But I can understand that you don’t want to change how everything is configured once it is working.

Python is always evolving. In recent months, PEP 703 showed up proposing a way to make the GIL optional, and the GIL is probably one of the most representative aspects of the language, together with the garbage collector. And some very characteristic tools, like pip, were not there from the beginning. In the beginning there was distutils and easy_install. It was in 2013, with PEP 453, that pip was proposed as the default. However, the community is often slow to adopt changes widely, as with type hints. They were proposed in PEP 484 in 2014 but, as of today, I can only find them in very mature libraries. Amateurs just use Python without types. And that is fine. There is no need to be knowledgeable about everything. Sometimes the basics are more than enough. That is something I really like about computer science. You can do a lot knowing so little, but if you want to know more, there is always more.

Anecdote

I was hired by the university while I was doing my bachelor’s thesis to develop an algorithm that could solve a problem. That means you get paid in exchange for giving up some freedom over what you can and cannot do. In my case that meant providing guidance to others in the project, and that I wasn’t the one making the decisions about priorities. Packaging code is time consuming, and therefore I needed to ask my “boss” whether or not I should spend time on that. However, I just wanted to do it, so I did it. But since I needed to justify myself, I did the following. On a Friday night I emailed my tutors asking for permission. I knew they were busy, so they wouldn’t respond until the following Monday. On Saturday I spent more than 10 hours reading and making changes to the code so that it could be a package. There were around 5 thousand lines of code, but luckily I could refactor everything in one day. When Sunday arrived, I just emailed again saying that I took the absence of a response as approval and that I had already done everything that was needed. They couldn’t say no; after all, the work was already done. Normally, you won’t get blamed for doing more, only for doing less. If you want to build something, just do it.