This lesson is still being designed and assembled (Pre-Alpha version)

Packaging

Environments and task runners

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How do you install and manage packages?

  • How can you ensure others run the same code you do?

Objectives
  • Learn about virtual environments

  • Use a task runner to manage environments and run code

You will see two very common recommendations when installing a package:

pip install <package>         # Use only in virtual environment!
pip install --user <package>  # Almost never use

Don’t use them unless you know exactly what you are doing! The first one will try to install globally, and if you don’t have permission, will install to your user site packages. In global site packages, you can get conflicting versions of libraries, you can’t tell what you’ve installed for what, packages can update and break your system; it’s a mess. And user site packages are worse, because all installs of Python on your computer share them, so you might override and break things you didn’t intend to. And with pip’s new smart solver, updating packages inside a global environment can take many minutes and produce solves that are technically “correct” but don’t work, because the solver resolved conflicts by backtracking to versions released before the issues were discovered.

There is a solution: virtual environments (for libraries) or pipx (for applications).

There are likely a few packages (ideally just pipx) that you have to install globally. Go ahead, but be careful (and always use your system package manager instead if you can, like brew on macOS or one of the Windows package managers; Linux package managers tend to be too old to use for Python libraries).

Virtual Environments

The following uses the standard library venv module. The virtualenv package can be installed from PyPI and works almost identically, though it is a bit faster and provides a newer pip by default.

Python 3 comes with the venv module built-in, which supports making virtual environments. To make one, you call the module with

python3 -m venv .venv

This creates links to Python and pip in .venv/bin, and creates a site-packages directory at .venv/lib. You can just use .venv/bin/python if you want, but many users prefer to source the activation script:

. .venv/bin/activate

(This is shell-specific, but activation scripts are provided for all common shells.) Now .venv/bin has been added to your PATH, and usually your shell’s prompt will be modified to indicate you are “in” a virtual environment. You can now use python, pip, and anything you install into the virtualenv without having to prefix it with .venv/bin/.

Check the version of pip installed! If it’s old, you might want to run pip install -U pip or, for modern versions of Python, you can add --upgrade-deps to the venv creation line.
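For example, on Python 3.9 or newer, the option mentioned above lets you create the environment with an up-to-date pip in a single line:

python3 -m venv --upgrade-deps .venv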

To “leave” the virtual environment, you undo those changes by running the deactivate function the activation added to your shell:

deactivate

What about conda?

The same concerns apply to Conda. You should avoid installing things into the base environment, and instead make environments and use those, just as described above. Quick tips:

conda config --set auto_activate_base false  # turn off the default environment
conda create -n some_name  # or use paths with `-p`
conda activate some_name
conda deactivate

Pipx

There are many Python packages that provide a command line interface and are not really intended to be imported (pip, for example, should not be imported). It is really inconvenient to have to set up a venv for every command line tool you want to install, however. pipx, from the PyPA (the same group that maintains pip), solves this problem for you. If you pipx install a package, it will be installed into its own new virtual environment, and just the executable scripts will be exposed in your regular shell.

Pipx also has a pipx run <package> command, which will download a package and run a script of the same name, and will cache the temporary environment for a week. This means you have all of PyPI at your fingertips in one line on any computer that has pipx installed!
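For example (using the Black formatter here purely as an illustration of a PyPI package with a command line interface):

pipx run black --version

The first run downloads Black into a cached temporary environment; later runs within the week reuse it.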

Task runner (nox)

A task runner, like make (fully general), rake (Ruby general), invoke (Python general), tox (Python packages), or nox (Python semi-general), is a tool that lets you specify a set of tasks via a common interface. These can be a crutch, allowing poor packaging practices to be employed behind a custom script, and they can hide what is actually happening.

Nox has two strong points that help with this concern. First, it is very explicit, and even prints what it is doing as it operates. Unlike the older tox, it does not have any implicit assumptions built-in. Second, it has very elegant built-in support for both virtual and Conda environments. This can greatly reduce new contributor friction with your codebase.

A daily developer is not expected to use nox for simple tasks, like running tests or linting. You should not use nox to complicate a task that should be simple and standard (like building a package). You are not expected to use nox for linting on CI, or sometimes even for testing on CI, even if those tasks are provided for users. Nox is a few seconds slower than running directly in a custom environment, but for new users and rarely run tasks it is much faster than explaining how to get set up or manually messing with virtual environments. It is also highly reproducible, since it creates and destroys a temporary environment each time by default.

You should use nox to make it easy and simple for new contributors to run things. You should use nox to make specialized developer tasks easy. You should use nox to avoid making single-use virtual environments for docs and other rarely run tasks.

Since nox is an application, you should install it with pipx. If you use Homebrew, you can install nox with that (Homebrew isolates Python apps it distributes too, just like pipx).

Running nox

If you see a noxfile.py in a repository, that means nox is supported. You can start by checking to see what tasks (called sessions in nox) the noxfile author has provided. For example, if we do this on packaging.python.org’s repository:

nox -l  # or --list-sessions
Sessions defined in /github/pypa/packaging.python.org/noxfile.py:

- translation -> Build the gettext .pot files.
- build -> Make the website.
- preview -> Make and preview the website.
- linkcheck -> Check for broken links.

sessions marked with * are selected, sessions marked with - are skipped.

You can see that there are several different sessions. You can run them with -s:

nox -s preview

This will build and start up a preview of the site.

If you need to pass options to the session, separate the nox options from the session options with --.

Writing a Noxfile

For this example, we’ll need a minimal test file for pytest to run. Let’s make this file in a local directory:

# test_nox.py

def test_runs():
    assert True

Let’s write our own noxfile. If you are familiar with pytest, this should look familiar as well; it’s intentionally rather close to pytest. We’ll make a minimal session that runs pytest:

# noxfile.py
import nox

@nox.session()
def tests(session):
    session.install("pytest")
    session.run("pytest")

A noxfile is valid Python, so we import nox. The session decorator tells nox that this function is going to be a session. By default, the name will be the function name, the description will be the function docstring, it will run on the current version of Python (the one nox is using), and it will make a virtual environment each time the session runs, though all of this is changeable via keyword arguments to session.

The session function will be given a nox.Session object that has various useful methods. .install will install things with pip, and .run will run a command in the session. The .run command will print a warning if you use an executable from outside the virtual environment unless external=True is passed. Errors will exit the session.
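As a brief aside, here is a minimal sketch of running an outside tool (the session name and command are illustrative, not part of the lesson's noxfile):

# noxfile.py (sketch)
import nox

@nox.session()
def git_info(session):
    # git is not installed into the session's virtual environment,
    # so external=True suppresses the warning
    session.run("git", "--version", external=True)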

Let’s expand this a little:

# noxfile.py
import nox

@nox.session()
def tests(session: nox.Session) -> None:
    """
    Run our tests.
    """
    session.install("pytest")
    session.run("pytest", *session.posargs)

This adds a type annotation to the session object, so that IDEs and type checkers can help you write the code in the function. There’s a docstring, which will print out nice help text when a user lists the sessions. And we pass through to pytest anything the user passes in via session.posargs.

Let’s try running it:

nox -s tests
nox > Running session tests
nox > Creating virtual environment (virtualenv) using python3.10 in .nox/tests
nox > python -m pip install pytest
nox > pytest
==================================== test session starts ====================================
platform darwin -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/henryschreiner/git/teaching/packaging
collected 1 item

test_nox.py .                                                                          [100%]

===================================== 1 passed in 0.05s =====================================
nox > Session tests was successful.

Passing arguments through

Try passing -v to pytest.

Solution

nox -s tests -- -v

Virtual environments

Nox is really just doing the same thing we would do manually (and printing all the steps, except the exact details of creating the virtual environment). You can see the virtual environment in .nox/tests! How would you activate this environment?

Solution

. .nox/tests/bin/activate

Key Points

  • Virtual environments isolate software

  • Virtual environments solve the update problem

  • A task runner makes it easier to contribute to software


Python to package

Overview

Teaching: 20 min
Exercises: 5 min
Questions
  • How do we take code in a Jupyter Notebook or Python script and turn that into a package?

  • What are the minimum elements required for a Python package?

  • How do you set up tests?

Objectives
  • Create and install a Python package

  • Create and run a test

Much research software is initially developed by hacking away in an interactive setting, such as in a Jupyter Notebook or a Python shell. However, at some point when you have a more-complicated workflow that you want to repeat, and/or make available to others, it makes sense to package your functions into modules and ultimately a software package that can be installed. This lesson will walk you through that process.

Consider the rescale() function written as an exercise in the Software Carpentry Programming with Python lesson.

First, as needed, create your virtual environment and install NumPy with

virtualenv .venv
source .venv/bin/activate
pip install numpy

Then, in a Python shell or Jupyter Notebook, declare the function:

import numpy as np

def rescale(input_array):
    """Rescales an array from 0 to 1.

    Takes an array as input, and returns a corresponding array scaled so that 0
    corresponds to the minimum and 1 to the maximum value of the input array.
    """
    L = np.min(input_array)
    H = np.max(input_array)
    output_array = (input_array - L) / (H - L)
    return output_array

and call the function with:

>>> rescale(np.linspace(0, 100, 5))

which provides the output:

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])

Creating our package in six lines

Let’s create a Python package that contains this function.

First, create a new directory for your software package, called package, and move into that:

$ mkdir package
$ cd package

You should immediately initialize an empty Git repository in this directory; if you need a refresher on using Git for version control, check out the Software Carpentry Version Control with Git lesson. (This lesson will not explicitly remind you to commit your work after this point.)

$ git init

Next, we want to create the necessary directory structure for your package. This includes:

$ mkdir -p src/package tests docs

(The -p flag tells mkdir to create intermediate directories as needed; here it creates src as the parent of package.)

Putting the package directory and source code inside the src directory is not actually required; instead, if you put the <package_name> directory at the same level as tests and docs then you could actually import or call the package directory from that location. However, this can cause several issues, such as running tests with the local version instead of the installed version. In addition, this package structure matches that of compiled languages, and lets your package easily contain non-Python compiled code, if necessary.

Inside src/package, create the files __init__.py and rescale.py:

$ touch src/package/__init__.py src/package/rescale.py

__init__.py is required to import this directory as a package, and should remain empty (for now). rescale.py is the module inside this package that will contain the rescale() function; copy the contents of that function into this file. (Don’t forget the NumPy import!)
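For reference, src/package/rescale.py should now contain the function from before:

# contents of src/package/rescale.py
import numpy as np

def rescale(input_array):
    """Rescales an array from 0 to 1.

    Takes an array as input, and returns a corresponding array scaled so that 0
    corresponds to the minimum and 1 to the maximum value of the input array.
    """
    L = np.min(input_array)
    H = np.max(input_array)
    output_array = (input_array - L) / (H - L)
    return output_array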

The last element your package needs is a pyproject.toml file. Create this with

$ touch pyproject.toml

and then provide the minimally required metadata, which include information about the build system (hatchling) and the package itself (name and version):

# contents of pyproject.toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "package"
version = "0.1.0"

The package name given here, “package”, matches both the project directory we created earlier and the import package directory src/package that contains our code. We’ve chosen 0.1.0 as the starting version for this package; you’ll see more in a later episode about versioning, and how to specify this without manually writing it here.

The only elements of your package truly required to install and import it are the pyproject.toml, __init__.py, and rescale.py files. At this point, your package’s file structure should look like this:

.
├── docs
├── pyproject.toml
├── src
│   └── package
│       ├── __init__.py
│       └── rescale.py
└── tests

Installing and using your package

Now that your package has the necessary elements, you can install it into your virtual environment (which should already be active). From the top level of your project’s directory, enter

$ pip install -e .

The -e flag tells pip to install in editable mode, meaning that you can continue developing your package on your computer as you test it.

Then, in a Python shell or Jupyter Notebook, import your package and call the (single) function:

>>> import numpy as np
>>> from package.rescale import rescale
>>> rescale(np.linspace(0, 100, 5))
array([0.  , 0.25, 0.5 , 0.75, 1.  ])

This matches the output we expected based on our interactive testing above! 😅

Your first test

Now that we have installed our package and we have manually tested that it works, let’s set up this situation as a test that can be automatically run using nox and pytest.

In the tests directory, create the test_rescale.py file:

touch tests/test_rescale.py

In this file, we need to import the package, and check that a call to the rescale function with our known input returns the expected output:

# contents of tests/test_rescale.py
import numpy as np
from package.rescale import rescale

def test_rescale():
    np.testing.assert_allclose(
        rescale(np.linspace(0, 100, 5)),
        np.array([0.0, 0.25, 0.5, 0.75, 1.0]),
    )

Next, take the noxfile.py you created in an earlier episode, and modify it to install the test dependencies and your package, then run the tests:

# contents of noxfile.py
import nox

@nox.session
def tests(session):
    session.install('numpy', 'pytest')
    session.install('.')
    session.run('pytest')

Now, with the added test file and noxfile.py, your package’s directory structure should look like:

.
├── docs
├── noxfile.py
├── pyproject.toml
├── src
│   └── package
│       ├── __init__.py
│       └── rescale.py
└── tests
    └── test_rescale.py

(You may also see some __pycache__ directories, which contain compiled Python bytecode that was generated when calling your package.)

Have nox run your tests with the command

$ nox

This should give you some information about what nox is doing, and show output along the lines of

nox > Running session tests
nox > Creating virtual environment (virtualenv) using python in .nox/tests
nox > python -m pip install numpy pytest
nox > python -m pip install .
nox > pytest
======================================================================= test session starts =================================================
platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/niemeyek/Desktop/rescale
collected 1 item

tests/test_rescale.py .                                                                                                                [100%]

======================================================================== 1 passed in 0.07s ==================================================
nox > Session tests was successful.

This tells us that the output of the test function matches the expected result, and therefore the test passes! 🎉

We now have a package that is installed, can be interacted with properly, and has a passing test. Next, we’ll look at other files that should be included with your package.

Key Points

  • Use a pyproject.toml file to describe a Python package


Other files that belong with your package

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What other files are important parts of your software package?

  • What software license should you use for your project?

Objectives
  • Create a README for a software package

  • Choose a software license for a software package

  • Create a CHANGELOG for a package

We now have an installed, working Python package that provides some functionality. Are we ready to push the code to GitHub (or your preferred code hosting service) for others to use and contribute to? 🛑 Not quite—we need to add a few more files at minimum to describe our package, and to actually make it open-source software.

Aside from the name of the package and docstring included with the (single) function, we haven’t yet provided any description or other information about the package for anybody that comes across it.

We also haven’t specified the terms and conditions under which the software may be downloaded, used, and/or modified. This means that if we posted it online right now, due to copyright laws (in the United States, at least) nobody else would actually be able to use or modify the code, since we haven’t given explicit permission to do so.

Lastly, as you continue working on your package, you will likely fix bugs and modify/add/remove functionality. Although these changes will technically be present in your Git logs—because you are committing regularly and writing descriptive commit messages, right? 😉—you should also maintain a file that describes these changes in a human-readable way.

Creating a README

A README is a plaintext file that sits at the top level of your package (next to the src, tests, docs directories and pyproject.toml file) and provides general information about your software. Modern READMEs are typically written in Markdown, or occasionally reStructuredText (ReST), due to the additional formatting options that services like GitHub nicely render.

A README is a form of software documentation, and should contain at minimum:

  • the name of the software and a brief description of what it does,

  • installation instructions, and

  • a short usage example.

In addition, a README may also contain guidelines for contributing, license information, and badges showing the status of the software.

Create a README using

$ touch README.md

and then add these elements:

# Package

`Package` is a simple Python library that contains a single function for rescaling arrays.

## Installation

Download the source code and use the package manager [pip](https://pip.pypa.io/en/stable/) to install `package`:

```bash
pip install .
```

## Usage

```python
import numpy as np
from package.rescale import rescale

# rescales over 0 to 1
rescale(np.linspace(0, 100, 5))
```

## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

## License
TBD

You can see more guidance on creating READMEs at https://www.makeareadme.com.

Keep your READMEs relatively brief

You should try to keep your README files relatively brief, rather than including very detailed documentation in this file. A README should only be a high-level introduction, with detailed theory, examples, and other information reserved for a true documentation website.

Choosing a software license

Now, your package includes a README file, which tells someone who finds the source code a bit about how to use your project and also contribute to it. However, you still need one more element before uploading it to GitHub or another code-hosting service: a software license that explicitly gives specific permissions to users and contributors. Simply making your project available publicly is not the same as making it an open-source software project.

By default, when you make a creative work such as software (but also including writing and images), your work is under exclusive copyright. Others cannot use, copy, share/distribute, or modify your work without your permission. This is often a good thing, because it means you can put your work out into the world, and copyright protects you as the creator and owner of the work. Open Source Guides has more about the legal side of open source software.

However, if you have created research software and plan to share it openly, you want others to use your software, and possibly contribute to it. (Who doesn’t love having other people fix the bugs in their code?)

A software license provides the explicit permissions for others to use, modify, or share your code, and lays out the specific rules for any restrictions about how they can do those things. To pick a license, use resources like Choose a License or Civic Commons “Choosing a License” based on how you want others to interact with your software. You can also see the full list of open-source licenses approved by the Open Source Initiative, which maintains the Open Source Definition.

For a new project, you essentially have one major choice to make:

  1. Do you want to allow others to use your software in almost any way they want, or
  2. Do you want to require others to share any uses of your project in an open way?

These two categories are “permissive” and “copyleft” licenses. Common permissive licenses include the MIT License and BSD 3-Clause License. The GNU General Public License v3.0 (or GNU GPLv3) License is a common copyleft license.

Most research software uses permissive licenses like the MIT License, BSD 3-Clause License, or the Apache License 2.0. For an easy choice, we recommend using the BSD 3-Clause License, which includes a specific clause preventing the names of creators/contributors from being used to endorse or promote derivatives, without permission. However, you should choose the specific license that best fits your needs. In addition, when working on a project with others or as part of a larger effort, you should check if your collaborators have already determined an appropriate license; for example, on work funded by a grant, a particular license may be mandated by the proposal/agreement.

Create a LICENSE.txt file using

$ touch LICENSE.txt

and copy the exact text of the BSD 3-Clause License, modifying only the year and names:

BSD 3-Clause License

Copyright (c) [year], [fullname]

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

That’s it!

Do not write your own license!

You should never try to write your own software license, or modify the text of an existing license.

Although we are not lawyers, the licenses approved and maintained by the Open Source Initiative have gone through a rigorous review process, including legal review, to ensure that they are both consistent with the Open Source Definition and also are legally valid.

Adding a license badge to README

Badges are a fun and informative way to quickly show information about your software package in the README. Shields.io is a resource for generating badge images in SVG format, which can easily be added to the top of your README as links pointing to more information.

Using Shields.io (or an example found elsewhere), generate the Markdown syntax for adding a badge describing the BSD 3-Clause License we chose for this example package.

Solution

The Markdown syntax for adding a badge describing the BSD 3-Clause License is:

[![License](https://img.shields.io/badge/license-BSD-green.svg)](https://opensource.org/licenses/BSD-3-Clause)

and will render as a badge image that links to the license text.

Keeping a CHANGELOG

Over time, our package will likely evolve, whether through bug fixes, improvements, or feature changes. For example, the rescale function in our package does not have a way of properly treating cases where the max and min of the array are the same (i.e., when the array holds the same number repeated). For example:

import numpy as np
from package.rescale import rescale
a = 2 * np.ones(5)
rescale(a)

gives

rescale.py:11: RuntimeWarning: invalid value encountered in divide
  output_array = (input_array - L) / (H - L)
array([nan, nan, nan, nan, nan])

This is probably not the desired output; instead, let’s say we want to rescale all the values in this array to 1. We can modify the function to properly handle this situation:

def rescale(input_array):
    """Rescales an array from 0 to 1.

    Takes an array as input, and returns a corresponding array scaled so that 0
    corresponds to the minimum and 1 to the maximum value of the input array.
    """
    L = np.min(input_array)
    H = np.max(input_array)
    if np.allclose(L, H):
        output_array = input_array / L
    else:
        output_array = (input_array - L) / (H - L)
    return output_array

Now, when we call rescale (no need to reinstall or upgrade the package, since we previously installed using editable mode):

import numpy as np
from package.rescale import rescale
a = 2 * np.ones(5)
rescale(a)

we get the desired behavior:

array([1., 1., 1., 1., 1.])

Great! Let’s commit that change using Git with a descriptive message, and perhaps update the version to 0.1.1 to indicate the package has changed (more to come on that in a later episode on versioning).
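If you do bump the version now, that is a one-line change in pyproject.toml:

[project]
name = "package"
version = "0.1.1"  # was 0.1.0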

That may be enough for us to record the change, but how will a user of your package know that the functionality has changed? It’s not exactly easy to hunt through Git logs and try to find which commit message(s) align with the changes since the last version.

Instead, we should keep a changelog in a CHANGELOG.md file, also at the top level of your package’s directory. In this Markdown-formatted file, you should record major changes to the package made since the last released version. Then, when you decide to release a new version, you add a new section to the file above this list of changes.

Changes should be grouped together based on their type; suggestions for these come from the Keep a Changelog project by Olivier Lacan: Added, Changed, Deprecated, Removed, Fixed, and Security.

For example, our initial release was version 0.1.0, and we have now changed the functionality. Our CHANGELOG should look something like:

# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Unreleased
### Changed
- rescale function now scales constant arrays to 1

## [0.1.0] - 2022-08-09
### Added
- Created rescale() function and released package

If at this point you want to increment the version to 0.1.1 to indicate this small fix to the behavior, you would add a new section for this version:

## Unreleased

## [0.1.1] - 2022-08-10
### Changed
- rescale function now scales constant arrays to 1

## [0.1.0] - 2022-08-09
### Added
- Created rescale() function and released package

Note that the version numbers are shown as links in these examples, although the links are not included in the file snippets. You should add definitions of these links at the bottom of the file, using (for example) GitHub’s ability to compare between tagged versions:

[0.1.1]: https://github.com/<username>/package/compare/v0.1.0...v0.1.1

Additional files for Git

At this point, your package has most of the supplemental files that it needs to be shared with the world. However, there are some additional files you can add to help with your Git workflow.

gitignore

After adding and committing the files above, you might have noticed that git status points out a few files/directories that you do not want it to track:

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/
	src/package/__pycache__/
	tests/__pycache__/

nothing added to commit but untracked files present (use "git add" to track)

Fortunately, you can instruct Git to ignore these files, and others that you will never want to track, using a .gitignore file, which goes at the main directory level of your package.

This file tells Git to ignore either specific files or directories, or those that match a certain pattern via the wildcard character (e.g., *.so). The Git reference manual has very detailed documentation of possible .gitignore file syntax, but for convenience GitHub maintains a collection of .gitignore files for various languages and tools.

For your project, you should copy or download the Python-specific .gitignore file into a local .gitignore:

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# poetry
#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file.  For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

You can see that a few patterns are commented out, and can be uncommented if they apply to your project and/or workflow. You can also clean up sections of the file that do not apply to your situation, but there’s no real need to do so since you likely won’t look at this file again.

Once you have added this to the top level of your project (alongside the .git directory), and told Git to track it (git add .gitignore, git commit -m 'adds gitignore file'), Git will automatically begin following your rules:

$ git status
On branch main
nothing to commit, working tree clean

pre-commit hook

Adding the .gitignore file is helpful both for keeping your git status messages clean and also avoiding accidentally committing compiled or cache files. Another helpful step is to have Git run some checks prior to committing, to ensure things that go against standards and style preferences are not committed (and need to be fixed later).

One way to do this is to use the pre-commit framework, which when installed can check for things like trailing whitespace, merge conflicts, and so on. It can even perform a spellcheck for you.

There are three steps needed: installing the pre-commit framework, creating the configuration file for pre-commit (.pre-commit-config.yaml), and then installing the Git hook scripts.

You can install pre-commit itself using pip, Homebrew (on a Mac), or conda. Let’s assume you’ll install it with pip (or pipx):

pip install pre-commit

Then, create a .pre-commit-config.yaml file. You can copy this example file:

repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.3.0
  hooks:
  - id: check-added-large-files
  - id: check-case-conflict
  - id: check-merge-conflict
  - id: check-symlinks
  - id: debug-statements
  - id: end-of-file-fixer
  - id: mixed-line-ending
  - id: trailing-whitespace

- repo: https://github.com/codespell-project/codespell
  rev: 'v2.1.0'
  hooks:
  - id: codespell

In this YAML file, we have a repos list, which has one or more items. Each repo points to a repository that holds a supported hook. Here, we are using the hooks that come with pre-commit along with the codespell project. For each, you specify the version with the rev field, and then a list of hooks.

For the pre-commit-hooks repo, the ids above check for accidentally committed large files, filename case conflicts, leftover merge-conflict markers, broken symlinks, forgotten debug statements, and line-ending and whitespace problems.

The codespell hook checks for common misspellings in source code and related files.

There are additional plugins and hooks available, and there are also lots of configuration options you can customize.
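The third step is to install the Git hook scripts into your local repository, so the checks run automatically on every commit:

pre-commit install

You can also run all the hooks against all files on demand:

pre-commit run --all-files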

Summary

At this point, if you have added all of these files, your package’s file structure should look something like this:

.
├── .git
├── .gitignore
├── .nox
├── .pre-commit-config.yaml
├── .venv
├── docs
├── noxfile.py
├── pyproject.toml
├── src
│   └── package
│       ├── __init__.py
│       └── rescale.py
└── tests
    └── test_rescale.py

The .git, .nox, and .venv directories would have been automatically generated by Git, Nox, and virtualenv, respectively. You may also see additional directories like __pycache__ and .pytest_cache.

Key Points

  • In addition to the source code and project specification, packages should include a README, LICENSE, and CHANGELOG.

  • Do not create a custom software license or modify an existing license; instead, choose from the list of available licenses.

  • You can also include a .gitignore to avoid committing non-source files and a .pre-commit-config.yaml to automatically check for simple issues with your code before committing it.


Metadata

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • What metadata is important to add to your package?

  • How do I add common functionality, like executable scripts?

Objectives
  • Learn about the project table.

In a previous lesson, we left the metadata in our pyproject.toml quite minimal; we just had a name and a version. There are quite a few other fields that can really help your package on PyPI, however. We’ll look at them, split into two categories: informational (like author and description) and functional (like requirements). There’s also a special dynamic field that lets you list values that are going to come from some other source.

Informational metadata

Name

Required. ., -, and _ are all equivalent characters, and may be normalized to _. Case is unimportant. This is the only field that must exist statically in this table.

name = "some_project"

Version

Required. Many backends provide ways to read this from a file or from a version control system, so in those cases you would add "version" to the dynamic field and leave it off here.

version = "1.2.3"
version = "0.2.1b1"

Description

A string with a short description of your project.

description = "This is a very short summary of a very cool project."

Readme

The name of the readme. Most of the time this is README.md or README.rst, though there is a more complex mechanism if a user really desires to embed the readme into your pyproject.toml file (you don’t).

readme = "README.md"
readme = "README.rst"

Authors and maintainers

This is a list of authors (or maintainers) as (usually inline) tables. A TOML table is very much like a Python dict.

authors = [
    {name="Me Myself", email="email@mail.com"},
    {name="You Yourself", email="email2@mail.com"},
]
maintainers = [
    {name="It Itself", email="email3@mail.com"},
]

Note that TOML supports two ways to write tables and two ways to write arrays, so you might see this in a different form, but it should be recognizable.
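For example, the authors list above could equivalently be written with TOML's array-of-tables syntax:

[[project.authors]]
name = "Me Myself"
email = "email@mail.com"

[[project.authors]]
name = "You Yourself"
email = "email2@mail.com"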

Keywords

A list of keywords for the project. This is mostly used to improve searchability.

keywords = ["example", "tutorial"]

URLs

A set of links to help users find various things for your code; some common ones are Homepage, Source Code, Documentation, Bug Tracker, Changelog, Discussions, and Chat. It’s a free-form name, though many common names get recognized and have nice icons on PyPI.

# Inline form
urls.Homepage = "https://pypi.org"
urls."Source Code" = "https://pypi.org"

# Sectional form
[project.urls]
Homepage = "https://pypi.org"
"Source Code" = "https://pypi.org"

Classifiers

This is a collection of classifiers as listed at https://pypi.org/classifiers/. You select the classifiers that match your project from there. Usually, this includes a “Development Status” to tell users how stable you think your project is, and a few things like “Intended Audience” and “Topic” to help with search engines. There are some important ones though: the “License” classifiers are used to indicate your license. You can also give an idea of supported Python versions, Python implementations, and operating systems. If you have statically typed Python code, you can tell users about that, too.

classifiers = [
    "Development Status :: 5 - Production/Stable",
    "Intended Audience :: Developers",
    "Intended Audience :: Science/Research",
    "License :: OSI Approved :: BSD License",
    "Operating System :: OS Independent",
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3 :: Only",
    "Programming Language :: Python :: 3.7",
    "Programming Language :: Python :: 3.8",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Topic :: Scientific/Engineering",
    "Topic :: Scientific/Engineering :: Information Analysis",
    "Topic :: Scientific/Engineering :: Mathematics",
    "Topic :: Scientific/Engineering :: Physics",
    "Typing :: Typed",
]

License (special mention)

There is also a license field, but it is rather inadequate; it doesn’t support multiple licenses, for example. Currently, it’s best to indicate the license with a Trove Classifier, and to make sure your file is called LICENSE* so build backends pick it up and include it in the SDist and wheels. There’s work on standardizing an update to the format in the future. You can manually specify a license file if you want:

license = {file = "LICENSE"}

Verify file contents

Always verify the contents of your SDist and Wheel(s) manually to make sure the license file is included.

tar -tvf dist/package-0.0.1.tar.gz
unzip -l dist/package-0.0.1-py3-none-any.whl

Functional metadata

The remaining fields actually change the usage of the package.

Requires-Python

This is an important and sometimes misunderstood field. It looks like this:

requires-python = ">=3.7"

Pip will check whether the version of Python it’s installing for passes this expression. If it doesn’t, pip will start checking older versions of the package until it finds one that passes. This is how pip install numpy still works on Python 3.7, even though NumPy has already dropped support for it.

You need to make sure you always have this and it stays accurate, since you can’t edit metadata after releasing - you can only yank or delete release(s) and try again.

Upper caps

Upper caps are generally discouraged in the Python ecosystem, but they are (even more than usual) broken here, since this field was added to help users drop old Python versions, and the idea that it would be used to restrict newer versions was not considered. The above procedure is not the right one for an upper cap! Never upper cap this field; instead use Trove Classifiers to tell users what versions of Python your code was tested with.

Dependencies

Your package likely will need other packages from PyPI to run.

dependencies = [
  "numpy>=1.18",
]

You can list dependencies here without minimum versions, but if you have a lot of users, you might want minimum versions; pip will only upgrade an installed package if it’s no longer viable via your requirements. You can also use a variety of markers to specify operating system specific packages.
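For example, here is a sketch of a dependency that is only needed on older Pythons, using an environment marker (the package and version bounds are purely illustrative):

dependencies = [
  "numpy>=1.18",
  "typing_extensions>=4.0; python_version<'3.11'",
]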

project.dependencies vs. build-system.requires

What is the difference between project.dependencies vs. build-system.requires?

Answer

build-system.requires describes what your project needs to “build”, that is, produce an SDist or wheel. Installing a built wheel will not install anything from build-system.requires, in fact, the pyproject.toml is not even present in the wheel! project.dependencies, on the other hand, is added to the wheel metadata, and pip will install anything in that field if not already present when installing your wheel.

Optional Dependencies

Sometimes you have dependencies that are only needed some of the time. These can be specified as optional dependencies. Unlike normal dependencies, these are specified in a table, with the key being the option you pass to pip to install it. For example:

[project.optional-dependencies]
test = ["pytest>=6"]
check = ["flake8"]
plot = ["matplotlib"]

Now, you can run pip install 'package[test,check]', and pip will install both the required and optional dependencies pytest and flake8, but not matplotlib.

Entry Points

A Python package can have entry points. There are three kinds: command-line entry points (scripts), graphical entry points (gui-scripts), and general entry points (entry-points). As an example, let’s say you have a main() function inside __main__.py that you want to run to create a command project-cli. You’d write:

[project.scripts]
project-cli = "project.__main__:main"

The command line name is the table key, and the form of the entry point is package.module:function. Now, when you install your package, you’ll be able to type project-cli on the command line and it will run your Python function.
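For completeness, a minimal sketch of the function behind that entry point (the package and function names are just examples, not prescribed):

# src/project/__main__.py (sketch)
def main() -> None:
    print("Hello from project-cli!")

if __name__ == "__main__":
    main()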

Dynamic

Any fields from above that are supplied by your build backend instead should be listed in the special dynamic field. For example, if you want hatchling to read __version__ from src/package/__init__.py:

[project]
name = "package"
dynamic = ["version"]

[tool.hatch]
version.path = "src/package/__init__.py"

All together

Now let’s take our previous example and expand it with more information. Here’s an example:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "package"
version = "0.0.1"
authors = [
  { name="Example Author", email="author@example.com" },
]
description = "A small example package"
readme = "README.md"
license = { file="LICENSE" }
requires-python = ">=3.7"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]

[project.urls]
"Homepage" = "https://github.com/pypa/sampleproject"
"Bug Tracker" = "https://github.com/pypa/sampleproject/issues"

Key Points

  • There are a variety of useful bits of metadata you should add.


Versioning

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • First learning objective. (FIXME)

FIXME

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Documentation

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • First learning objective. (FIXME)

FIXME

Key Points

  • First key point. Brief Answer to questions. (FIXME)


CI with GitHub Actions

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How do you ensure code keeps passing?

Objectives
  • Use a CI service to run your tests

Continuous Integration (CI) allows you to perform tasks on a server for various events on your repository (called triggers). For example, you can use GitHub Actions (GHA) to run a test suite on every pull request.

GHA is made up of workflows, which consist of jobs; each job runs a series of steps, and steps can run shell commands or invoke reusable actions. Workflows are files in the .github/workflows folder ending in .yml.

Triggers

Workflows start with triggers, which define when things run. Here are three triggers:

on:
  pull_request:
  push:
    branches:
      - main

This will run on all pull requests and pushes to main. You can also specify specific branches for pull requests instead of running on all PRs (will run on PRs targeting those branches only).
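For example, a sketch restricting pull request runs to those targeting main as well:

on:
  pull_request:
    branches:
      - main
  push:
    branches:
      - main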

Running unit tests

Let’s set up a basic test. We will define a jobs dict, with a single job named “tests”. For all jobs, you need to select an image to run on - there are images for Linux, macOS, and Windows. We’ll use ubuntu-latest.

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install package
        run: python -m pip install -e .[test]

      - name: Test package
        run: python -m pytest

This has four steps:

  1. Checkout the source (your repo).
  2. Prepare Python 3.10 (will use a preinstalled version if possible, otherwise will download a binary).
  3. Install your package with testing extras - the runner image will be thrown away at the end of the run, so “global” installs are fine. We also provide a nice name for the step.
  4. Run your package’s tests.

By default, if any step fails, the run immediately quits and fails.

Running in a matrix

To test several Python versions and operating systems at once, use a matrix. The matrix variables are expanded so the job runs once for every combination, and fail-fast: false keeps one failing combination from canceling the others.

tests:
  strategy:
    fail-fast: false
    matrix:
      python-version: ["3.7", "3.10"]
      runs-on: [ubuntu-latest, windows-latest, macos-latest]
  name: Check Python ${{ matrix.python-version }} on ${{ matrix.runs-on }}
  runs-on: ${{ matrix.runs-on }}
  steps:
    - uses: actions/checkout@v3
      with:
        fetch-depth: 0 # Only needed if using setuptools-scm

    - name: Setup Python ${{ matrix.python-version }}
      uses: actions/setup-python@v4
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install package
      run: python -m pip install -e .[test]

    - name: Test package
      run: python -m pytest

Other actions

GitHub Actions has the concept of actions, which are just GitHub repositories of the form org/name@tag, and there are lots of useful actions to choose from (and you can write your own by composing other actions, or you can also create them with JavaScript or Dockerfiles). Here are a few:

There are some GitHub-supplied ones, like the actions/checkout, actions/setup-python, and actions/upload-artifact actions used in this lesson, and many other useful third-party ones, like pypa/gh-action-pypi-publish (covered in the publishing episode).

Key Points

  • Set up GitHub Actions on your project


Publishing package and citation

Overview

Teaching: 20 min
Exercises: 20 min
Questions
  • How do I publish a package?

  • How do I make my work citable?

Objectives
  • Learn about publishing a package on PyPI

  • Learn about making work citable

Building SDists and wheels

The build package builds SDists (source distributions) and wheels (build distributions). The SDist usually contains most of your repository, and requires your build backend (hatchling, in this case) to build. The wheel is “built”, the contents are ready to unpack into standard locations (usually site-packages), and does not contain configuration files like pyproject.toml. Usually you do not include things like tests in the wheel. Wheels also can contain binaries for packages with compiled portions.

You can build an SDist and a wheel (from that SDist) with pipx & build:

pipx run build

The module is named build, so python -m build is how you’d run it from nox. The executable is actually named pyproject-build, since installing a build executable would likely conflict with other things on your system.

This produces the wheel and SDist in ./dist.
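For the package from the earlier lessons you would expect to see something like this (the exact file names depend on your project's name and version):

$ ls dist
package-0.1.0-py3-none-any.whl  package-0.1.0.tar.gz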

Conda

Building for conda is quite different. If you just have a pure Python package, you should just use pip to install in conda environments until you have a conda package that depends on your package and wants to add it into its requirements.

If you do need to build a conda package, you’ll need to either propose a new recipe to conda-forge, or set up the build infrastructure yourself and publish to an anaconda.org channel.

Manually publishing

Do you need to publish to PyPI?

Not every package needs to go on PyPI. You can pip install directly from git, or from a URL to a package hosted somewhere else, or you can set up your own wheelhouse and point pip at that. Also an “application” like a website or other code you deploy probably does not need to be on PyPI.

You can publish files manually with twine:

pipx run twine upload -r testpypi dist/*

The -r testpypi tells twine to upload to TestPyPI instead of the real PyPI - remove this if you are not in a tutorial. You’ll also need to set up a token to upload the package with. However, the best way to publish is from CI. This has several benefits: you are always in a clean checkout, so you won’t accidentally include added or changed files; you have a simpler deployment procedure; and you have more control over who can publish in GitHub.

Create a noxfile to build

Given what you’ve learned about nox and build, write a session that builds packages for you.

Solution

import nox

@nox.session()
def build(session):
    session.install("build")
    session.run("python", "-m", "build")  # can use pyproject-build instead

Building in GitHub Actions

GitHub Actions can be used for any sort of automation task, not just running tests. You can use it to make your releases too! Combined with the version control feature from the previous lesson, making a new release can be a simple procedure.

Let’s first set up a job that builds the file in a new workflow:

# .github/workflows/cd.yml
on:
  workflow_dispatch:
  release:
    types:
    - published

This has two triggers. The first, workflow_dispatch, allows you to manually trigger the workflow from the GitHub web UI for testing. The second will trigger whenever you make a GitHub Release, which will be covered below. You might want to add builds for your main branch, as well. We will make sure uploads to PyPI only happen on releases later.

Now, we need to set up the builder job:

jobs:
  dist:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
      with:
        fetch-depth: 0

    - name: Build SDist & wheel
      run: pipx run build

    - uses: actions/upload-artifact@v3
      with:
        path: dist/*

We’ve seen the setup before. We are calling the job dist, using an Ubuntu runner, and checking out the code, including the git history so the version can be computed with fetch-depth: 0 (which can be removed if you are not using git versioning).

The next step builds the wheel and SDist. Pipx is a supported package manager on all GitHub Actions runners.

The final step uploads an Actions “artifact”. This allows you to download the produced files from the GitHub Actions UI, and these files are also available to other jobs. The default name is artifact, which is as good as any other name for the moment.

We could have combined the build and publish jobs if we really wanted to, but they are cleaner when separate, so we have a publish job as well.

publish:
  needs: [dist]
  runs-on: ubuntu-latest
  if: github.event_name == 'release' && github.event.action == 'published'

  steps:
  - uses: actions/download-artifact@v3
    with:
      name: artifact
      path: dist

  - uses: pypa/gh-action-pypi-publish@v1.5.1
    with:
      password: ${{ secrets.pypi_password }}

This job requires that the previous job completes successfully with needs:. It has an if: block as well that ensures that it only runs when you publish. Note that Actions usually requires ${{ ... }} to evaluate code, like github.event_name, but blocks that are always evaluated, like if:, don’t require manually wrapping in this syntax.

Then we download the artifact. You need to tell it the name: to download (otherwise it will download all artifacts into named folders). We used the default artifact so that’s needed here. We want to unpack it into ./dist, so we set the path: to that.

Finally, we use the PyPA’s publish action, which uses twine internally, and we set the password to a GitHub Secret, PYPI_PASSWORD, that we’ll generate from PyPI and add next. We will be using TestPyPI for this exercise.

PyPI Tokens and GitHub Secrets

To make a release from GitHub, you’ll need to set up authentication with PyPI by generating a token. Log into your test.pypi.org account and generate a token. If it’s the first time you’ve uploaded the project (it will be if you are following along), then it will need to be account-scoped. After you upload your first release, you can delete the old token and create a project-scoped one instead, which is safer.

API token button

Click your profile -> account settings, then click Add API token, under API tokens.

API token generation

Give it a name so you can identify it later, and select “Entire account” (you should replace it after uploading a package with a scoped token for that package).

API token view

Now copy it and go to GitHub, to your repository’s settings. Select Secrets -> Actions. Click “New repository secret”. Paste in your token here and give it the name PYPI_PASSWORD.

GitHub Actions Secrets page

You can now deploy to PyPI! Remember to delete your token and repeat this process with a scoped token once you’ve uploaded a package and can select it in the scope drop-down.

Making a release

A release on GitHub corresponds to two things: a git tag, and a GitHub Release. If you create the release first, a lightweight tag will be generated for you. If you tag manually, remember to create the GitHub release too, so users can see the most recent release in the UI and will be notified if they are watching your releases.

Click Releases -> Draft a new release. Type in or select a tag; the recommended format is v1.2.3; that is, a “v” followed by a version number. Give it a title, like “Version 1.2.3”; keep this short so that it will be readable on the web UI. Finally, fill in the description (there’s an autogenerate button that might be helpful).

When you release, this will trigger the GitHub Action workflow we developed and upload your package to TestPyPI!

Adding Zenodo

The CITATION.cff file

Key Points

  • CI can publish Python packages

  • Tagging and GitHub Releases are used to publish versions

  • Zenodo and CITATION.cff are useful for citations