Building a Python Package

A simple and easy to follow recipe!

Courtesy of Google: Possibly the most suitable image for this blog :)

A few weeks ago, I was tasked with building an application that had to be hosted as a Python package. After numerous iterations of code reviews and pull requests, we deployed the final version, and it was very different from the initial one.

So, what changed? The answer lies in my favorite quote from Martin Fowler:

Any fool can write code that a computer can understand. Good programmers write code that humans can understand.

In this blog, I am going to share the recipe for creating a Python package the right way. So, let’s get started…

A Python package is a collection of modules. Modules that are related to each other are usually put in the same package. When a module from an external package is required in a program, that package can be imported and its modules can be put to use.

Elements:

Before we dive deep into the code structuring, let’s look at the basic housekeeping items you would need. Just a few items that I specifically wanted to talk about (of course not the entire list):

The key components in the journey

Code Editors: While the coding nerds might disagree, I highly recommend using an editor (IDE) for all your coding purposes. Code editors like PyCharm (my fav) were built for a purpose, and they have stuck around because of their efficiency. With just the click of a button, you can perform essential activities like auto-indenting lines, reformatting an entire file, optimizing imports and much more.

I am a huge fan of the inline warnings that help you identify mistakes then and there. If I had to speak for “my why”, the editors do a fabulous job of aligning the code to the Python style standard (PEP 8), or at least act as a catalyst in doing so. There’s just so much that I could write about it.

And needless to say, the integration of external scripts (say, for linting) and version control (Git, etc.) provides the muscle to do everything in one place. Now I understand why editors are often termed “Integrated” Development Environments (IDEs).

Bonus: Check out my blog on the cookie-cutter instructions to integrate Pylint with PyCharm, along with the different stages of using Pylint efficiently, here.

Version Control: Apart from safeguarding your development efforts from vanishing because of a system failure, a version control system (like Git, hosted on GitHub) is a must-have for all code changes, especially with feature releases and bug fixes. Even after rigorous test cases and detailed code reviews, stumbling across unpredicted issues and bugs isn’t a surprise. The ability to roll back to a previous version is non-negotiable.

Cloud hosting platforms for version control, like GitHub, let you and others work together on projects from anywhere. Their service hooks enable seamless continuous integration (CI). Integrating GitHub with a CI/CD platform (like Codefresh, Azure DevOps, etc.) can automate the entire build pipeline. You can set up the pipelines to run the test cases, run lint checks, build the package and deploy it to an artifactory with every pushed commit.

In the flow chart below, I have outlined the important steps and how they fit into each of the components that we are discussing.

Python Package Development Lifecycle

CI-CD Platform: The major chunk of packaging and shipping the package is undertaken by this component. As shown in the above diagram, the CI cycle is kicked off by the Git service hooks (pushed commits, PR merges, etc.). The CI pipelines should consist of, but not be limited to, the steps shown, i.e. the unit tests, the lint checks, and the build and publish of the package.

The seamless integration of modern version control hosts with cloud-based CI platforms makes the entire lifecycle as continuous as it can be. I will say more about unit tests and lint checks in the sections below.

Artifactory: Also known as a repository, this is the last component in the package development lifecycle. Think of it as a library where people can search for books and borrow them for use. Artifactories can be public (PyPI) as well as private (internal to an organization, e.g. the JFrog platform hosted within a company’s IT infrastructure). Artifactories store the published packages and make them available for use. They also maintain the versioning of the package, which is implicitly set by the CI pipelines.

Now that we are good with the housekeeping stuff, let’s proceed to look deeper into the package itself…

Structure

The bare-bones structure for a package should look like the one below. Let’s call the package in discussion “my_simple_package”.

my-simple-package-root
├── my_simple_package
│ ├── __init__.py
│ ├── __main__.py
│ ├── module_a.py
│ ├── module_b.py
│ ├── -----------
│ ├── module_z.py
│ ├── my_simple_sub_package
│ | ├── module_a.py
│ | ├── module_b.py
├── tests
│ ├── test_module_a.py
│ ├── test_module_b.py
│ ├── ----------------
│ ├── test_module_z.py
│ ├── test_my_simple_sub_package
│ | ├── test_module_a.py
│ | ├── test_module_b.py
├── CHANGELOG.md
├── DEVELOPING.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── requirements.txt
├── setup.cfg
├── setup.py

Root

As illustrated above, the root directory should not contain any of the code files that go into the package. Rather, the content to be deployed as part of the package should live in a folder named after the package. The root folder should mostly contain everything that’s needed to develop, support and test the package contained within it.

Let’s discuss each of the files briefly:

my_simple_package:

This is the core folder containing all the code that goes into the package. It is suggested that you design your modules (e.g. module_a, module_b, etc.) in such a way that they can be tested independently. Further, the modules should be logically grouped.

__init__.py

The __init__.py files are required to make Python treat directories containing the file as packages (unless you are using a namespace package, a relatively advanced feature). This prevents directories with a common name, such as string, from unintentionally hiding valid modules that occur later on the module search path.

It is recommended to have the __init__.py empty. You can read more about the various structures here.

I liked the below example referenced from Reddit:

Assuming your package looks like this:

package
|-- __init__.py
|-- subpackage.py

An empty __init__.py won't let you do this:

import package
instance = package.subpackage.Class()

But it will let you do this:

import package.subpackage
instance = package.subpackage.Class()
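
To see this behaviour for yourself, here is a small self-contained sketch. It builds a throwaway package in a temporary folder (the name demo_pkg is made up for this illustration) with an empty __init__.py, and then tries both import styles:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: demo_pkg/__init__.py (empty) and
# demo_pkg/subpackage.py. The name demo_pkg is hypothetical.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "subpackage.py").write_text("class Class:\n    pass\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_pkg

# With an empty __init__.py the submodule has not been loaded yet,
# so attribute access on the package fails.
try:
    demo_pkg.subpackage
    attribute_found = True
except AttributeError:
    attribute_found = False

# Importing the submodule explicitly binds it as an attribute of the package.
import demo_pkg.subpackage

instance = demo_pkg.subpackage.Class()
print(attribute_found, type(instance).__name__)  # False Class
```

If you do want the first style to work, one option is to import the submodule inside __init__.py, at the cost of the file no longer being empty.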

__main__.py

To make a single-file Python program executable, we write the if statement below. Say you have a file named test_main.py:

def main():
    print("Hello Main")


if __name__ == "__main__":
    main()

When you run a program by calling the Python interpreter on it, the magic global variable __name__ gets set to "__main__". Thus, when we execute python test_main.py, main() is called.

But what if we had multiple modules (a Python package) and wanted to define an entry point to make the package executable? This is done using __main__.py. The magic file __main__.py is run when you invoke your package with the -m flag. If your code is intended to be used as a module first and as a command-line interface second, this makes perfect sense. Think of it as the place to put whatever would be in the body of our if __name__ == "__main__" statement.

Thus, to run our example package, we can invoke it as:

$ python -m my_simple_package
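
As a rough sketch, a __main__.py for my_simple_package could look like the following. The main function and its optional name argument are assumptions made for illustration; your own entry point will do whatever your package needs:

```python
"""Hypothetical contents of my_simple_package/__main__.py."""
import sys


def main(argv=None):
    """Entry point: greet the name given as the first argument, if any."""
    args = sys.argv[1:] if argv is None else argv
    name = args[0] if args else "Main"
    message = f"Hello {name}"
    print(message)
    return message


if __name__ == "__main__":
    main()
```

Keeping the logic inside a function that accepts the argument list makes the entry point easy to call from a test without spawning a new interpreter.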

Module Files (module_a.py, etc.)

Here goes all the core code for your package. A few things to consider while developing this:

  1. Keep functions independent of each other.
  2. Ensure functions are logically grouped in modules.
  3. Name functions so that they convey what they do.
  4. For function names, use lowercase with words separated by underscores as necessary to improve readability.
  5. Use one leading underscore only for non-public methods and instance variables.
  6. It’s good practice to split the package into multiple modules if the application is large enough.

Tests (test_module_a.py, etc.)

Perhaps the most important part of any package or application. Test cases assert the correctness of the code. You should aim for at least 75% code coverage. Code coverage is the percentage of lines of code that are executed when the test cases run, out of the total number of lines in the code.
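
To make that definition concrete, here is a minimal sketch of the arithmetic. The function name and the numbers are made up; in practice a coverage tool computes this for you:

```python
def line_coverage(covered_lines, total_lines):
    """Return line coverage as a percentage of the total lines."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 100.0 * covered_lines / total_lines


# Hypothetical numbers: the tests executed 150 of 200 lines.
print(line_coverage(150, 200))  # 75.0
```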

While the built-in unittest module would suffice for writing and executing test cases, I highly recommend using pytest for unit tests. It makes writing tests super easy with friendly features like fixtures, parametrization and much more.

Let’s look at an example from the pytest docs:

import pytest


@pytest.mark.parametrize("test_input,expected", [("3+5", 8), ("2+4", 6), ("6*9", 42)])
def test_eval(test_input, expected):
    assert eval(test_input) == expected

In the above example, pytest.mark.parametrize runs the test_eval function against the 3 sets of inputs and expected outputs. Note that the third case fails on purpose in the pytest docs, since 6*9 evaluates to 54, not 42. Without parametrization, you would typically have to write these cases individually, thrice! Read more about pytest here.

As a rule of thumb, every change must be accompanied by test cases. And all test cases must be run to check coverage and make certain that the code works correctly. This helps to detect and protect against bugs in the future.

CHANGELOG.md

In the software industry a changelog, as the name suggests, is a file that logs all the changes made to a specific software program. The reason for creating and keeping a changelog is simple; when a contributor or end-user wants to see if any changes have been made to a software program, they can do that easily and precisely by reading the changelog. All they need to do is go to the changelog and it will show what, and when, any changes were made between the different versions or releases of the particular software. — Changelog.md

DEVELOPING.md

This file is intended to set the guidelines for all the contributors to the package. It should cover the below points:

  1. How to set up the development environment.
  2. The development lifecycle.
  3. Any development dependencies.
  4. Running Examples.
  5. Running Tests.
  6. Adding Examples.
  7. Requirements for merging PR.

You can refer to sample ones on Github here.

LICENSE

It’s important for every package uploaded to the Python Package Index to include a license. This tells users who install your package the terms under which they can use your package. As suggested in the python-packaging, for help picking a license, you can refer to choosealicense.com

MANIFEST.in

A MANIFEST.in file consists of commands, one per line, instructing setuptools to add or remove some set of files from the sdist. By default only a minimal set of files is included in the sdist. You may find yourself wanting to include extra files in the source distribution, such as an authors/contributors file, data files used for testing purposes, or screenshots for reference. All these extra files must be added in the MANIFEST.in file.

First, clarify whether there are files to add or files to exclude. If neither is needed, there is no need for a MANIFEST.in.

README.md

A README is a text file that introduces and explains a project. It contains information that is commonly required to understand what the project is about.

The contents of a README.md typically include one or more of the following:

  • Configuration instructions
  • Installation instructions
  • Operating instructions
  • A file manifest (list of files included)
  • Copyright and licensing information
  • Contact information for the distributor or programmer
  • Known bugs
  • Troubleshooting
  • Credits and acknowledgments
  • A changelog (usually for programmers)
  • A news section (usually for users)

Do look at makeareadme.com

requirements.txt

Python packages are all about reusability. The requirements.txt file is used to specify which other Python packages the current project depends on.

Example requirements.txt:

Flask==0.8
Jinja2==2.6
Werkzeug==0.8.3
certifi==0.0.8
chardet==1.0.1
distribute==0.6.24
gunicorn==0.14.2
requests==0.11.1

Note:

  1. It should contain only the required packages.
  2. Pin a version for each dependency, i.e. use Flask==0.8 instead of just Flask. The == operator binds the codebase to one exact, known version of each dependency (use >= if you only want to set a minimum version), which helps avoid code breaks or bugs caused by upstream changes.
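
If you want to check this automatically, a simple script can flag requirement lines that are not pinned. The sketch below is only an illustration: the PINNED pattern and the unpinned_requirements helper are made up here and deliberately simplified, since real requirement specifiers can be richer than name==version:

```python
import re

# Matches "name==version" (a simplified, hypothetical pattern; real
# requirement specifiers are richer than this).
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+==[A-Za-z0-9_.\-]+$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            problems.append(line)
    return problems


sample = "Flask==0.8\nJinja2==2.6\nrequests\ngunicorn>=0.14\n"
print(unpinned_requirements(sample))  # ['requests', 'gunicorn>=0.14']
```

A check like this could run as one of the lint steps in the CI pipeline discussed earlier.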

setup.cfg & setup.py

Python projects are commonly packaged using setuptools. setup.py is the build script for setuptools. It tells setuptools about your package (such as the name and version) as well as which code files to include. Setuptools also allows using configuration files (usually setup.cfg) to define a package’s metadata and other options that are normally supplied to the setup() function (declarative config).

Example setup.cfg:

[metadata]
name = my_simple_package
version = 1.0.0
description = My Simple Package's Description
long_description = file: README.md, CHANGELOG.md, DEVELOPING.md, LICENSE

[options]
zip_safe = False
include_package_data = True
packages = find:
install_requires =
    requests==2.24.0

[options.packages.find]
# Search the project root (where my_simple_package lives) and skip the tests.
where = .
exclude =
    tests*

[options.extras_require]
pdf = ReportLab>=1.2; RXP
rest = docutils>=0.3; pack ==1.1, ==1.3
Example setup.py:

import setuptools

if __name__ == "__main__":
    setuptools.setup()

setuptools.setup(), by default, looks for a setup.cfg in the same folder. The configs provided in setup.cfg are essentially the parameters that would be passed to setuptools.setup(). Hence, if the project contains a setup.cfg as well as parameters passed to setuptools.setup(), the parameters override the values given in setup.cfg.

An extensive reference sheet for the keywords available to use under setup.cfg can be found here.
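
Since setup.cfg is an INI-style file, a quick way to sanity-check what it declares is to read it with the standard library's configparser. This is only a sketch for inspection, using a trimmed copy of the example above; it does not reproduce everything setuptools does with the file (for instance the file: and find: directives are left as plain strings):

```python
import configparser

# A trimmed copy of the example setup.cfg from this section.
SETUP_CFG = """
[metadata]
name = my_simple_package
version = 1.0.0

[options]
packages = find:
install_requires =
    requests==2.24.0
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

name = parser["metadata"]["name"]
version = parser["metadata"]["version"]
# Multi-line values come back as one string; split them into a list.
requires = parser["options"]["install_requires"].split()

print(name, version, requires)
```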

This wraps up the list of files and resources that go into a standard Python package. Let’s move to the last section, which details the steps to build and publish your package to the artifactory.

Build & Publish

After you have tested and versioned your code, it’s ready to be published to an artifactory. The steps for publishing a package are the same irrespective of the artifactory.

The final step; build and deploy.

Build Package

You can choose to make a source distribution of your package by running:

python setup.py sdist

Alternatively, you can make a built distribution (wheel) of your package by running:

python setup.py bdist_wheel

Publish Package

You can use twine to upload your package to PyPI or any other artifactory. Install the twine package first (pip install twine), and then publish the package using the command below:

twine upload dist/*

You will be prompted to enter your username and password. If the upload is successful, it will spit out the URL for your package.

And now, your package is available on the artifactory!

Building a Python package may seem challenging and horrifying the first time… or at least I felt that way when I first built one. However, with the right approach, things are fairly simple and easy to understand.

I hope this blog was helpful. Looking forward to hearing your comments and suggestions.

Cheers!

Programmer & Architect @ Deloitte in Python, Big Data, Azure/AWS by Profession. Biker, Chef & Philanthrope by Passion. Student Pilot.