First Things First
As I mentioned in my last blog post, the first step in my GSoC project is a minor rewrite of panedr to make pandas an optional dependency, as the number of mandatory dependencies of MDAnalysis should be kept as small as possible. In today’s blog post, I’ll report on my progress in this endeavour and mention some of the things I have learned in the process.
Automated Testing with GitHub Actions
Before starting work on the actual code, we wanted to change the continuous integration workflow. In particular, we wanted to make the switch from Travis CI to GitHub Actions. On my part, this involved creating a workflow file at panedr/.github/workflows/gh-ci.yaml as part of PR #32. I am happy to report that this worked out of the gate, and every commit on pull requests is now tested in GitHub Actions. Also, badges on the repository’s front page now inform the user of the passing tests and a respectable 82 % test coverage, with most misses happening because only the most recent file format (used for over 15 years at this point) is currently tested.
Managing Dependencies
The goal of this work is to make the functionality of panedr available without requiring pandas to be installed. Initially, I changed the package so that pandas was an optional dependency that could be installed along with panedr using python -m pip install panedr[pandas], where omitting the [pandas] would install panedr without it. However, as discussed in more detail in #34, this would be a breaking change to the way panedr currently works. Instead, my current goal is to have two packages in the repository, panedrlite and panedr. All of the code will be part of panedrlite. panedr will be a metapackage, with panedrlite[pandas] as one of its dependencies. This way, users can still install and import panedr just as they can now, and MDAnalysis can include panedrlite without pandas.
The dependencies of panedr are managed in requirements.txt and setup.cfg, and these files have been changed / will be changed according to these new specifications. One addition to the requirements is NumPy, which is now installed with the same requirements that MDAnalysis specifies.
Refactoring the panedr Code
The other steps described in this blog post are more (learning about) package management than coding. This is the Summer of Code, however, so now I’ll talk about the changes to the panedr code that I have made thus far.
GROMACS writes EDR files according to the eXternal Data Representation (XDR) protocol. Under the hood, panedr uses the xdrlib module from the Python standard library to decode the binary files.
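To give a feel for what decoding XDR involves, here is a minimal sketch that uses only the standard library's struct module. XDR stores integers and floating-point numbers in big-endian byte order and pads strings to a multiple of four bytes. The buffer layout below (a version number, a term name, one value) is made up for illustration and is not the actual EDR layout, and the helper names are hypothetical. The sketch uses struct rather than xdrlib because xdrlib was deprecated in Python 3.11 and removed in 3.13.

```python
import struct


def unpack_int(buf, offset):
    """Read a 4-byte big-endian signed integer."""
    (value,) = struct.unpack_from(">i", buf, offset)
    return value, offset + 4


def unpack_double(buf, offset):
    """Read an 8-byte big-endian IEEE 754 double."""
    (value,) = struct.unpack_from(">d", buf, offset)
    return value, offset + 8


def unpack_string(buf, offset):
    """Read an XDR string: 4-byte length, then bytes padded to a multiple of 4."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    text = buf[start:start + length].decode("ascii")
    padded = (length + 3) // 4 * 4  # round the length up to a multiple of 4
    return text, start + padded


# Made-up example buffer, NOT the real EDR layout.
buf = (
    struct.pack(">i", 5)                                 # a "version" number
    + struct.pack(">I", 9) + b"Potential" + b"\x00" * 3  # a padded string
    + struct.pack(">d", -1234.5)                         # one energy value
)

offset = 0
version, offset = unpack_int(buf, offset)
name, offset = unpack_string(buf, offset)
value, offset = unpack_double(buf, offset)
print(version, name, value)
```

Each helper returns the decoded value together with the new read position, which is roughly the bookkeeping that an unpacker object does internally.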
When the user calls the edr_to_df() function, an EDRFile object is created. During object instantiation, the EDR file passed to edr_to_df() is read and the binary content is passed to an instance of GMX_Unpacker, which inherits from xdrlib’s Unpacker.
Next, the first few bytes of the binary file are read. They contain information on the file version, precision (single or double), and which energy terms are present in the file. After initialisation, the file is read iteratively, and the energy values of each frame are stored in a nested list. Once the last byte is read, the iteration stops. edr_to_df() then returns the nested list as a pandas dataframe.
Luckily, this setup lends itself well to refactoring. Pandas is only used in the very last step, after the file has been read in its entirety. In PR #33, I’ve been working on just this.
To make panedr functional without pandas, I moved the parsing of the file to a new function read_edr(). This function returns the nested lists that contain the energy values for each frame. edr_to_df() then just has to call read_edr() and assemble its return values into the dataframe. The function can also check whether pandas is installed in a try-except statement.
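The split between parsing and dataframe assembly, together with the try-except check, could look roughly like the sketch below. It is not the panedr code: read_energies_stub() is a hypothetical stand-in that returns made-up values instead of parsing a file, and the return format (a list of names plus a list of per-frame rows) is an assumption for illustration.

```python
def read_energies_stub():
    """Hypothetical stand-in for the parsing step; returns made-up data."""
    names = ["Time", "Potential"]
    frames = [[0.0, -10.0], [1.0, -11.5]]  # one row per frame
    return names, frames


def energies_to_df():
    """Assemble the parsed values into a dataframe, if pandas is available."""
    try:
        import pandas as pd  # imported lazily so pandas stays optional
    except ImportError as err:
        raise ImportError(
            "pandas is required for dataframe output; "
            "install pandas or use a pandas-free function instead"
        ) from err
    names, frames = read_energies_stub()
    return pd.DataFrame(frames, columns=names)
```

Because pandas is imported inside the function, importing the module itself never fails on a system without pandas; only the dataframe function does, and it does so with a clear message.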
Another new function contains the functionality that I want to use for my auxiliary reader in MDAnalysis: edr_to_dict() takes the nested lists that read_edr() returns and assembles them into a dictionary of NumPy arrays. This data structure will be straightforward to work with further downstream.
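As a rough illustration of this data structure, the sketch below turns per-frame rows into a dictionary with one NumPy array per energy term. The function name frames_to_dict() and the input format are assumptions made for this example and do not mirror the actual panedr signatures; the values are made up.

```python
import numpy as np


def frames_to_dict(names, frames):
    """Map each energy term name to a 1-D array with one value per frame."""
    data = np.asarray(frames, dtype=float)  # shape: (n_frames, n_terms)
    return {name: data[:, i] for i, name in enumerate(names)}


# Made-up values: two frames, two energy terms.
energies = frames_to_dict(
    ["Time", "Potential"],
    [[0.0, -10.0], [1.0, -11.5]],
)
print(energies["Potential"])
```

Keying by term name means that downstream code can pick out a single time series, such as energies["Potential"], without knowing the column order in the file.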
I have also written a test for edr_to_dict(). The test compares the dictionary to the dataframe, and because the two contain the same values, we can be confident that the function is indeed working as intended.
Lessons Learned
In the last two weeks, I have learned quite a lot, including:
- more details on what actually makes a Python package
- how to manage dependencies of packages
- how to make packages installable
- CI workflows in GitHub Actions
- what the XDR protocol is and how it is read
Also: I already knew this, of course, but I was once again reminded that tasks often take much longer than I anticipate before starting out.
While improving my understanding of Python packaging was and is certainly important, I am now looking forward to wrapping up this part of the project and moving on to the implementation in MDAnalysis.
Future Goals
The immediate next steps still concern the panedr package:
- Set up the panedr metapackage and panedrlite, which holds the code
- Get PR #33 merged to get panedrlite ready for inclusion in MDAnalysis
- Write documentation for the new functions
I am hoping to finish this work over the next week, so I can then use panedrlite in MDAnalysis and get started on the EDRReader in earnest.