Revenge of the Package
In my last post, I wrote about the work I had done on panedr to make it importable in MDAnalysis, and the challenges that remained to be addressed. Over the last two weeks, I have made enough progress on the refactoring and repackaging of panedr that I am now able to start work on the EDRReader in earnest.
Refactoring of panedr
In my last blog post, I described how panedr works under the hood by
progressively reading bytes from the binary EDR file and sorting the information
in appropriate Python data structures, ultimately returning a pandas DataFrame.
I am happy to say that my refactoring of this process in PR #33
is now merged. In addition to a DataFrame, returning the energy data as a dictionary
of NumPy arrays is now supported through edr_to_dict()
. This function does not
require pandas, making the package optional for this use case.
Repackaging panedr for dependency management
In order to actually allow the module to be used without pandas installed, the package structure had to be changed a bit. I worked on this in PR #42. The boundary conditions to be met were the following:
pip install panedr
should work as it does now, installing all dependencies for all functions- Installing the package without pandas needs to be possible
- The code should be in one location and not duplicated
The way I have addressed these conditions is by creating two packages that sit
in parallel in the panedr repository: panedr
and panedrlite. All the code has moved from panedr to panedrlite, but it’s setup.cfg
was modified to make pandas an extras_require. Installing panedrlite thus installs
the package and provides most of the functionality out of the box, but does not
install pandas by default. panedr, on the other hand, is an empty metapackage. It
bundles panedrlite and pandas in its dependencies, allowing full functionality on
installation.
However, this is unfortunately not yet the final state for panedr packaging, as my
solution comes with a significant, difficult to fix, problem: installing panedrlite
exposes import panedrlite
, not import panedr
. One of the problems this causes
is that it means panedrlite would be installed for users who want to use the functionality
even if they have panedr already installed. There are a number of possible solutions
to this problem, but they each have drawbacks, and the discussion of which solution is best
is still ongoing.
Codecov growing pains
When I moved the code from panedr to panedrlite as part of PR #42, codecov started reporting 0 % coverage. This is weird, because the unit tests definitely still run and pass (as shown by CI). Initially, we thought the issue might solve itself on merging the PR, but this was unfortunately not the case, and at the time of writing, the repository proudly reports 0 % coverage. In the end, IAlibay found a solution for the problem, but I have to admit that the reason for why codecov stopped working and how these changes fix it again are beyond me.
Work on the EDRReader has started
While some changes are yet to come to panedr, the refactoring is now implemented and unlikely to change in the near future. As such, I was now finally able to start work on an EDRReader implementation. This is very much a work in progress, but this skeleton now already allows EDR files to be read in MDAnalysis, which I am happy about.
Lessons Learned
Now four weeks into my GSoC project, some of the things I learned include:
- there is always more to be learned about packaging
- CI is dark magic and CI experts are wizards
- Having worked on the EDRReader, I have now understood Python’s
super()
function better
Future Goals
The next part of my project will now focus on working on the EDRReader in MDAnalysis. Because panedr now returns data as a dictionary of numpy arrays, I am also hoping I might be able to apply some of my work on the EDRReader on a NumPy AuxReader as well.