The EDRReader is (almost) done
The last few weeks saw me working on the implementation of an auxiliary reader for energy data in the EDR format.
panedr and pyedr
Last time, I reported on my work on panedr and panedrlite, two packages with the same functionality but different dependencies. While this solution worked, it was not ideal. A discussion on the pros and cons of different improvements can be found here and is recommended reading for anyone trying similar things with packages and their dependencies. In the end, IAlibay implemented the generally preferred solution in PR #50 and created the panedr and pyedr packages. A huge Thank You for all the work!
Problem Statement: EDRReader
The light-weight pyedr package now exists and is ready to be used as a dependency in MDAnalysis. Finally, the time had come for work on the new AuxReaders to start in earnest.
The new EDRReader needs to do a number of things:
- Use pyedr to read EDR files and obtain their contents as a dictionary of NumPy arrays
- Make that data accessible via the Reader itself
- Allow association of the data with trajectories
- The addition of single terms, multiple terms, and all terms should be possible
The base classes in the auxiliary module already take care of a lot of the work here.
Importantly, I did not have to worry much about the association with the correct trajectory frames
itself, thanks to fiona-naughton’s work which she explained in her blog.
The main work lay in adapting these structures to handle the dictionary data type,
and to allow the specification of multiple terms at once. Work on the former point
mostly was focussed within the EDRReader itself,
while the latter point required modifying the add_auxiliary
method in the
coordinate-base module.
Implementation Details
This work is found in PR #3749.
Previously, the assumption was that the files to be read by AuxReaders would contain time-value pairs. It would be called as follows:
u.trajectory.add_auxiliary(auxname, auxdata, format=None, **kwargs):
The data found in auxdata
(either the file itself or an instance of an appropriate AuxReader)
would be added to appropriate time steps with the attribute name auxname
. This attribute can be
accessed through
u.trajectory.ts.aux.auxname
u.trajectory.ts.aux[auxname]
This does not work for EDR files. The problem is that with a larger number of terms,
adding everything under the same auxname
is not convenient. Therefore, I modified the method
as follows:
u.trajectory.add_auxiliary(auxname, auxdata, auxterm=None, format=None, **kwargs):
Users can now specify which of the energy terms to add to the trajectory under auxname
by providing an auxterm
argument to add_auxiliary
. I had to make it an optional argument
to preserve the XVGReader’s functionality.
A list of possible selections for auxterm
can be obtained from the EDRReader as follows:
aux = MDAnalysis.auxiliary.EDR.EDRReader("some_edr_file.edr")
aux.terms
This method now works as follows:
- When no
auxterm
is provided, the auxname must precisely match one of (or a list of)aux.terms
- to add a single term,
auxname
and optionallyauxterm
are provided as strings - to add many terms,
auxname
and optionallyauxterm
are provided as lists of strings - to add all terms found in the file, the auxname
"*"
can be provided.
Examples:
aux = MDAnalysis.auxiliary.EDR.EDRReader("some_edr_file.edr")
# Add the Temperature term to timesteps as a "temp" attribute
u.trajectory.add_auxiliary("temp", aux, "Temperature")
# no auxterm provided, auxname must match one of aux.terms precisely.
# Adds the data as the "Temperature" attribute.
u.trajectory.add_auxiliary("Temperature", aux)
# Adding multiple terms at once (with and without auxterm)
u.trajectory.add_auxiliary(["bond", "temp"], aux, ["Bond", "Temperature"])
u.trajectory.add_auxiliary(["Bond", "Temperature"], aux)
# Adding all data that's in the file
u.trajectory.add_auxiliary("*", aux)
Under the hood, a separate instance of an EDRReader is created for each term
that is added to a trajectory. This is necessary to allow updating all auxiliary data
when iterating over the trajectory or otherwise changing the time step. Before I did this,
when an instance of an EDRReader was passed to add_auxiliary
multiple times,
only the last energy term to be added would be updated when the time step changes.
The EDRReader now is fully functional and allows the association of EDR data with the appropriate time steps.
Testing
I have adapted the tests for the auxiliary module for testing the functionality of the EDRReader and also added further tests for handling more than one term at a time. There was a small challenge here: Since EDR is a binary file format, I could not generate a basic test file as used for testing the XVGReader. I therefore had to create a small EDR file from a real simulation, and adapt the tests appropriately. The most significant changes were that the timestep was off, and that I only added one of the many energy terms for most of the tests.
Lessons Learned
I was very glad to properly sink my teeth into the code these past few weeks. This is my largest contribution yet, and I have learned a lot about working with other peoples’ code and building on existing functionality.
Future Goals
The next steps in this project will be:
- adding type hinting to the new modules/functions
- Implementing a method to obtain the energy data without associating with a trajectory
- Describe how to slice trajectories based on auxiliary data
- Writing the Documentation
Once this is taken care of, I will then move on to an AuxReader for NumPy arrays