Overhauling the Auxiliary API
My work on the EDRReader has given us some ideas on how the process of attaching auxiliary data could be changed in general, and that is what I have been working on this week.
Adding Auxiliary Data, Take One
As a reminder, this is how adding EDR data worked as of my previous blog post on my work in PR #3749:
u.trajectory.add_auxiliary(auxname, auxdata, auxterm=None, format=None, **kwargs):
- When no
auxtermis provided, the auxname must precisely match one of (or a list of)aux.terms - to add a single term,
auxnameand optionallyauxtermare provided as strings - to add many terms,
auxnameand optionallyauxtermare provided as lists of strings - to add all terms found in the file, the auxname
"*"can be provided.
This worked, but it wasn’t ideal. A number of problems with this approach existed, including but not limited to:
- It was confusing to have
auxnamehave different requirements depending on whether or notauxtermwas provided - The fact that strings and lists could be provided for both of these arguments required a number of checks, adding a lot of code
- Using
"*"to signify ‘everything’ is not very pythonic. A more natural way to add all data found in the file is to simply specifyauxdata. This meansauxnameshould be optional and default toNone. MDAnalysis.coordinates.base.add_auxiliarywas made quite complicated by checking for the type of AuxReader inauxdataand calling a separate method for attaching the data for each Reader. This would have been especially problematic on addition of more and more readers.
Adding Auxiliary Data, Take Two
AuxReader who?
One thing I did to improve on this is to move where in the code the auxiliary
data actually is associated with the trajectory. Previously, MDAnalysis.coordinates.base.add_auxiliary
would add the auxiliary data itself, and I had then changed it so the method would instead
call the appropriate file format specific method.
We have discussed this change, and really, there is no reason why add_auxiliary()
should need to know about which format it is dealing with, or even which formats
currently are supported. Instead, the code now makes use of a useful code pattern:
class Parent:
def upper_method(self, child_obj, ...):
child_obj.lower_method(self, ...)
class Child:
def lower_method(self, parent_obj):
parent_obj.add_attribute(...)
Applied to MDAnalysis, the Parent class is the trajectory of the Universe to which
auxiliary data is added, and the upper_method is add_auxiliary. The role of the Child class
is taken by the AuxReader Base class.
What this pattern allows is the manipulation of the Parent object through the Child
object, as it passes itself to its child through upper_method. It is thus available to
the Child.
I have now written an attach_auxiliary() method of the base AuxReader class, which
receives its parent trajectory as an argument and allows attaching correctly configured
AuxReader instances to this parent. This simplified add_auxiliary massively. Also, the
auxiliary module is a more natural home for the code that does the actual association than the coordinate module.
Resolving the auxterm-auxname-Confusion
To avoid the confusion that the inconsisten requirements of what auxname should be,
the two parameters were replaced by a single aux_spec parameter.
aux_spec should be a dictionary that maps the desired attribute names to the precise
names of the data in the auxiliary file (formerly auxterm). At the same time, this
removed the need to check if one or many terms are to be added, so whether a string or a list
is provided.
The way this works now is as follows:
import MDAnalysis as mda
from MDAnalysisTests.datafiles import AUX_EDR, AUX_EDR_TPR, AUX_EDR_XTC
term_dict = {"temp": "Temperature", "epot": "Potential"}
aux = mda.auxiliary.EDR.EDRReader(AUX_EDR)
u = mda.Universe(AUX_EDR_TPR, AUX_EDR_XTC)
u.trajectory.add_auxiliary(term_dict, aux)
There is one exception to this behaviour: For reasons of backwards compatibility,
it is also possible to provide a string as aux_spec. In this case, the attach_auxiliary
method will create a dictionary mapping this string value to None. This maintains the current default
behaviour of the XVGReader, where passing a string as auxname causes everything found in the XVG
file to be added under auxname. The value of the aux_spec key-value-pairs is
set to the AuxReaders data_selector, with None causing everything to be added.
This pattern will also generalise well to other file formats. The values of the dictionary could, for
example, also be column indices of a CSV file.
More Pythonic Syntax
Adding all terms by providing "*" as an argument is not very pythonic. A better
behaviour would be one were providing no aux_spec at all would add everything.
This could be implemented by making aux_spec an optional argument that defaults to None.
However, there is a problem: auxdata isn’t optional, so the order of the function parameters has to be reversed:
add_auxiliary(auxname, auxdata)
would become
add_auxiliary(auxdata, auxname=None)
But because the XVGReader currently expects auxname as the first argument and auxdata as the second,
this change would break the XVGReader.
I identified four possible solutions for this:
- We change the behaviour of the XVGReader
- We use unpythonic syntax
- We explicitly provide None as an argument to add everything, which is the opposite of what I would expect.
- We make both
auxnameandauxdataoptional and default toNone, but raise an exception when noauxdatais provided.
Currently, the PR reflects solution 4 and looks like this:
def add_auxiliary(self,
aux_spec: Union[str, Dict[str, str]] = None,
auxdata: Union[str, AuxReader] = None,
format: str = None,
**kwargs) -> None:
This implementation seemed to me the best option. The XVGReader still works as before because
the order of the parameters is maintained. The "*" no longer is an option, therefore the code is more pythonic.
And not specifying aux_spec and therefore passing None implicitly works as expected, adding all data.
There is a minor downside: Because auxdata is the second parameter of add_auxiliary (and cannot be first parameter to maintain XVGReader functionality),
auxdata=aux has to be passed as argument, rather than just aux, when no aux_spec is provided.
To add everything, the method is now used as follows :
u.trajectory.add_auxiliary(auxdata=aux)
Lessons Learned
The code pattern of a Parent class instance passing itself to a Child class instance is super useful!
Saying something is “(almost) done” before review is unwise
Future Goals
With these changes, the most critical issues mentioned in the PR reviews are addressed. Next, I’ll address the remaining comments, which should then move the EDRReader closer to being merged. Thereafter, work on the NumPy array reader will start.