Overhauling the Auxiliary API
My work on the EDRReader has given us some ideas on how the process of attaching auxiliary data could be changed in general, and that is what I have been working on this week.
Adding Auxiliary Data, Take One
As a reminder, this is how adding EDR data worked as of my previous blog post on my work in PR #3749:
u.trajectory.add_auxiliary(auxname, auxdata, auxterm=None, format=None, **kwargs):
- When no
auxterm
is provided, the auxname must precisely match one of (or a list of)aux.terms
- to add a single term,
auxname
and optionallyauxterm
are provided as strings - to add many terms,
auxname
and optionallyauxterm
are provided as lists of strings - to add all terms found in the file, the auxname
"*"
can be provided.
This worked, but it wasn’t ideal. A number of problems with this approach existed, including but not limited to:
- It was confusing to have
auxname
have different requirements depending on whether or notauxterm
was provided - The fact that strings and lists could be provided for both of these arguments required a number of checks, adding a lot of code
- Using
"*"
to signify ‘everything’ is not very pythonic. A more natural way to add all data found in the file is to simply specifyauxdata
. This meansauxname
should be optional and default toNone
. MDAnalysis.coordinates.base.add_auxiliary
was made quite complicated by checking for the type of AuxReader inauxdata
and calling a separate method for attaching the data for each Reader. This would have been especially problematic on addition of more and more readers.
Adding Auxiliary Data, Take Two
AuxReader who?
One thing I did to improve on this is to move where in the code the auxiliary
data actually is associated with the trajectory. Previously, MDAnalysis.coordinates.base.add_auxiliary
would add the auxiliary data itself, and I had then changed it so the method would instead
call the appropriate file format specific method.
We have discussed this change, and really, there is no reason why add_auxiliary()
should need to know about which format it is dealing with, or even which formats
currently are supported. Instead, the code now makes use of a useful code pattern:
class Parent:
def upper_method(self, child_obj, ...):
child_obj.lower_method(self, ...)
class Child:
def lower_method(self, parent_obj):
parent_obj.add_attribute(...)
Applied to MDAnalysis, the Parent
class is the trajectory of the Universe to which
auxiliary data is added, and the upper_method
is add_auxiliary
. The role of the Child
class
is taken by the AuxReader Base class.
What this pattern allows is the manipulation of the Parent
object through the Child
object, as it passes itself
to its child through upper_method
. It is thus available to
the Child
.
I have now written an attach_auxiliary()
method of the base AuxReader class, which
receives its parent trajectory as an argument and allows attaching correctly configured
AuxReader instances to this parent. This simplified add_auxiliary
massively. Also, the
auxiliary module is a more natural home for the code that does the actual association than the coordinate module.
Resolving the auxterm
-auxname
-Confusion
To avoid the confusion that the inconsisten requirements of what auxname
should be,
the two parameters were replaced by a single aux_spec
parameter.
aux_spec
should be a dictionary that maps the desired attribute names to the precise
names of the data in the auxiliary file (formerly auxterm
). At the same time, this
removed the need to check if one or many terms are to be added, so whether a string or a list
is provided.
The way this works now is as follows:
import MDAnalysis as mda
from MDAnalysisTests.datafiles import AUX_EDR, AUX_EDR_TPR, AUX_EDR_XTC
term_dict = {"temp": "Temperature", "epot": "Potential"}
aux = mda.auxiliary.EDR.EDRReader(AUX_EDR)
u = mda.Universe(AUX_EDR_TPR, AUX_EDR_XTC)
u.trajectory.add_auxiliary(term_dict, aux)
There is one exception to this behaviour: For reasons of backwards compatibility,
it is also possible to provide a string as aux_spec
. In this case, the attach_auxiliary
method will create a dictionary mapping this string value to None
. This maintains the current default
behaviour of the XVGReader, where passing a string as auxname
causes everything found in the XVG
file to be added under auxname
. The value of the aux_spec
key-value-pairs is
set to the AuxReaders data_selector
, with None
causing everything to be added.
This pattern will also generalise well to other file formats. The values of the dictionary could, for
example, also be column indices of a CSV file.
More Pythonic Syntax
Adding all terms by providing "*"
as an argument is not very pythonic. A better
behaviour would be one were providing no aux_spec
at all would add everything.
This could be implemented by making aux_spec
an optional argument that defaults to None
.
However, there is a problem: auxdata
isn’t optional, so the order of the function parameters has to be reversed:
add_auxiliary(auxname, auxdata)
would become
add_auxiliary(auxdata, auxname=None)
But because the XVGReader currently expects auxname
as the first argument and auxdata
as the second,
this change would break the XVGReader.
I identified four possible solutions for this:
- We change the behaviour of the XVGReader
- We use unpythonic syntax
- We explicitly provide None as an argument to add everything, which is the opposite of what I would expect.
- We make both
auxname
andauxdata
optional and default toNone
, but raise an exception when noauxdata
is provided.
Currently, the PR reflects solution 4 and looks like this:
def add_auxiliary(self,
aux_spec: Union[str, Dict[str, str]] = None,
auxdata: Union[str, AuxReader] = None,
format: str = None,
**kwargs) -> None:
This implementation seemed to me the best option. The XVGReader still works as before because
the order of the parameters is maintained. The "*"
no longer is an option, therefore the code is more pythonic.
And not specifying aux_spec
and therefore passing None
implicitly works as expected, adding all data.
There is a minor downside: Because auxdata
is the second parameter of add_auxiliary
(and cannot be first parameter to maintain XVGReader functionality),
auxdata=aux
has to be passed as argument, rather than just aux
, when no aux_spec
is provided.
To add everything, the method is now used as follows :
u.trajectory.add_auxiliary(auxdata=aux)
Lessons Learned
The code pattern of a Parent class instance passing itself to a Child class instance is super useful!
Saying something is “(almost) done” before review is unwise
Future Goals
With these changes, the most critical issues mentioned in the PR reviews are addressed. Next, I’ll address the remaining comments, which should then move the EDRReader closer to being merged. Thereafter, work on the NumPy array reader will start.