Jekyll2023-01-05T13:00:03+00:00/feed.xmlBjarne’s blogThis blog is a chronicle of my contribution to MDAnalysis as part of the Google Summer of Code 2022.The End of GSoC: How I Added Energy Reading to MDAnalysis2022-09-26T14:00:00+00:002022-09-26T14:00:00+00:00/jekyll/update/2022/09/26/the-end<p>I cannot believe that it’s already the end of GSoC. Over the last few months,
I have expanded the auxiliary data framework of <a href="https://www.mdanalysis.org">MDAnalysis</a>,
in the end mostly focussing on reading and working with the <a href="https://manual.gromacs.org/documentation/2016/user-guide/file-formats.html#energy-files">GROMACS EDR file format</a>.</p>
<h1 id="motivation">Motivation</h1>
<p>In molecular dynamics simulations, users frequently have to inspect energy-like terms such as potential or kinetic energy, temperature, or pressure. This is so common a task that even small inefficiencies add up. Currently, users have to create intermediate files from their MD simulation’s output files to obtain plot-able data, and this quickly becomes cumbersome when multiple terms are to be inspected. Being able to read in the energy output files directly would make this more convenient.</p>
<p>Therefore, I wanted to add readers for energy-type files (output files containing information on potential and kinetic energy, temperature, pressure, and other such terms) from a number of MD engines to the auxiliary module of MDAnalysis in this project. This would make quality control of MD simulations much more convenient, and allow users to analyse the energy data directly from within their scripts or Jupyter notebooks, without switching windows or writing intermediate files.</p>
<p>I wanted the new energy readers to be able to parse energy output files of different MD engines, for example <a href="https://manual.gromacs.org/documentation/2016/user-guide/file-formats.html#energy-files">GROMACS</a>, <a href="https://ambermd.org/doc12/Amber20.pdf">Amber</a>, <a href="https://www.ks.uiuc.edu/Training/Tutorials/namd/namd-tutorial-html/node27.html">NAMD</a> or <a href="http://docs.openmm.org/7.0.0/api-python/generated/simtk.openmm.app.statedatareporter.StateDataReporter.html">OpenMM</a>, and return the data these files contain in a way that is conducive to analysis, for example as NumPy arrays. I further wanted to make use of the <a href="https://userguide.mdanalysis.org/stable/formats/auxiliary.html">auxiliary data framework</a> in MDAnalysis. This would allow the association of energy data with trajectory time steps, enabling further utilisation such as filtering of time steps by certain auxiliary data values.</p>
<p>The ideal outcome of this project was the creation of several energy readers to make analyses easier and more convenient, while at the same time creating more interest in and use cases for the auxiliary readers.</p>
<h1 id="panedr-and-pyedr">Panedr and Pyedr</h1>
<p>In the beginning, I focussed on adapting the <a href="https://github.com/MDAnalysis/panedr">panedr</a> python package for use in MDAnalysis.
Some refactoring was needed so <a href="https://bfedder.github.io/jekyll/update/2022/06/26/panedr.html">pandas would not be introduced as a dependency in MDAnalysis</a>.
This refactoring was done in <a href="https://github.com/MDAnalysis/panedr/pull/33">pull request #33</a> and ultimately led to
a restructuring of panedr and the creation of a second Python package that does not rely on pandas
(PRs <a href="https://github.com/MDAnalysis/panedr/pull/42">#42</a> and <a href="https://github.com/MDAnalysis/panedr/pull/50">#50</a>).
As part of this work, I learned a lot about CI workflows (<a href="https://github.com/MDAnalysis/panedr/pull/32">#32</a>) and package management (<a href="https://github.com/MDAnalysis/panedr/pull/28">#28</a>, <a href="https://github.com/MDAnalysis/panedr/pull/36">#36</a>, <a href="https://github.com/MDAnalysis/panedr/pull/41">#41</a>).</p>
<p>Finally, an additional function was added to also read the units of all energy terms found in the EDR file (<a href="https://github.com/MDAnalysis/panedr/pull/56">PR #56</a>).</p>
<p>As a direct outcome of this GSoC project, the Pyedr python package was created. It is <a href="https://pypi.org/project/panedr/">available
on PyPI</a> and exposes the <code class="language-plaintext highlighter-rouge">edr_to_dict</code> and <code class="language-plaintext highlighter-rouge">get_unit_dictionary</code> functions. The former reads
the whole EDR file and returns the data stored in it as a dictionary mapping the names given by GROMACS to
NumPy arrays that hold the data. The latter reads the unit information from the EDR file and maps GROMACS-given
names to strings of unit names. Pyedr is the basis of the <a href="https://docs.mdanalysis.org/2.4.0-dev0/documentation_pages/auxiliary/EDR.html">EDRReader in MDAnalysis</a>, which I wrote as the next step of my GSoC project.</p>
<h1 id="edrreader">EDRReader</h1>
<p>The <a href="https://github.com/MDAnalysis/mdanalysis/pull/3749">code for the EDRReader</a> is the largest code contribution I have made to date.
While working on the reader, it quickly became apparent that the auxiliary API would need to be changed to accommodate
the large number of terms found in EDR files. While the <a href="https://docs.mdanalysis.org/stable/documentation_pages/auxiliary/XVG.html">XVGReader</a> still works as previously, the new base case for adding auxiliary data assumes a dictionary to be passed. The dictionary maps the name to be used in MDAnalysis to the names read from the EDR file. This is shown in the following minimal working example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">MDAnalysis</span> <span class="k">as</span> <span class="n">mda</span>
<span class="kn">from</span> <span class="nn">MDAnalysisTests.datafiles</span> <span class="kn">import</span> <span class="n">AUX_EDR</span><span class="p">,</span> <span class="n">AUX_EDR_TPR</span><span class="p">,</span> <span class="n">AUX_EDR_XTC</span>
<span class="n">term_dict</span> <span class="o">=</span> <span class="p">{</span><span class="s">"temp"</span><span class="p">:</span> <span class="s">"Temperature"</span><span class="p">,</span> <span class="s">"epot"</span><span class="p">:</span> <span class="s">"Potential"</span><span class="p">}</span>
<span class="n">aux</span> <span class="o">=</span> <span class="n">mda</span><span class="p">.</span><span class="n">auxiliary</span><span class="p">.</span><span class="n">EDR</span><span class="p">.</span><span class="n">EDRReader</span><span class="p">(</span><span class="n">AUX_EDR</span><span class="p">)</span>
<span class="n">u</span> <span class="o">=</span> <span class="n">mda</span><span class="p">.</span><span class="n">Universe</span><span class="p">(</span><span class="n">AUX_EDR_TPR</span><span class="p">,</span> <span class="n">AUX_EDR_XTC</span><span class="p">)</span>
<span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">term_dict</span><span class="p">,</span> <span class="n">aux</span><span class="p">)</span>
</code></pre></div></div>
<p>Aside from this API change, the EDRReader can do everything the XVGReader can. In addition to that, it
has some new functionality.</p>
<ul>
<li>Because EDR files can become reasonably large, a memory warning will be issued when more than a gigabyte of storage is used by the auxiliary data. This default value of 1 GB can be changed by passing a value as <code class="language-plaintext highlighter-rouge">memory_limit</code> when creating the EDRReader object.</li>
<li>EDR files store data of a large number of different quantities, so it is important to know their units as well. The EDRReader therefore has a <code class="language-plaintext highlighter-rouge">unit_dict</code> attribute that contains this information. By default, units found in the EDR file will be converted to <a href="https://docs.mdanalysis.org/stable/documentation_pages/units.html#id68">MDAnalysis base units</a> on reading. This can be disabled by setting <code class="language-plaintext highlighter-rouge">convert_units</code> to False on creation of the reader.</li>
<li>In addition to associating data with trajectories, the EDRReader can also return the NumPy arrays of selected data, which is useful for plotting, for example. This is done via the EDRReader’s <code class="language-plaintext highlighter-rouge">get_data</code> method.</li>
</ul>
<p>I have explained the EDRReader’s functionality in more detail in the <a href="https://userguide.mdanalysis.org/2.4.1/formats/auxiliary.html#edr-files">MDAnalysis User Guide</a>.</p>
<h1 id="challenges-and-ongoing-discussions">Challenges and Ongoing Discussions</h1>
<p>Throughout the implementation, I had to ensure that the EDRReader itself is defined in a sensible, easy to understand way, while also maintaining backwards compatibility and not changing the behaviour of the <a href="https://docs.mdanalysis.org/2.3.0/documentation_pages/auxiliary/XVG.html">XVGReader</a>. I want to highlight one challenge in particular that I had to overcome to meet both of these goals.</p>
<p>Auxiliary data is added to trajectories by calling</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">aux_spec</span><span class="p">,</span> <span class="n">auxdata</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</code></pre></div></div>
<p>As described in <a href="https://github.com/MDAnalysis/mdanalysis/issues/3811">Issue #3811</a>, the expected way to specify that all data found in a file should be added would be to have <code class="language-plaintext highlighter-rouge">None</code> be the default value for <code class="language-plaintext highlighter-rouge">aux_spec</code> and not specify any term to be added. Only the AuxReader instance that holds the data needs to be named. However, making <code class="language-plaintext highlighter-rouge">aux_spec</code> optional causes a problem because <code class="language-plaintext highlighter-rouge">auxdata</code>, the AuxReader instance, is always needed. In Python, it is not possible to define optional function arguments before mandatory ones. Reversing the order in which <code class="language-plaintext highlighter-rouge">aux_spec</code> and <code class="language-plaintext highlighter-rouge">auxdata</code> are defined and expected would solve this problem, but this is a breaking change for the XVGReader. What I have done instead is make both <code class="language-plaintext highlighter-rouge">auxdata</code> and <code class="language-plaintext highlighter-rouge">aux_spec</code> technically optional, defaulting to <code class="language-plaintext highlighter-rouge">None</code>. This way, the XVGReader still works as previously, and users can add all data by not specifying <code class="language-plaintext highlighter-rouge">aux_spec</code> as expected, with the small caveat that they have to type <code class="language-plaintext highlighter-rouge">u.trajectory.add_auxiliary(auxdata=aux)</code> instead of <code class="language-plaintext highlighter-rouge">u.trajectory.add_auxiliary(aux)</code>.</p>
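<p>The compromise described above can be sketched in a few lines. This is a simplified, hypothetical stand-in and not the actual MDAnalysis implementation: the function body, the error message, and the returned tuple are assumptions made for illustration. Only the parameter names <code class="language-plaintext highlighter-rouge">aux_spec</code> and <code class="language-plaintext highlighter-rouge">auxdata</code> come from the text.</p>

```python
def add_auxiliary(aux_spec=None, auxdata=None, **kwargs):
    """Hypothetical sketch: both parameters default to None.

    aux_spec keeps its first position, so existing positional calls
    add_auxiliary(name, data) behave as before.
    """
    if auxdata is None:
        # auxdata is only technically optional; it is always needed.
        raise ValueError("auxdata is required; pass it as auxdata=...")
    if aux_spec is None:
        # No terms selected: interpret this as "add everything".
        return ("all terms", auxdata)
    return (aux_spec, auxdata)


# Old-style positional call keeps working.
print(add_auxiliary("temp", "reader"))
# Adding all data requires the keyword form.
print(add_auxiliary(auxdata="reader"))
```

<p>In this sketch, a single positional argument is taken as <code class="language-plaintext highlighter-rouge">aux_spec</code>, which is why the keyword form is needed when only the reader is passed.</p>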
<p>We have agreed that the order of function arguments will be changed. In line with semantic versioning, DeprecationWarnings of this breaking change will be raised in the next minor release, and in the 3.0.0 major release, the change will take place.</p>
<p>During this work, a few further issues were raised. Two of these sparked discussions that are ongoing.</p>
<p>With the expansion of auxiliary readers, MDAnalysis now encounters a more diverse set of units. This started a discussion on unit handling in general (<a href="https://github.com/MDAnalysis/mdanalysis/issues/3792">Issue #3792</a>). Because of this complication, we are now discussing whether a unit management system like <a href="https://pint.readthedocs.io/en/stable/">pint</a> should be adopted for use in MDAnalysis.</p>
<p>With the introduction of the EDRReader, a new internal data structure for AuxReaders was needed. Where previously auxiliary data was stored in a plain NumPy array, now it was stored in a dictionary of NumPy arrays. This required a change to a method for calculating average values. This method now supports both cases, but if future readers need their own internal data structure as well, this will quickly become unsustainable. Therefore, <a href="https://github.com/MDAnalysis/mdanalysis/issues/3830">Issue #3830</a> contains a discussion of where to define this method - in the base class or in the individual readers.</p>
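<p>To illustrate why supporting both storage layouts in one method does not scale, here is a minimal sketch of such a branching helper. The function name <code class="language-plaintext highlighter-rouge">average</code> is hypothetical, and plain Python lists stand in for NumPy arrays; this is not the MDAnalysis method itself.</p>

```python
from statistics import fmean


def average(data):
    """Hypothetical helper: average auxiliary values per storage layout."""
    if isinstance(data, dict):
        # EDR-style storage: one sequence of values per named term.
        return {term: fmean(values) for term, values in data.items()}
    # XVG-style storage: a single sequence of values.
    return fmean(data)


print(average([1.0, 2.0, 3.0]))
print(average({"Temperature": [299.0, 301.0], "Pressure": [0.0, 2.0]}))
```

<p>Every additional internal data structure would add another branch to a helper like this, which is the maintenance concern behind the discussion.</p>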
<h1 id="numpy-reader">NumPy Reader</h1>
<p>While working on the project, the idea of an auxiliary reader to handle NumPy arrays occurred to me. The plan I had to allow filtering of trajectory frames by auxiliary data values, I realised, lent itself to being applied to the results of several of the analysis methods available through MDAnalysis. One application I pictured, for example, was that of a conformational change observed in MD simulations. Such a change can be monitored by calculating the RMSD of the atom positions, which MDAnalysis returns in the form of a NumPy array. By directly associating the RMSD values with trajectory timesteps, it is possible to select only sufficiently (dis-)similar structures.</p>
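<p>The filtering idea can be sketched without any MDAnalysis machinery. The RMSD values below are made up, the helper name <code class="language-plaintext highlighter-rouge">frames_below</code> is hypothetical, and a plain list stands in for the NumPy array an analysis would return.</p>

```python
def frames_below(values, cutoff):
    """Return the indices of frames whose value lies below the cutoff."""
    return [frame for frame, value in enumerate(values) if cutoff > value]


# Made-up RMSD values (in Angstrom), one per trajectory frame.
rmsd_per_frame = [0.0, 0.8, 1.9, 2.4, 1.1]
similar_frames = frames_below(rmsd_per_frame, cutoff=1.5)
print(similar_frames)  # frames 0, 1 and 4 are below the cutoff
```

<p>Such a list of frame indices could then be used to pick out the matching trajectory frames for further analysis.</p>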
<p>I therefore decided to make a NumPy reader my second priority after finishing the EDRReader, but unfortunately, I was unable to complete it. Work on it has started, however, and can be found in <a href="https://github.com/MDAnalysis/mdanalysis/pull/3853">PR #3853</a>. I will continue work on this after GSoC ends.</p>
<h1 id="outcome">Outcome</h1>
<p><a href="https://pypi.org/project/panedr/">Pyedr</a> and the <a href="https://docs.mdanalysis.org/2.4.0-dev0/documentation_pages/auxiliary/EDR.html">EDRReader</a> are now available to make the life of GROMACS users easier. They allow users to more quickly and conveniently verify that their simulations are properly equilibrated without the writing of intermediate files. The <a href="https://userguide.mdanalysis.org/2.4.1/formats/auxiliary.html#edr-files">MDAnalysis User Guide</a> contains a step-by-step guide of how to use the EDRReader and explains the functionality it has.</p>
<p>I will keep working on MDAnalysis in general and the AuxReaders specifically, with the NumPy reader as my next task.</p>
<h1 id="lessons-learned-during-gsoc">Lessons learned during GSoC</h1>
<p>Participating in the Summer of Code was a great opportunity for me. I learned a lot, from small things like individual code patterns to larger points concerning overall best practices, the value of test-driven development, and package management. Also, because I worked on GSoC part-time alongside my PhD work, my time management was challenged and improved as a result. Overall, both my coding and my confidence in my coding have gotten much better this summer.</p>
<h1 id="conclusion">Conclusion</h1>
<p>I am very glad to have had the opportunity to contribute to MDAnalysis, which I use for my PhD every day. This made for a great synergy between my different projects this summer. I sincerely want to thank my mentors for all their advice, feedback, and support. I look forward to working with you in the future!</p>Making Pyedr Optional2022-09-12T09:00:00+00:002022-09-12T09:00:00+00:00/jekyll/update/2022/09/12/wrapping-up<p>The EDRReader itself is done and approved now. The final thing left to do at this
point is to make pyedr an optional, not a required dependency for MDAnalysis. This is because
the pyedr package is too large in its current state, with large test files included
in the installation.</p>
<h1 id="unit-handling-in-the-edrreader">Unit Handling in the EDRReader</h1>
<p>My <a href="https://github.com/MDAnalysis/panedr/pull/56">change to pyedr</a> which made
units available as an output was merged and released, so I was able to include unit handling
in the EDRReader now.</p>
<p>The reader now has an additional <code class="language-plaintext highlighter-rouge">unit_dict</code> attribute which stores the units of
all terms contained in the <code class="language-plaintext highlighter-rouge">data_dict</code>. On initialisation, this dictionary is
populated with the output from pyedr, and then checked against the base units of
MDAnalysis. Where the units from the file differ from the MDAnalysis base units,
they are converted automatically. For unit types like pressure or surface tension,
where no MDAnalysis base units are defined yet, a warning is issued to the user to point
out the potential discrepancies introduced in the data. It is possible to turn off
automatic unit conversion by setting the <code class="language-plaintext highlighter-rouge">convert_units</code> kwarg to <code class="language-plaintext highlighter-rouge">False</code>.</p>
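<p>The conversion logic described above can be sketched roughly as follows. The lookup tables and the <code class="language-plaintext highlighter-rouge">convert</code> function are hypothetical and only cover a few units for illustration; they are not the MDAnalysis internals. Plain lists stand in for NumPy arrays.</p>

```python
import warnings

# Hypothetical tables for illustration only; not the MDAnalysis internals.
BASE_UNITS = {"length": "A", "energy": "kJ/mol"}
UNIT_TYPES = {"nm": "length", "A": "length", "kJ/mol": "energy", "bar": "pressure"}
FACTORS = {("nm", "A"): 10.0}


def convert(values, unit):
    """Convert values to the base unit of their unit type, if one exists."""
    unit_type = UNIT_TYPES[unit]
    base = BASE_UNITS.get(unit_type)
    if base is None:
        # No base unit defined for this type (e.g. pressure): warn, keep data.
        warnings.warn(f"No base unit defined for {unit_type}; leaving {unit} as is.")
        return values, unit
    if unit == base:
        # Already in the base unit: nothing to do.
        return values, unit
    factor = FACTORS[(unit, base)]
    return [value * factor for value in values], base


print(convert([1.0, 2.5], "nm"))
```

<p>Here, lengths in nm are scaled to Å, energies in kJ/mol pass through unchanged, and a pressure in bar triggers a warning instead of a conversion.</p>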
<p>This was the last big change still requested for the EDRReader. With unit handling
implemented, all the desired functionality is now included.</p>
<h1 id="finalising-the-documentation">Finalising the Documentation</h1>
<p>In the 2 weeks since my last blog post, I went through the code documentation for
the new features and overhauled and restructured it. It now explains how to use the new
EDRReader much better, and reflects the code properly.</p>
<h1 id="making-pyedr-optional">Making Pyedr Optional</h1>
<p>Until the pyedr package is much smaller than it is now (&lt;1 MB compared
to the current 20 MB), it cannot be a required dependency for MDAnalysis. This might be the case
in the future, for example after <a href="https://github.com/MDAnalysis/panedr/pull/55">PR #55</a> is merged. For now, to make
pyedr optional (and make sure the tests run or are skipped as appropriate), I applied
what hmacdope has done in <a href="https://github.com/MDAnalysis/mdanalysis/pull/3765">one of his PRs</a>
for the optional pytng dependency. Basically, the <code class="language-plaintext highlighter-rouge">import pyedr</code> statement is
now part of a try statement, which sets <code class="language-plaintext highlighter-rouge">HAS_PYEDR</code> to the appropriate boolean
depending on whether an <code class="language-plaintext highlighter-rouge">ImportError</code> is raised or not. Tests that rely on pyedr
are skipped if <code class="language-plaintext highlighter-rouge">HAS_PYEDR</code> is False, and users who want to use the EDRReader
are advised to install pyedr themselves.</p>
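<p>A minimal sketch of this optional-import pattern is shown below. The <code class="language-plaintext highlighter-rouge">read_edr</code> wrapper and its error message are hypothetical additions for illustration; only <code class="language-plaintext highlighter-rouge">HAS_PYEDR</code> and <code class="language-plaintext highlighter-rouge">edr_to_dict</code> are names taken from the posts.</p>

```python
try:
    import pyedr
except ImportError:
    HAS_PYEDR = False
else:
    HAS_PYEDR = True


def read_edr(path):
    """Hypothetical wrapper that fails early with a helpful message."""
    if not HAS_PYEDR:
        raise ImportError("Reading EDR files requires the optional pyedr package.")
    return pyedr.edr_to_dict(path)


print("pyedr available:", HAS_PYEDR)
```

<p>In a test suite, the same flag can be used with pytest’s <code class="language-plaintext highlighter-rouge">skipif</code> marker so that tests depending on pyedr are skipped when it is not installed.</p>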
<p>Making pyedr optional also involved changing setup.py, requirements.txt, and the
various CI workflow files.</p>
<h1 id="lessons-learned">Lessons Learned</h1>
<p>When changing the code, it is important to check if the documentation is still
accurate. It is much easier to develop a workflow where the documentation is updated
on the go. I learned this the hard way these past two weeks, when a lot of the documentation
I wrote initially had become obsolete and required a rewrite.</p>
<h1 id="future-goals">Future Goals</h1>
<p>The end of my Summer of Code project is rapidly approaching. My obvious first priority for
now is to get the <a href="https://github.com/MDAnalysis/mdanalysis/pull/3749">EDRReader PR</a> merged.
Afterwards, it will almost be time to work on my final product for the GSoC evaluation, but if there is time
I will at least start work on the NumPy reader. I will continue work on this project in the future,
so expect to see more AuxReader work to be done even after September.</p>Measuring progress2022-08-30T11:00:00+00:002022-08-30T11:00:00+00:00/jekyll/update/2022/08/30/getting-close<p>Once again, progress on the EDRReader has been made over the last 2 weeks.
It was improved thanks to the incorporation of reviewer feedback, and an important
gap to be filled was identified: units.</p>
<h1 id="monitoring-memory-usage">Monitoring Memory Usage</h1>
<p>EDR files can get somewhat large, but the entire file is loaded into memory
for each instance of the EDRReader. This makes monitoring the memory usage useful.
However, there is no straightforward, built-in way in Python to determine the memory footprint of objects,
as <a href="https://towardsdatascience.com/the-strange-size-of-python-objects-in-memory-ce87bdfbb97f">this blog post</a> neatly illustrates.<br />
This is because almost everything in Python is an object, and in addition to its value
it has methods and attributes that also take up memory, so there is some overhead to consider.</p>
<p>In the case of the EDRReader, though, this overhead is negligible compared with the size
of the data stored in the EDR files and read into NumPy arrays. In an approximation of
the memory usage, we can therefore rely on the <a href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.nbytes.html">nbytes attribute</a>
of NumPy arrays.</p>
<p>In this solution, each <code class="language-plaintext highlighter-rouge">AuxReader</code> now needs a <code class="language-plaintext highlighter-rouge">_memory_usage</code> method (and raises <code class="language-plaintext highlighter-rouge">NotImplementedError</code> if it is missing).
This method returns the total size of the NumPy arrays associated with the <code class="language-plaintext highlighter-rouge">AuxReader</code>.
When a new instance of an <code class="language-plaintext highlighter-rouge">AuxReader</code> is associated with a trajectory, the <code class="language-plaintext highlighter-rouge">_memory_usage</code>
method of all present <code class="language-plaintext highlighter-rouge">AuxReader</code>s is called and the collective sum compared to
a warning threshold (which defaults to 1 GB). If the memory usage is higher than the threshold,
a warning is issued to the user. This allows for more transparency and better memory control.</p>
<p>The warning threshold can be set via the <code class="language-plaintext highlighter-rouge">memory_limit</code> kwarg on the creation
of an <code class="language-plaintext highlighter-rouge">AuxReader</code> instance. This is generally useful, but I needed to implement this
to allow the limit to be lowered sufficiently so tests could trigger the warning without
actually taking up an entire gigabyte of memory. The optional <code class="language-plaintext highlighter-rouge">memory_limit</code> kwarg
is stored on the level of the <code class="language-plaintext highlighter-rouge">AuxStep</code>.</p>
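<p>The memory check can be sketched as follows. This is an assumption-laden illustration, not the MDAnalysis code: the class <code class="language-plaintext highlighter-rouge">MemoryAwareReader</code> and the function <code class="language-plaintext highlighter-rouge">check_memory</code> are hypothetical, and <code class="language-plaintext highlighter-rouge">array.array</code> from the standard library stands in for NumPy arrays, whose <code class="language-plaintext highlighter-rouge">nbytes</code> attribute would be used instead.</p>

```python
import warnings
from array import array


class MemoryAwareReader:
    """Hypothetical reader; array.array stands in for NumPy arrays."""

    def __init__(self, data):
        self.data = data  # dict mapping term names to arrays

    def _memory_usage(self):
        # With NumPy arrays, this sum would use the nbytes attribute.
        return sum(arr.itemsize * len(arr) for arr in self.data.values())


def check_memory(readers, memory_limit=1e9):
    """Warn if the readers collectively exceed the limit (in bytes)."""
    total = sum(reader._memory_usage() for reader in readers)
    if total > memory_limit:
        warnings.warn(f"Auxiliary data uses {total} bytes, above the limit of {memory_limit}.")
    return total


reader = MemoryAwareReader({"Temperature": array("d", [300.0] * 100)})
print(check_memory([reader]))
```

<p>Passing a very small <code class="language-plaintext highlighter-rouge">memory_limit</code> triggers the warning with only a few bytes of data, which is the trick that makes the warning testable.</p>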
<h1 id="making-the-edrreader-unit-aware">Making the EDRReader Unit-Aware</h1>
<p>Part of the discussion on <a href="https://github.com/MDAnalysis/mdanalysis/pull/3749">PR #3749</a>
centred on the handling of units. This then sparked a broader discussion on how to handle units of auxiliary data
<a href="https://github.com/MDAnalysis/mdanalysis/issues/3792">in general</a>.
It was pointed out that it is desirable for the
data to be converted to MDAnalysis base units on reading. This is currently the last
major change that is requested for the EDRReader before it looks ready to be merged.
Addressing this is not straightforward, though, because the EDRReader is completely unaware of units
at this stage.</p>
<p>To change this, I had to go back to the level of <a href="https://github.com/MDAnalysis/panedr">panedr</a>.
(Once again learning the lesson that nothing is ever truly done; I was too optimistic
when I opened PR #3749 and said pyedr’s “actual code itself [was] less likely to still change”.)
This is because in the process of reading the EDR file, panedr/pyedr were at some point
aware of the units of each entry, but did not retain this information.</p>
<p>During the execution of pyedr, the <code class="language-plaintext highlighter-rouge">nms</code> variable is populated with the names and units found in the
EDR file by the <code class="language-plaintext highlighter-rouge">edr_strings()</code> function. I have changed the package in <a href="https://github.com/MDAnalysis/panedr/pull/56">PR #56</a>
to also make the unit information available as an output.</p>
<p>Now, users of panedr and pyedr can call the <code class="language-plaintext highlighter-rouge">get_unit_dictionary</code> function to obtain
a dictionary of the units of each energy term found in the EDR file as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyedr</span>
<span class="n">edr_file</span> <span class="o">=</span> <span class="s">"path/to/some/edrfile.edr"</span>
<span class="n">unit_dict</span> <span class="o">=</span> <span class="n">pyedr</span><span class="p">.</span><span class="n">get_unit_dictionary</span><span class="p">(</span><span class="n">edr_file</span><span class="p">)</span>
<span class="n">unit_dict</span><span class="p">[</span><span class="s">"Temperature"</span><span class="p">]</span> <span class="c1"># Returns "K"
</span></code></pre></div></div>
<p>I am hoping that this PR can be merged and the updated package released soon,
so that I can then work on making the units available on the level of the <code class="language-plaintext highlighter-rouge">EDRReader</code>
as well.</p>
<h1 id="lessons-learned">Lessons Learned</h1>
<p>I learned more about how to accommodate <code class="language-plaintext highlighter-rouge">kwargs</code> that are meant to be passed as
optional arguments, and I learned about the <a href="https://pint.readthedocs.io/en/stable/">pint package</a>
for unit handling in Python.</p>
<h1 id="future-goals">Future Goals</h1>
<ul>
<li>Get <a href="https://github.com/MDAnalysis/panedr/pull/56">PR #56</a> merged</li>
<li>Change the EDRReader to store unit information alongside the data themselves</li>
<li>Implement automatic conversion to MDAnalysis base units</li>
<li>Start work on the auxiliary reader for NumPy arrays</li>
</ul>Overhauling the Auxiliary API2022-08-14T11:00:00+00:002022-08-14T11:00:00+00:00/jekyll/update/2022/08/14/aux-api-overhaul<p>My work on the EDRReader has given us some ideas on how the process of attaching
auxiliary data could be changed in general, and that is what I have been working
on this week.</p>
<h2 id="adding-auxiliary-data-take-one">Adding Auxiliary Data, Take One</h2>
<p>As a reminder, this is how adding EDR data worked as of my previous blog post on my work in
<a href="https://github.com/MDAnalysis/mdanalysis/pull/3749">PR #3749</a>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">auxname</span><span class="p">,</span> <span class="n">auxdata</span><span class="p">,</span> <span class="n">auxterm</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</code></pre></div></div>
<ul>
<li>When no <code class="language-plaintext highlighter-rouge">auxterm</code> is provided, <code class="language-plaintext highlighter-rouge">auxname</code> must precisely match one of (or a list of) <code class="language-plaintext highlighter-rouge">aux.terms</code>.</li>
<li>To add a single term, <code class="language-plaintext highlighter-rouge">auxname</code> and optionally <code class="language-plaintext highlighter-rouge">auxterm</code> are provided as strings.</li>
<li>To add many terms, <code class="language-plaintext highlighter-rouge">auxname</code> and optionally <code class="language-plaintext highlighter-rouge">auxterm</code> are provided as lists of strings.</li>
<li>To add all terms found in the file, the auxname <code class="language-plaintext highlighter-rouge">"*"</code> can be provided.</li>
</ul>
<p>This worked, but it wasn’t ideal. A number of problems with this approach existed,
including but not limited to:</p>
<ul>
<li>It was confusing to have <code class="language-plaintext highlighter-rouge">auxname</code> have different requirements depending on whether
or not <code class="language-plaintext highlighter-rouge">auxterm</code> was provided</li>
<li>The fact that strings and lists could be provided for both of these arguments
required a number of checks, adding a lot of code</li>
<li>Using <code class="language-plaintext highlighter-rouge">"*"</code> to signify ‘everything’ is not very pythonic. A more natural way
to add all data found in the file is to simply specify <code class="language-plaintext highlighter-rouge">auxdata</code>. This means
<code class="language-plaintext highlighter-rouge">auxname</code> should be optional and default to <code class="language-plaintext highlighter-rouge">None</code>.</li>
<li><code class="language-plaintext highlighter-rouge">MDAnalysis.coordinates.base.add_auxiliary</code> was made quite complicated by checking
for the type of AuxReader in <code class="language-plaintext highlighter-rouge">auxdata</code> and calling a separate method for attaching
the data for each Reader. This would have been especially problematic on addition of
more and more readers.</li>
</ul>
<h2 id="adding-auxiliary-data-take-two">Adding Auxiliary Data, Take Two</h2>
<h1 id="auxreader-who">AuxReader who?</h1>
<p>One thing I did to improve on this is to move where in the code the auxiliary
data actually is associated with the trajectory. Previously, <code class="language-plaintext highlighter-rouge">MDAnalysis.coordinates.base.add_auxiliary</code>
would add the auxiliary data itself, and I had then changed it so the method would instead
call the appropriate file format specific method.</p>
<p>We have discussed this change, and really, there is no reason why <code class="language-plaintext highlighter-rouge">add_auxiliary()</code>
should need to know about which format it is dealing with, or even which formats
currently are supported. Instead, the code now makes use of a useful code pattern:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Parent</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">upper_method</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">child_obj</span><span class="p">,</span> <span class="p">...):</span>
        <span class="n">child_obj</span><span class="p">.</span><span class="n">lower_method</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="p">...)</span>

<span class="k">class</span> <span class="nc">Child</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">lower_method</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">parent_obj</span><span class="p">):</span>
        <span class="n">parent_obj</span><span class="p">.</span><span class="n">add_attribute</span><span class="p">(...)</span>
</code></pre></div></div>
<p>Applied to MDAnalysis, the <code class="language-plaintext highlighter-rouge">Parent</code> class is the trajectory of the Universe to which
auxiliary data is added, and the <code class="language-plaintext highlighter-rouge">upper_method</code> is <code class="language-plaintext highlighter-rouge">add_auxiliary</code>. The role of the <code class="language-plaintext highlighter-rouge">Child</code> class
is taken by the AuxReader Base class.</p>
<p>What this pattern allows is the manipulation of the <code class="language-plaintext highlighter-rouge">Parent</code> object through the <code class="language-plaintext highlighter-rouge">Child</code>
object: the parent passes itself (<code class="language-plaintext highlighter-rouge">self</code>) to its child in <code class="language-plaintext highlighter-rouge">upper_method</code>, and is thus available to
the <code class="language-plaintext highlighter-rouge">Child</code>.</p>
<p>I have now written an <code class="language-plaintext highlighter-rouge">attach_auxiliary()</code> method of the base AuxReader class, which
receives its parent trajectory as an argument and allows attaching correctly configured
AuxReader instances to this parent. This simplified <code class="language-plaintext highlighter-rouge">add_auxiliary</code> massively. Also, the
auxiliary module is a more natural home for the code that does the actual association than the coordinate module.</p>
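<p>To make the delegation concrete, here is a minimal, runnable sketch of the pattern. The classes <code class="language-plaintext highlighter-rouge">TrajectorySketch</code> and <code class="language-plaintext highlighter-rouge">AuxReaderSketch</code> are hypothetical stand-ins, not the MDAnalysis classes, and the argument names are assumptions made for illustration only.</p>

```python
class AuxReaderSketch:
    """Stand-in for an AuxReader (hypothetical, not the MDAnalysis class)."""

    def __init__(self, data):
        # data maps term names to lists of per-step values
        self.data = data

    def attach_auxiliary(self, coord_parent, aux_spec):
        # The reader receives the trajectory and attaches its data to it.
        for attr_name, term in aux_spec.items():
            coord_parent.aux[attr_name] = self.data[term]


class TrajectorySketch:
    """Stand-in for the trajectory of a Universe (hypothetical)."""

    def __init__(self):
        self.aux = {}

    def add_auxiliary(self, aux_spec, auxdata):
        # No format-specific logic here: delegate to the reader, passing self.
        auxdata.attach_auxiliary(self, aux_spec)


traj = TrajectorySketch()
reader = AuxReaderSketch({"Temperature": [300.1, 299.8],
                          "Potential": [-1.2e5, -1.3e5]})
traj.add_auxiliary({"temp": "Temperature"}, reader)
print(traj.aux)
```

<p>The trajectory never inspects which kind of reader it was given, so supporting a new format only requires a new reader class.</p>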
<h1 id="resolving-the-auxterm-auxname-confusion">Resolving the <code class="language-plaintext highlighter-rouge">auxterm</code>-<code class="language-plaintext highlighter-rouge">auxname</code>-Confusion</h1>
<p>To avoid the confusion caused by the inconsistent requirements for what <code class="language-plaintext highlighter-rouge">auxname</code> should be,
the two parameters were replaced by a single <code class="language-plaintext highlighter-rouge">aux_spec</code> parameter.</p>
<p><code class="language-plaintext highlighter-rouge">aux_spec</code> should be a dictionary that maps the desired attribute names to the precise
names of the data in the auxiliary file (formerly <code class="language-plaintext highlighter-rouge">auxterm</code>). At the same time, this
removed the need to check whether one or many terms are to be added, that is, whether a string or a list
is provided.</p>
<p>The way this works now is as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">MDAnalysis</span> <span class="k">as</span> <span class="n">mda</span>
<span class="kn">from</span> <span class="nn">MDAnalysisTests.datafiles</span> <span class="kn">import</span> <span class="n">AUX_EDR</span><span class="p">,</span> <span class="n">AUX_EDR_TPR</span><span class="p">,</span> <span class="n">AUX_EDR_XTC</span>
<span class="n">term_dict</span> <span class="o">=</span> <span class="p">{</span><span class="s">"temp"</span><span class="p">:</span> <span class="s">"Temperature"</span><span class="p">,</span> <span class="s">"epot"</span><span class="p">:</span> <span class="s">"Potential"</span><span class="p">}</span>
<span class="n">aux</span> <span class="o">=</span> <span class="n">mda</span><span class="p">.</span><span class="n">auxiliary</span><span class="p">.</span><span class="n">EDR</span><span class="p">.</span><span class="n">EDRReader</span><span class="p">(</span><span class="n">AUX_EDR</span><span class="p">)</span>
<span class="n">u</span> <span class="o">=</span> <span class="n">mda</span><span class="p">.</span><span class="n">Universe</span><span class="p">(</span><span class="n">AUX_EDR_TPR</span><span class="p">,</span> <span class="n">AUX_EDR_XTC</span><span class="p">)</span>
<span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">term_dict</span><span class="p">,</span> <span class="n">aux</span><span class="p">)</span>
</code></pre></div></div>
<p>There is one exception to this behaviour: For reasons of backwards compatibility,
it is also possible to provide a string as <code class="language-plaintext highlighter-rouge">aux_spec</code>. In this case, the <code class="language-plaintext highlighter-rouge">attach_auxiliary</code>
method will create a dictionary mapping this string value to <code class="language-plaintext highlighter-rouge">None</code>. This maintains the current default
behaviour of the XVGReader, where passing a string as <code class="language-plaintext highlighter-rouge">auxname</code> causes everything found in the XVG
file to be added under <code class="language-plaintext highlighter-rouge">auxname</code>. The value of each <code class="language-plaintext highlighter-rouge">aux_spec</code> key-value pair is
used as the AuxReader’s <code class="language-plaintext highlighter-rouge">data_selector</code>, with <code class="language-plaintext highlighter-rouge">None</code> causing everything to be added.
This pattern will also generalise well to other file formats. The values of the dictionary could, for
example, also be column indices of a CSV file.</p>
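<p>The string-to-dictionary conversion described above can be sketched roughly as follows. The helper name <code class="language-plaintext highlighter-rouge">normalise_aux_spec</code> is made up for this example and the error handling is an assumption; the actual logic lives inside <code class="language-plaintext highlighter-rouge">attach_auxiliary</code> and may differ in detail.</p>

```python
def normalise_aux_spec(aux_spec):
    """Return a dict mapping attribute names to data selectors.

    Hypothetical helper for illustration only.
    """
    if isinstance(aux_spec, str):
        # Backwards-compatible case: one attribute name, select everything.
        return {aux_spec: None}
    if isinstance(aux_spec, dict):
        # Already in the desired form; copy so the caller's dict is untouched.
        return dict(aux_spec)
    raise TypeError("aux_spec must be a string or a dictionary")


print(normalise_aux_spec("forces"))
print(normalise_aux_spec({"temp": "Temperature", "second_column": 2}))
```

<p>The second call shows why the values need not be strings: a column index works just as well as a selector.</p>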
<h1 id="more-pythonic-syntax">More Pythonic Syntax</h1>
<p>Adding all terms by providing <code class="language-plaintext highlighter-rouge">"*"</code> as an argument is not very pythonic. A better
behaviour would be one where providing no <code class="language-plaintext highlighter-rouge">aux_spec</code> at all would add everything.
This could be implemented by making <code class="language-plaintext highlighter-rouge">aux_spec</code> an optional argument that defaults to <code class="language-plaintext highlighter-rouge">None</code>.
However, there is a problem: <code class="language-plaintext highlighter-rouge">auxdata</code> isn’t optional, so the order of the function parameters has to be reversed:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">auxname</span><span class="p">,</span> <span class="n">auxdata</span><span class="p">)</span>
</code></pre></div></div>
<p>would become</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">auxdata</span><span class="p">,</span> <span class="n">auxname</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
</code></pre></div></div>
<p>But because the XVGReader currently expects <code class="language-plaintext highlighter-rouge">auxname</code> as the first argument and <code class="language-plaintext highlighter-rouge">auxdata</code> as the second,
this change would break the XVGReader.</p>
<p>I identified four possible solutions for this:</p>
<ol>
<li>We change the behaviour of the XVGReader</li>
<li>We use unpythonic syntax</li>
<li>We explicitly provide None as an argument to add everything, which is the opposite of what I would expect.</li>
<li>We make both <code class="language-plaintext highlighter-rouge">auxname</code> and <code class="language-plaintext highlighter-rouge">auxdata</code> optional and default to <code class="language-plaintext highlighter-rouge">None</code>, but raise an exception when no <code class="language-plaintext highlighter-rouge">auxdata</code> is provided.</li>
</ol>
<p>Currently, the <a href="https://github.com/MDAnalysis/mdanalysis/pull/3749">PR</a> reflects solution 4 and looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add_auxiliary</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span>
                  <span class="n">aux_spec</span><span class="p">:</span> <span class="n">Union</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
                  <span class="n">auxdata</span><span class="p">:</span> <span class="n">Union</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">AuxReader</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
                  <span class="nb">format</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
                  <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="o">-></span> <span class="bp">None</span><span class="p">:</span>
</code></pre></div></div>
<p>This implementation seemed to me the best option. The XVGReader still works as before because
the order of the parameters is maintained. The <code class="language-plaintext highlighter-rouge">"*"</code> no longer is an option, therefore the code is more pythonic.
And not specifying <code class="language-plaintext highlighter-rouge">aux_spec</code> and therefore passing <code class="language-plaintext highlighter-rouge">None</code> implicitly works as expected, adding all data.
There is a minor downside: Because <code class="language-plaintext highlighter-rouge">auxdata</code> is the second parameter of <code class="language-plaintext highlighter-rouge">add_auxiliary</code> (and cannot be first parameter to maintain XVGReader functionality),
<code class="language-plaintext highlighter-rouge">auxdata=aux</code> has to be passed as argument, rather than just <code class="language-plaintext highlighter-rouge">aux</code>, when no <code class="language-plaintext highlighter-rouge">aux_spec</code> is provided.
To add everything, the method is now used as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">auxdata</span><span class="o">=</span><span class="n">aux</span><span class="p">)</span>
</code></pre></div></div>
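<p>The consequence of making both parameters optional can be illustrated with a toy function. This is only a sketch: <code class="language-plaintext highlighter-rouge">add_auxiliary_sketch</code> is a hypothetical name, <code class="language-plaintext highlighter-rouge">auxdata</code> is a plain dictionary rather than an AuxReader, and the choice of <code class="language-plaintext highlighter-rouge">ValueError</code> is an assumption.</p>

```python
def add_auxiliary_sketch(aux_spec=None, auxdata=None):
    """Toy version of the argument handling; auxdata is a plain dict here."""
    if auxdata is None:
        # Both parameters default to None, so a missing auxdata has to be
        # caught explicitly. The exception type is an assumption.
        raise ValueError("auxdata is required")
    if aux_spec is None:
        # Nothing specified: add every term under its own name.
        aux_spec = {term: term for term in auxdata}
    return {name: auxdata[term] for name, term in aux_spec.items()}


energies = {"Temperature": [300.0], "Potential": [-1000.0]}
everything = add_auxiliary_sketch(auxdata=energies)
only_temp = add_auxiliary_sketch({"temp": "Temperature"}, energies)
print(everything)
print(only_temp)
```

<p>Positional use keeps the old argument order, while the keyword form covers the "add everything" case.</p>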
<h1 id="lessons-learned">Lessons Learned</h1>
<p>The code pattern of a Parent class instance passing itself to a Child class instance is super useful!</p>
<p>Saying something is <a href="https://bfedder.github.io/jekyll/update/2022/07/31/EDRReader-done.html">“(almost) done”</a> before review is unwise.</p>
<h1 id="future-goals">Future Goals</h1>
<p>With these changes, the most critical issues mentioned in the PR reviews are addressed.
Next, I’ll address the remaining comments, which should then move the EDRReader closer to being merged.
Thereafter, work on the <a href="https://github.com/MDAnalysis/mdanalysis/issues/3750">NumPy array reader</a> will start.</p>
My work on the EDRReader has given us some ideas on how the process of attaching auxiliary data could be changed in general, and that is what I have been working on this week.
The EDRReader is (almost) done
2022-07-31T11:00:00+00:00
/jekyll/update/2022/07/31/EDRReader-done
<p>The last few weeks saw me working on the implementation of an auxiliary reader
for energy data in the EDR format.</p>
<h1 id="panedr-and-pyedr">panedr and pyedr</h1>
<p>Last time, I reported on my work on panedr and panedrlite, two packages with the
same functionality but different dependencies. While this solution worked, it was
not ideal. A discussion on the pros and cons of different improvements <a href="https://github.com/MDAnalysis/panedr/issues/48">can be found here</a>
and is recommended reading for anyone trying similar things with packages and their dependencies.
In the end, <a href="https://github.com/IAlibay">IAlibay</a> implemented the generally preferred solution
in <a href="https://github.com/MDAnalysis/panedr/pull/50">PR #50</a> and created the panedr
and pyedr packages. A huge Thank You for all the work!</p>
<h1 id="problem-statement-edrreader">Problem Statement: EDRReader</h1>
<p>The light-weight pyedr package now exists and is ready to be used as a dependency
in MDAnalysis. Finally, the time had come for work on the new AuxReaders to start
in earnest.</p>
<p>The new EDRReader needs to do a number of things:</p>
<ul>
<li>Use pyedr to read EDR files and obtain their contents as a dictionary of NumPy arrays</li>
<li>Make that data accessible via the Reader itself</li>
<li>Allow association of the data with trajectories
<ul>
<li>The addition of single terms, multiple terms, and all terms should be possible</li>
</ul>
</li>
</ul>
<p>The base classes in the auxiliary module already take care of a lot of the work here.
Importantly, I did not have to worry much about the association with the correct trajectory frames
itself, thanks to <a href="https://github.com/fiona-naughton">fiona-naughton’s</a> work which she explained <a href="https://fiona-naughton.github.io/blog/2016/06/04/Auxiliary-power-then-full-steam-ahead">in her blog</a>.
The main work lay in adapting these structures to handle the dictionary data type,
and to allow the specification of multiple terms at once. Work on the former point
mostly was focussed within the <a href="https://github.com/MDAnalysis/mdanalysis/blob/ca39385b472ba7ecf9b2e99cc3ddac1800db7590/package/MDAnalysis/auxiliary/EDR.py">EDRReader</a> itself,
while the latter point required modifying the <code class="language-plaintext highlighter-rouge">add_auxiliary</code> method in the
<a href="https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/coordinates/base.py">coordinate-base</a> module.</p>
<h1 id="implementation-details">Implementation Details</h1>
<p>This work is found in <a href="https://github.com/MDAnalysis/mdanalysis/pull/3749">PR #3749</a>.</p>
<p>Previously, the assumption was that the files to be read by AuxReaders would contain
time-value pairs. <code class="language-plaintext highlighter-rouge">add_auxiliary</code> would be called as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">auxname</span><span class="p">,</span> <span class="n">auxdata</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</code></pre></div></div>
<p>The data found in <code class="language-plaintext highlighter-rouge">auxdata</code> (either the file itself or an instance of an appropriate AuxReader)
would be added to appropriate time steps with the attribute name <code class="language-plaintext highlighter-rouge">auxname</code>. This attribute can be
accessed through</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">ts</span><span class="p">.</span><span class="n">aux</span><span class="p">.</span><span class="n">auxname</span>
<span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">ts</span><span class="p">.</span><span class="n">aux</span><span class="p">[</span><span class="n">auxname</span><span class="p">]</span>
</code></pre></div></div>
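<p>Supporting both access styles only needs a small container. The class below is a sketch of the idea under the assumption that a dictionary with attribute lookup is sufficient; <code class="language-plaintext highlighter-rouge">AuxNamespace</code> is a made-up name and not necessarily how MDAnalysis implements <code class="language-plaintext highlighter-rouge">ts.aux</code>.</p>

```python
class AuxNamespace(dict):
    """Dictionary whose keys can also be read as attributes (illustrative)."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


aux = AuxNamespace()
aux["temp"] = 300.0
print(aux.temp, aux["temp"])
```

<p>Key access remains necessary for names that are not valid Python identifiers, such as energy terms containing spaces or dots.</p>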
<p>This does not work for EDR files. The problem is that with a larger number of terms,
adding everything under the same <code class="language-plaintext highlighter-rouge">auxname</code> is not convenient. Therefore, I modified the method
as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">auxname</span><span class="p">,</span> <span class="n">auxdata</span><span class="p">,</span> <span class="n">auxterm</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</code></pre></div></div>
<p>Users can now specify which of the energy terms to add to the trajectory under <code class="language-plaintext highlighter-rouge">auxname</code>
by providing an <code class="language-plaintext highlighter-rouge">auxterm</code> argument to <code class="language-plaintext highlighter-rouge">add_auxiliary</code>. I had to make it an optional argument
to preserve the XVGReader’s functionality.
A list of possible selections for <code class="language-plaintext highlighter-rouge">auxterm</code> can be obtained from the EDRReader as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aux</span> <span class="o">=</span> <span class="n">MDAnalysis</span><span class="p">.</span><span class="n">auxiliary</span><span class="p">.</span><span class="n">EDR</span><span class="p">.</span><span class="n">EDRReader</span><span class="p">(</span><span class="s">"some_edr_file.edr"</span><span class="p">)</span>
<span class="n">aux</span><span class="p">.</span><span class="n">terms</span>
</code></pre></div></div>
<p>This method now works as follows:</p>
<ul>
<li>When no <code class="language-plaintext highlighter-rouge">auxterm</code> is provided, each entry of <code class="language-plaintext highlighter-rouge">auxname</code> (a single string or a list of strings) must precisely match one of <code class="language-plaintext highlighter-rouge">aux.terms</code>.</li>
<li>To add a single term, <code class="language-plaintext highlighter-rouge">auxname</code> and optionally <code class="language-plaintext highlighter-rouge">auxterm</code> are provided as strings.</li>
<li>To add many terms, <code class="language-plaintext highlighter-rouge">auxname</code> and optionally <code class="language-plaintext highlighter-rouge">auxterm</code> are provided as lists of strings.</li>
<li>To add all terms found in the file, the <code class="language-plaintext highlighter-rouge">auxname</code> <code class="language-plaintext highlighter-rouge">"*"</code> can be provided.</li>
</ul>
<p>Examples:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aux</span> <span class="o">=</span> <span class="n">MDAnalysis</span><span class="p">.</span><span class="n">auxiliary</span><span class="p">.</span><span class="n">EDR</span><span class="p">.</span><span class="n">EDRReader</span><span class="p">(</span><span class="s">"some_edr_file.edr"</span><span class="p">)</span>
<span class="c1"># Add the Temperature term to timesteps as a "temp" attribute
</span><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="s">"temp"</span><span class="p">,</span> <span class="n">aux</span><span class="p">,</span> <span class="s">"Temperature"</span><span class="p">)</span>
<span class="c1"># no auxterm provided, auxname must match one of aux.terms precisely.
# Adds the data as the "Temperature" attribute.
</span><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="s">"Temperature"</span><span class="p">,</span> <span class="n">aux</span><span class="p">)</span>
<span class="c1"># Adding multiple terms at once (with and without auxterm)
</span><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">([</span><span class="s">"bond"</span><span class="p">,</span> <span class="s">"temp"</span><span class="p">],</span> <span class="n">aux</span><span class="p">,</span> <span class="p">[</span><span class="s">"Bond"</span><span class="p">,</span> <span class="s">"Temperature"</span><span class="p">])</span>
<span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">([</span><span class="s">"Bond"</span><span class="p">,</span> <span class="s">"Temperature"</span><span class="p">],</span> <span class="n">aux</span><span class="p">)</span>
<span class="c1"># Adding all data that's in the file
</span><span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="s">"*"</span><span class="p">,</span> <span class="n">aux</span><span class="p">)</span>
</code></pre></div></div>
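<p>The four rules listed before the examples can be condensed into one small function. The following is a sketch only: <code class="language-plaintext highlighter-rouge">resolve_terms</code> is a hypothetical helper, and the exceptions it raises are assumptions rather than the behaviour of the PR.</p>

```python
def resolve_terms(auxname, terms, auxterm=None):
    """Map attribute names to EDR terms (hypothetical helper)."""
    if auxname == "*":
        # Add everything under the names used in the file.
        return {term: term for term in terms}
    names = [auxname] if isinstance(auxname, str) else list(auxname)
    if auxterm is None:
        # Without auxterm, the attribute names double as term names.
        selected = names
    else:
        selected = [auxterm] if isinstance(auxterm, str) else list(auxterm)
    if len(names) != len(selected):
        raise ValueError("auxname and auxterm must have the same length")
    unknown = [term for term in selected if term not in terms]
    if unknown:
        raise KeyError(f"terms not found in file: {unknown}")
    return dict(zip(names, selected))


terms = ["Bond", "Potential", "Temperature"]
print(resolve_terms("temp", terms, "Temperature"))
print(resolve_terms(["Bond", "Temperature"], terms))
print(resolve_terms("*", terms))
```

<p>Writing the rules out this way also shows how much branching two loosely coupled parameters require.</p>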
<p>Under the hood, a separate instance of an EDRReader is created for each term
that is added to a trajectory. This is necessary to allow updating all auxiliary data
when iterating over the trajectory or otherwise changing the time step. Before I did this,
when an instance of an EDRReader was passed to <code class="language-plaintext highlighter-rouge">add_auxiliary</code> multiple times,
only the last energy term to be added would be updated when the time step changes.</p>
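<p>The problem with sharing one reader can be reproduced with a toy class. This sketch assumes that a reader tracks a single selector and a single current value; <code class="language-plaintext highlighter-rouge">ToyReader</code> is hypothetical, and using <code class="language-plaintext highlighter-rouge">copy.copy</code> is just one way to obtain separate instances.</p>

```python
import copy


class ToyReader:
    """Minimal reader holding one selector and one current value."""

    def __init__(self, data):
        self.data = data            # term name -> list of values per step
        self.data_selector = None   # the term this instance tracks
        self.current = None

    def update(self, step):
        self.current = self.data[self.data_selector][step]


data = {"Temperature": [300.0, 301.0], "Potential": [-10.0, -11.0]}

# Sharing one instance: the second selector overwrites the first.
shared = ToyReader(data)
attached_shared = {}
for name in data:
    shared.data_selector = name
    attached_shared[name] = shared

# One instance per term: every attribute keeps its own selector.
attached_copies = {}
for name in data:
    reader = copy.copy(shared)  # shallow copy, the underlying data is shared
    reader.data_selector = name
    attached_copies[name] = reader

for reader in attached_copies.values():
    reader.update(1)

print(attached_shared["Temperature"].data_selector)
print({name: r.current for name, r in attached_copies.items()})
```

<p>With the shared instance, both attributes end up pointing at the term that was added last, which matches the behaviour I observed before the change.</p>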
<p>The EDRReader now is fully functional and allows the association of EDR data with the
appropriate time steps.</p>
<h1 id="testing">Testing</h1>
<p>I have adapted the tests for the auxiliary module for testing the functionality of
the EDRReader and also added further tests for handling more than one term at a time.
There was a small challenge here: Since EDR is a binary file format, I could not generate
a <a href="https://github.com/MDAnalysis/mdanalysis/blob/86efe83d77209ef7c6c555a2370cd07589b64a50/testsuite/MDAnalysisTests/auxiliary/base.py#L36">basic test file</a> as used for testing the XVGReader. I therefore had to create a small EDR file from a real simulation,
and adapt the tests appropriately. The most significant changes were that the timestep was off,
and that I only added one of the many energy terms for most of the tests.</p>
<h1 id="lessons-learned">Lessons Learned</h1>
<p>I was very glad to properly sink my teeth into the code these past few weeks.
This is my largest contribution yet, and I have learned a lot about working with
other people’s code and building on existing functionality.</p>
<h1 id="future-goals">Future Goals</h1>
<p>The next steps in this project will be:</p>
<ul>
<li>Adding type hints to the new modules/functions</li>
<li>Implementing a method to obtain the energy data without associating with a trajectory</li>
<li>Describing how to slice trajectories based on auxiliary data</li>
<li>Writing the documentation</li>
</ul>
<p>Once this is taken care of, I will then move on to an AuxReader for <a href="https://github.com/MDAnalysis/mdanalysis/issues/3750">NumPy arrays</a>.</p>
Revenge of the Package
2022-07-09T11:00:00+00:00
/jekyll/update/2022/07/09/more-package-headache
<p>In my last post, I wrote about the work I had done on panedr to make it
importable in MDAnalysis, and the challenges that remained to be addressed.
Over the last two weeks, I have made enough progress on the refactoring and
repackaging of panedr that I am now able to start work on the EDRReader in
earnest.</p>
<h1 id="refactoring-of-panedr">Refactoring of panedr</h1>
<p>In my last blog post, I described how panedr works under the hood by
progressively reading bytes from the binary EDR file and sorting the information
in appropriate Python data structures, ultimately returning a pandas DataFrame.
I am happy to say that my refactoring of this process in <a href="https://github.com/MDAnalysis/panedr/pull/33">PR #33</a>
is now merged. In addition to a DataFrame, returning the energy data as a dictionary
of NumPy arrays is now supported through <code class="language-plaintext highlighter-rouge">edr_to_dict()</code>. This function does not
require pandas, making the package optional for this use case.</p>
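<p>One common way to keep a dependency optional is to guard the import and only fail when the pandas-specific function is actually called. The sketch below illustrates that idea with made-up functions <code class="language-plaintext highlighter-rouge">to_dict</code> and <code class="language-plaintext highlighter-rouge">to_df</code> and plain lists; it is not the panedr source, whose <code class="language-plaintext highlighter-rouge">edr_to_dict()</code> returns NumPy arrays.</p>

```python
try:
    import pandas as pd
except ImportError:
    # pandas is optional; remember that it is missing.
    pd = None


def to_dict(columns, rows):
    """Pure-Python path that needs no third-party packages."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def to_df(columns, rows):
    """pandas path, only usable when pandas is installed."""
    if pd is None:
        raise ImportError("pandas is required for to_df(); use to_dict() instead")
    return pd.DataFrame(to_dict(columns, rows))


energies = to_dict(["Time", "Temperature"], [(0.0, 300.0), (1.0, 301.5)])
print(energies)
```

<p>The dictionary path works in either environment, which is the property MDAnalysis needs.</p>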
<h1 id="repackaging-panedr-for-dependency-management">Repackaging panedr for dependency management</h1>
<p>In order to actually allow the module to be used without pandas installed, the
package structure had to be changed a bit. I worked on this in <a href="https://github.com/MDAnalysis/panedr/pull/42">PR #42</a>.
The boundary conditions to be met were the following:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">pip install panedr</code> should work as it does now, installing all dependencies
for all functions</li>
<li>Installing the package without pandas needs to be possible</li>
<li>The code should be in one location and not duplicated</li>
</ul>
<p>The way I have addressed these conditions is by creating two packages that sit
in parallel in the <a href="https://github.com/MDAnalysis/panedr">panedr repository</a>: panedr
and panedrlite. All the code has moved from panedr to panedrlite, but its <code class="language-plaintext highlighter-rouge">setup.cfg</code>
was modified to make pandas an extras_require. Installing panedrlite thus installs
the package and provides most of the functionality out of the box, but does not
install pandas by default. panedr, on the other hand, is an empty metapackage. It
bundles panedrlite and pandas in its dependencies, allowing full functionality on
installation.</p>
<p>However, this is unfortunately not yet the final state for panedr packaging, as my
solution comes with a significant, difficult-to-fix problem: installing panedrlite
exposes <code class="language-plaintext highlighter-rouge">import panedrlite</code>, not <code class="language-plaintext highlighter-rouge">import panedr</code>. One consequence
is that panedrlite would be installed for users who want to use the functionality
even if they already have panedr installed. There are a <a href="https://github.com/MDAnalysis/panedr/issues/48">number of possible solutions</a>
to this problem, but they each have drawbacks, and the discussion of which solution is best
is still ongoing.</p>
<h1 id="codecov-growing-pains">Codecov growing pains</h1>
<p>When I moved the code from panedr to panedrlite as part of <a href="https://github.com/MDAnalysis/panedr/pull/42">PR #42</a>,
codecov started reporting 0 % coverage. This is weird, because the unit tests
definitely still run and pass (as shown by CI). Initially, we thought the issue might
solve itself on merging the PR, but this was unfortunately not the case, and at the time
of writing, the repository proudly reports 0 % coverage. In the end, IAlibay found a <a href="https://github.com/MDAnalysis/panedr/pull/47">solution</a>
for the problem, but I have to admit that why codecov stopped working
and how these changes fix it are beyond me.</p>
<h1 id="work-on-the-edrreader-has-started">Work on the EDRReader has started</h1>
<p>While some changes are yet to come to panedr, the refactoring is now implemented
and unlikely to change in the near future. As such, I was now finally able to start
work on an <a href="https://github.com/MDAnalysis/mdanalysis/pull/3749">EDRReader implementation</a>.
This is very much a work in progress, but this skeleton now already allows EDR
files to be read in MDAnalysis, which I am happy about.</p>
<h1 id="lessons-learned">Lessons Learned</h1>
<p>Now four weeks into my GSoC project, some of the things I learned include:</p>
<ul>
<li>there is <em>always</em> more to be learned about packaging</li>
<li>CI is dark magic and CI experts are wizards</li>
<li>Having worked on the EDRReader, I have now understood Python’s <code class="language-plaintext highlighter-rouge">super()</code> function better</li>
</ul>
<h1 id="future-goals">Future Goals</h1>
<p>The next part of my project will now focus on working on the EDRReader in MDAnalysis.
Because panedr now returns data as a dictionary of numpy arrays, I am also hoping I
might be able to apply some of my work on the EDRReader to a <a href="https://github.com/MDAnalysis/mdanalysis/issues/3750">NumPy AuxReader</a> as well.</p>
First Things First
2022-06-26T11:00:00+00:00
/jekyll/update/2022/06/26/panedr
<p>As I mentioned in my last blog post, the first step in my GSoC project is a minor
rewrite of <a href="https://github.com/MDAnalysis/panedr">panedr</a> to make pandas an
optional dependency, as the number of mandatory dependencies of MDAnalysis should
be kept as small as possible. In today’s blog post, I’ll report on my progress
in this endeavour, and mention a number of the things I have learned in the process.</p>
<h1 id="automated-testing-with-github-actions">Automated Testing with GitHub Actions</h1>
<p>Before starting work on the actual code, we wanted to change the continuous integration
workflow. In particular, we wanted to make the switch from Travis CI to GitHub Actions.
On my part, this involved creating a workflow file at panedr/.github/workflows/gh-ci.yaml
as part of <a href="https://github.com/MDAnalysis/panedr/pull/32">PR #32</a>. I am happy to report
that this worked out of the gate, and every commit on pull requests is now tested in
GH Actions. Also, badges on the repository’s front page now inform the user of the passing
tests and a respectable 82 % test coverage, with most misses happening because
only the most recent file format (used for over 15 years at this point) is currently tested.</p>
<h1 id="managing-dependencies">Managing Dependencies</h1>
<p>The goal of this work is to have the functionality of <code class="language-plaintext highlighter-rouge">panedr</code> available to use without
the requirement of installing pandas. Initially, I changed the package in such a way
that pandas was an optional dependency that could be installed along panedr like so:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> pip <span class="nb">install </span>panedr[pandas]
</code></pre></div></div>
<p>Omitting the <code class="language-plaintext highlighter-rouge">[pandas]</code> would install panedr without it. However, as
discussed in more detail in <a href="https://github.com/MDAnalysis/panedr/issues/34">#34</a>,
this would be a breaking change to the way panedr currently works. Instead, my current
goal is to have two packages in the repository, <code class="language-plaintext highlighter-rouge">panedrlite</code> and <code class="language-plaintext highlighter-rouge">panedr</code>. All of the
code will be part of <code class="language-plaintext highlighter-rouge">panedrlite</code>. <code class="language-plaintext highlighter-rouge">panedr</code> will be a metapackage, with <code class="language-plaintext highlighter-rouge">panedrlite[pandas]</code>
as one of its dependencies. This way, users can still install and import panedr just
as they can now, and MDAnalysis can include panedrlite without pandas.
The dependencies of panedr are managed in <code class="language-plaintext highlighter-rouge">requirements.txt</code> and <code class="language-plaintext highlighter-rouge">setup.cfg</code>, and these files
have been changed / will be changed according to these new specifications. One addition
to the requirements is NumPy, which is now installed with the same requirements that
MDAnalysis specifies.</p>
<h1 id="refactoring-the-panedr-code">Refactoring the <code class="language-plaintext highlighter-rouge">panedr</code> Code</h1>
<p>The other steps described in this blog post are more (learning about) package
management than coding. This is the Summer of Code, however, so now I’ll talk about
the changes to the panedr code that I have made thus far.</p>
<p>GROMACS writes EDR files according to the <code class="language-plaintext highlighter-rouge">eXternal Data Representation (XDR)</code>
protocol. Under the hood, panedr uses the <code class="language-plaintext highlighter-rouge">xdrlib</code> module from the Python standard library to decode the
binary files.</p>
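<p>To give an idea of what decoding XDR data involves, here is a minimal sketch that uses the standard library's <code class="language-plaintext highlighter-rouge">struct</code> module instead of xdrlib. XDR stores values big-endian in units of four bytes, and strings are prefixed with their length and padded to a multiple of four. The record layout below (an integer, a string, a double) and the helper names are made up for illustration; this is not the real EDR header and not panedr's code.</p>

```python
import struct


def unpack_int(buf, offset):
    """Read a 4-byte big-endian signed integer, return (value, new offset)."""
    (value,) = struct.unpack_from(">i", buf, offset)
    return value, offset + 4


def unpack_double(buf, offset):
    """Read an 8-byte big-endian IEEE double, return (value, new offset)."""
    (value,) = struct.unpack_from(">d", buf, offset)
    return value, offset + 8


def unpack_string(buf, offset):
    """Read a length-prefixed string that is padded to a multiple of 4 bytes."""
    length, offset = unpack_int(buf, offset)
    raw = buf[offset:offset + length]
    padded = (length + 3) // 4 * 4
    return raw.decode("ascii"), offset + padded


# Hypothetical record: a version number, a term name, and one energy value.
payload = (
    struct.pack(">i", 5)
    + struct.pack(">i", 9) + b"Potential" + b"\x00" * 3
    + struct.pack(">d", -1234.5)
)

version, pos = unpack_int(payload, 0)
name, pos = unpack_string(payload, pos)
value, pos = unpack_double(payload, pos)
```

<p>Each helper returns the new read position alongside the decoded value, which mirrors how an unpacker walks through a binary file from start to end.</p>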
<p>When the user calls the <code class="language-plaintext highlighter-rouge">edr_to_df()</code> function, an EDRFile object is created.
During object instantiation, the EDR file passed to <code class="language-plaintext highlighter-rouge">edr_to_df</code> is read and the
binary content passed to an instance of <code class="language-plaintext highlighter-rouge">GMX_Unpacker</code>, which inherits from xdrlib’s Unpacker.
Next, the first few bytes of the binary file are read. They contain information
on the file version, precision (single or double), and which energy terms are present
in the file. After initialisation, the file is read iteratively, and the energy
values of each frame are stored in a nested list.</p>
<p>Once the last byte is read, the iteration stops. <code class="language-plaintext highlighter-rouge">edr_to_df</code> then returns the nested
list as a pandas dataframe.</p>
<p>Luckily, this setup lends itself well to refactoring. Pandas is only used in the very
last step, after the file has been read in its entirety. In <a href="https://github.com/MDAnalysis/panedr/pull/33">PR #33</a>,
I’ve been working on just this.</p>
<p>To make panedr functional without pandas, I moved the parsing of the file to a new
function <code class="language-plaintext highlighter-rouge">read_edr()</code>. This function returns the nested lists that contain the energy values
for each frame.
<code class="language-plaintext highlighter-rouge">edr_to_df()</code> then just has to call <code class="language-plaintext highlighter-rouge">read_edr()</code> and assemble its return values
into the dataframe. The function can also check whether pandas is installed in a try-except statement.</p>
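<p>The split between parsing and dataframe assembly can be sketched roughly as follows. The function names and the hard-coded values are hypothetical stand-ins, not the actual panedr code; the point is only that pandas is imported inside the function that needs it, so everything else keeps working without it.</p>

```python
def read_energy_rows():
    """Hypothetical stand-in for the parser: term names plus per-frame rows."""
    names = ["Time", "Potential"]
    rows = [[0.0, -100.0], [1.0, -101.5]]
    return names, rows


def rows_to_dataframe(names, rows):
    """Build a dataframe, importing pandas only when it is actually needed."""
    try:
        import pandas as pd
    except ImportError as err:
        raise ImportError("pandas is required to build a dataframe") from err
    return pd.DataFrame(rows, columns=names)


# Parsing works with or without pandas installed.
names, rows = read_energy_rows()
```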
<p>Another new function contains the functionality that I want to use for my auxiliary
reader in MDAnalysis: <code class="language-plaintext highlighter-rouge">edr_to_dict()</code> takes the nested lists that <code class="language-plaintext highlighter-rouge">read_edr()</code>
returns and assembles them into a dictionary of NumPy arrays. This data structure
will be straightforward to work with further downstream.</p>
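<p>As a rough illustration of that data structure, the following sketch turns per-frame rows into a dictionary with one NumPy array per energy term. The helper name <code class="language-plaintext highlighter-rouge">rows_to_dict</code> and the numbers are invented for this example and are not taken from panedr.</p>

```python
import numpy as np


def rows_to_dict(names, rows):
    """Turn per-frame rows into {term name: 1-D array over all frames}."""
    data = np.asarray(rows, dtype=float)  # shape (n_frames, n_terms)
    return {name: data[:, i] for i, name in enumerate(names)}


# Made-up values for two frames and three terms.
names = ["Time", "Potential", "Temperature"]
rows = [[0.0, -100.0, 300.0], [1.0, -101.5, 299.2]]
energies = rows_to_dict(names, rows)
```

<p>Each term can then be looked up by name, for example <code class="language-plaintext highlighter-rouge">energies["Potential"]</code>, and plotted against <code class="language-plaintext highlighter-rouge">energies["Time"]</code>.</p>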
<p>I have also written a test for <code class="language-plaintext highlighter-rouge">edr_to_dict()</code>. The test compares
the dictionary to the dataframe, and because the two hold the same values, we can
be confident that the function is indeed working as intended.</p>
<h1 id="lessons-learned">Lessons Learned</h1>
<p>In the last two weeks, I have learned quite a lot, including:</p>
<ul>
<li>more details on what actually makes a Python package</li>
<li>how to manage dependencies of packages</li>
<li>how to make packages installable</li>
<li>CI workflows in GitHub Actions</li>
<li>what the XDR protocol is and how it is read</li>
</ul>
<p>Also: I already knew this, of course, but I was once again reminded that tasks often
take much longer than I anticipate before starting out.</p>
<p>While improving my understanding of Python packaging was and is certainly important, I am now
looking forward to wrapping up this part of the project and moving on to the implementation
in MDAnalysis.</p>
<h1 id="future-goals">Future Goals</h1>
<p>The immediate next steps still concern the panedr package:</p>
<ul>
<li>Set up the panedr metapackage and panedrlite, which holds the code</li>
<li>Get <a href="https://github.com/MDAnalysis/panedr/pull/33">PR #33</a> merged to get panedrlite
ready for inclusion in MDAnalysis</li>
<li>Write documentation for the new functions</li>
</ul>
<p>I am hoping to finish this work over the next week, so I can then use panedrlite
in MDAnalysis and get started on the EDRReader in earnest.</p>
<p>New AuxReaders! But why? (2022-06-10)</p>
<h1 id="introduction">Introduction</h1>
<p>Has this ever happened to you? You run an MD simulation with your favourite MD
engine, and you want to make sure the system actually has equilibrated properly
as part of the analysis. In order to inspect the related energy terms, you have
a few hoops to jump through. Depending on your engine of choice you might have to
use a program to extract terms from a binary file, or you might have a plain text
file in some format or another to deal with. Either way, you will likely have
to write the terms you are interested in to a new file before you can actually work
with them, and you’ll have to alt-tab out of your Jupyter Notebook / IDE / what have you.
This may only be mildly annoying, but because it is such a frequent occurrence,
even small inefficiencies add up.</p>
<p>In my <a href="https://summerofcode.withgoogle.com/proposals/details/iYc3bcfl">GSoC project</a>,
I will write readers for such energy files and implement them
as part of <a href="https://www.mdanalysis.org">MDAnalysis’</a> framework for handling
<a href="https://userguide.mdanalysis.org/stable/formats/auxiliary.html">auxiliary data.</a>
This framework was developed to allow the association of non-trajectory timeseries data
with the frames of a trajectory. One of the key features of the AuxReaders is their
ability to associate auxiliary data with trajectory timesteps when the trajectory data
and the auxiliary data are written at different frequencies. If the auxiliary data is
saved, say, half or twice as often, then the AuxReader intelligently assigns it to the
closest trajectory frame. Currently, the XVG format that GROMACS
uses for certain output, for example timeseries data of reaction coordinates in
umbrella sampling or steered MD simulations, is supported.</p>
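<p>The idea of assigning auxiliary steps to the closest trajectory frame can be illustrated with a small sketch. This is a simplified model under my own assumptions (evenly spaced frames starting at time zero, ties going to the later frame) with hypothetical function names; it is not how the AuxReader base class is actually implemented.</p>

```python
import math


def closest_frame(aux_time, traj_dt, n_frames):
    """Index of the trajectory frame closest in time to an auxiliary step."""
    idx = math.floor(aux_time / traj_dt + 0.5)  # round half up
    return min(max(idx, 0), n_frames - 1)       # clamp to valid frames


def assign_to_frames(aux_times, aux_values, traj_dt, n_frames):
    """Collect, for every frame, the auxiliary values assigned to it."""
    frames = [[] for _ in range(n_frames)]
    for time, value in zip(aux_times, aux_values):
        frames[closest_frame(time, traj_dt, n_frames)].append(value)
    return frames


# Trajectory frames at 0, 2, 4 ps; auxiliary data saved twice as often.
aux_times = [0.0, 1.0, 2.0, 3.0, 4.0]
aux_values = [10, 11, 12, 13, 14]
per_frame = assign_to_frames(aux_times, aux_values, traj_dt=2.0, n_frames=3)
```

<p>When the auxiliary data is written more often than the trajectory, several values end up on the same frame, and a reader then has to decide which one represents that frame.</p>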
<p>The new energy readers I will work on this summer will make use of this
framework. Thus, they will not only make inspecting and evaluating energy terms
more convenient, but will also open the door to new ways of utilising energy
data in analysis.</p>
<h1 id="concept">Concept</h1>
<p>The following bits of pseudocode highlight how a new AuxReader for GROMACS EDR files
would work.</p>
<p>Data would be read from files in the same way that the current XVGReader works.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aux</span> <span class="o">=</span> <span class="n">MDAnalysis</span><span class="p">.</span><span class="n">auxiliary</span><span class="p">.</span><span class="n">EDR</span><span class="p">.</span><span class="n">EDRReader</span><span class="p">(</span><span class="s">"ener.edr"</span><span class="p">)</span>
</code></pre></div></div>
<p>EDR files contain many energy terms, though, where XVG files usually have one data column.
The EDRReader would therefore need to know which data is saved in the energy file.
This information is stored in an attribute.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">aux</span><span class="p">.</span><span class="n">terms</span>
<span class="c1"># Contains a list of all energy terms found in the file
</span></code></pre></div></div>
<p>Because many terms are present in the files, the method for adding auxdata
to trajectories has to be changed slightly. In addition to providing a name for the
aux attribute, users also specify which of the energy terms to add.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u</span> <span class="o">=</span> <span class="n">mda</span><span class="p">.</span><span class="n">Universe</span><span class="p">(</span><span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">)</span>
<span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">auxname</span><span class="o">=</span><span class="s">"epot"</span><span class="p">,</span> <span class="n">auxterm</span><span class="o">=</span><span class="s">"Potential"</span><span class="p">,</span> <span class="n">auxdata</span><span class="o">=</span><span class="n">aux</span><span class="p">)</span>
</code></pre></div></div>
<p>It will also be possible to add multiple or even all terms found in the file at
the same time.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">auxname</span><span class="o">=</span><span class="p">[</span><span class="s">"epot"</span><span class="p">,</span> <span class="s">"temp"</span><span class="p">],</span>
<span class="n">auxterm</span><span class="o">=</span><span class="p">[</span><span class="s">"Potential"</span><span class="p">,</span> <span class="s">"Temperature"</span><span class="p">],</span>
<span class="n">auxdata</span><span class="o">=</span><span class="n">aux</span><span class="p">)</span>
<span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">.</span><span class="n">add_auxiliary</span><span class="p">(</span><span class="n">auxterm</span><span class="o">=</span><span class="n">aux</span><span class="p">.</span><span class="n">terms</span><span class="p">,</span> <span class="n">auxdata</span><span class="o">=</span><span class="n">aux</span><span class="p">)</span>
</code></pre></div></div>
<p>Having associated the data with the trajectory, some further analyses become much
easier. It will be very simple, for instance, to select subsets of trajectories
based on energy data. The following selects only those frames of a trajectory
below a certain potential energy threshold:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">selected_frames</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span>
<span class="p">[</span><span class="n">ts</span><span class="p">.</span><span class="n">frame</span> <span class="k">for</span> <span class="n">ts</span> <span class="ow">in</span> <span class="n">u</span><span class="p">.</span><span class="n">trajectory</span> <span class="k">if</span> <span class="n">ts</span><span class="p">.</span><span class="n">aux</span><span class="p">.</span><span class="n">epot</span> <span class="o">&lt;</span> <span class="n">some_threshold</span><span class="p">])</span>
</code></pre></div></div>
<p>This array of frames can then be used for trajectory slicing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">subset</span> <span class="o">=</span> <span class="n">u</span><span class="p">.</span><span class="n">trajectory</span><span class="p">[</span><span class="n">selected_frames</span><span class="p">]</span>
</code></pre></div></div>
<p>Should a user not be interested in such functionality, the energy readers can
also be used to merely “unpack” selected energy terms for plotting.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">epot</span> <span class="o">=</span> <span class="n">aux</span><span class="p">.</span><span class="n">unpack</span><span class="p">(</span><span class="s">"Potential"</span><span class="p">)</span>
<span class="n">bond_terms</span><span class="p">,</span> <span class="n">angle_terms</span> <span class="o">=</span> <span class="n">aux</span><span class="p">.</span><span class="n">unpack</span><span class="p">([</span><span class="s">"Bond"</span><span class="p">,</span> <span class="s">"Angle"</span><span class="p">])</span>
</code></pre></div></div>
<p>Having written a few energy readers, I want to compile the lessons I will have learned
doing so into a clear, easy-to-follow tutorial on how to write new AuxReaders to facilitate
further development of this useful but underutilised system (it was first implemented
6 years ago, and only now will new formats be supported).</p>
<h1 id="strategy"><a href="https://github.com/MDAnalysis/mdanalysis/issues/3714">Strategy</a></h1>
<p>To start this project off, I will work on the implementation of an
<a href="https://github.com/MDAnalysis/mdanalysis/issues/3629">AuxReader for GROMACS’ EDR format</a> for the
selfish reason that GROMACS is the engine I use in my work the most.
<a href="https://github.com/jbarnoud">JBarnoud’s</a> <a href="https://github.com/MDAnalysis/panedr">panedr</a>
will be crucial for this work. It is a Python package that can read the binary EDR files and
return their content as Pandas DataFrames. It works perfectly with all recent versions
of GROMACS, but Pandas as a dependency should not be introduced to MDAnalysis. My
first task will therefore be a minor rework of panedr so that it returns the energy data
as a dictionary of arrays instead of a DataFrame. Once that is done, I can start working
on a new AuxReader for EDR data that uses the modified panedr and the AuxReader base class
to make the energy data available within MDAnalysis. This will be accompanied by
rigorous documentation and test writing.</p>
<p>In general, two main groups of tests will be needed for the energy readers. For one,
the actual parsing of the energy files has to be tested to make sure the files are
read properly. Secondly, the association of the correct data with the appropriate
trajectory time steps has to be verified.</p>
<p>After finishing work on the EDRReader, I will move on to different MD engines. The
next priority will be either Amber or NAMD. Both of these write energy data to
plain text files, and I can look to a large number of other parsers in MDAnalysis
for inspiration on how to tackle them. I will initially limit myself to one of the formats,
with the option of coming back to work on the other one if I still have time towards
the end of the summer. The reason for this is that these formats are quite similar,
and I ideally want to implement a number of different things.</p>
<p>Next on my list is an AuxReader not for energy data, but instead a general reader
for NumPy arrays. Such a reader would be very interesting because it would greatly
increase the flexibility of the aux module, allowing users to associate any data
of their choice with their trajectories. As an example, users could link the
results of Analysis objects like RMSD values to the timesteps.</p>
<p>Lastly, it would be great to have a CSV reader as well. This would add yet more
flexibility on one hand, and on the other hand provide a way for OpenMM’s energy
data as generated by <a href="http://docs.openmm.org/7.0.0/api-python/generated/simtk.openmm.app.statedatareporter.StateDataReporter.html">StateDataReporters</a>
to be read.</p>
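<p>To show what reading such a file could look like, here is a minimal sketch based on the standard library's <code class="language-plaintext highlighter-rouge">csv</code> module. The sample text mimics comma-separated reporter output with a header line that starts with <code class="language-plaintext highlighter-rouge">#</code>; the exact columns and all values are made up, and <code class="language-plaintext highlighter-rouge">read_state_data</code> is a hypothetical helper, not part of MDAnalysis or OpenMM.</p>

```python
import csv

# Made-up sample in the style of comma-separated reporter output.
SAMPLE = '''#"Step","Potential Energy (kJ/mole)","Temperature (K)"
1000,-5021.3,298.7
2000,-5034.9,301.2
'''


def read_state_data(text):
    """Parse the text into {column name: list of floats}."""
    lines = text.splitlines()
    # The header is marked with a leading '#', which is not part of a name.
    header = next(csv.reader([lines[0].lstrip("#")]))
    columns = {name: [] for name in header}
    for row in csv.reader(lines[1:]):
        for name, value in zip(header, row):
            columns[name].append(float(value))
    return columns


data = read_state_data(SAMPLE)
```

<p>A real reader would additionally have to work out which column holds the time information so that the data can be matched to trajectory frames.</p>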
<p>And finally, as mentioned above, the experience I will have gained in implementing
these AuxReaders should be very helpful in writing detailed documentation and a guide
on how to write additional AuxReaders, helping to make the aux system find more use.</p>
<h1 id="timeline">Timeline</h1>
<p>I will be working part-time on this project for a duration of 14 weeks. At the end
of week 6, so roughly in mid- to late-July, I hope to have finished all work on the
EDRReader. Every reader after this one should take less work, and I should be faster, so
that a reader for Amber or NAMD will hopefully be finished by week 9 in mid-August.
This is the bare minimum of what I want to achieve before I start my work on the
guide and documentation, which I will do as the last task before my final submission
deadline on the 26th of September. Depending on how much time I will have left then, I will hopefully
be able to also include the NumPy Reader, the CSV/OpenMM Reader, and the NAMD or Amber
reader, before I start work on the guide.</p>
<p>I am very much looking forward to properly getting started next week! Stay tuned
for the next post on my progress with the EDRReader in a few weeks.</p>
<p>Welcome! (2022-05-26)</p>
<p>Hello and welcome to my blog!</p>
<p>Here, I will post regular reports on my Google Summer of Code project with MDAnalysis. I am very excited that my project was chosen, and very much looking forward to getting started.</p>
<p>More details to follow soon!</p>