Clustering with md.rmsd() and scipy.cluster.hierarchy()

In this example, we cluster our alanine dipeptide trajectory using the RMSD distance metric and hierarchical clustering.

In [1]:
from __future__ import print_function
%matplotlib inline
import mdtraj as md
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy
from scipy.spatial.distance import squareform
Matplotlib created a temporary config/cache directory at /build/matplotlib-tl0tztmx because the default path (/homeless-shelter/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.

Let's load up our trajectory. This is the trajectory that we generated in the "Running a simulation in OpenMM and analyzing the results with mdtraj" example. The first step is to build the rmsd cache, which precalculates some values for the RMSD computation.

In [2]:
traj = md.load('ala2.h5')

Lets compute all pairwise rmsds between conformations.

In [3]:
distances = np.empty((traj.n_frames, traj.n_frames))
for i in range(traj.n_frames):
    distances[i] = md.rmsd(traj, traj, i)
print('Max pairwise rmsd: %f nm' % np.max(distances))
Max pairwise rmsd: 0.188493 nm

scipy.cluster implements the average linkage algorithm (among others)

In [4]:
# Clustering only accepts reduced form. Squareform's checks are too stringent
assert np.all(distances - distances.T < 1e-6)
reduced_distances = squareform(distances, checks=False)
In [5]:
linkage = scipy.cluster.hierarchy.linkage(reduced_distances, method='average')

Lets plot the resulting dendrogram.

In [6]:
plt.title('RMSD Average linkage hierarchical clustering')
_ = scipy.cluster.hierarchy.dendrogram(linkage, no_labels=True, count_sort='descendent')

(clustering.ipynb; clustering_eval.ipynb; clustering.py)