class TRAMDataset

class deeptime.markov.msm.TRAMDataset(dtrajs, bias_matrices, ttrajs=None, n_therm_states=None, n_markov_states=None, lagtime=1, count_mode='sliding')

Dataset for organizing data and obtaining properties from data that are needed for TRAM. The minimum required parameters for constructing a TRAMDataset are the dtrajs and bias_matrices. In this case, ttrajs are inferred from the shape of the dtrajs, by assuming each trajectory in dtrajs corresponds to a unique thermodynamic state, with the index corresponding to the index of occurrence in dtrajs.

The values at identical indices in dtrajs, ttrajs and bias_matrices correspond to the sample. For example, at indices (i, n) we find information about the \(n\)-th sample in trajectory \(i\). dtrajs[i][n] gives us the index of the Markov state the sample falls into. ttrajs[i][n] gives us the thermodynamic state the sample was sampled at (which may be different from other samples in the trajectory due to a replica exchange swap occurring at index \(n\)). Finally, bias_matrices[i][n] gives us for each thermodynamic state, the energy of the sample evaluated at that thermodynamic state. In other words: bias_matrices[i][n][k] gives us \(b^k(x_i^n)\), i.e. the bias energy as if the sample were observed in thermodynamic state \(k\).

Parameters:
  • dtrajs (array-like(ndarray(n)), int) – The discrete trajectories in the form of a list or array of numpy arrays. dtrajs[i] contains one trajectory. dtrajs[i][n] contains the Markov state index that the \(n\)-th sample from the \(i\)-th trajectory was binned into. Each of the dtrajs can be of variable length.

  • bias_matrices (array-like(ndarray(n,m)), float) – The bias energy matrices. bias_matrices[i][n, k] contains the bias energy of the \(n\)-th sample from the \(i\)-th trajectory, evaluated at thermodynamic state \(k\), i.e. \(b^k(x_{i,n})\). The bias energy matrices should have the same size as dtrajs in both the first and second dimension. The third dimension is of size n_therm_state, i.e. for each sample, the bias energy in every thermodynamic state is calculated and stored in the bias_matrices.

  • ttrajs (array-like(ndarray(n)), int, optional) – ttrajs[i] contains for each sample in the \(i\)-th trajectory the index of the thermodynamic state that sample was sampled at. If ttrajs = None, we assume no replica exchange was done. In this case we assume each trajectory corresponds to a unique thermodynamic state, and n_therm_states equals the size of dtrajs.

  • n_therm_states (int, optional) – if n_therm_states is given, the indices in ttrajs are checked to lie within n_therm_states bound. Otherwise, n_therm_states are inferred from the highest occurring index in ttrajs.

  • n_markov_states (int, optional) – if n_markov_states is given, the indices in dtrajs are checked to lie within n_markov_states bound. Otherwise, n_markov_states are inferred from the highest occurring index in dtrajs.

  • lagtime (int, optional, default=1) – Integer lag time at which transitions are counted.

  • count_mode (str, optional, default='sliding') –

    One of “sample”, “sliding”, “sliding-effective”, and “effective”.

    • ”sample” strides the trajectory with lagtime \(\tau\) and uses the strided counts as transitions.

    • ”sliding” uses a sliding window approach, yielding counts that are statistically correlated and too large by a factor of \(\tau\); in uncertainty estimation this yields wrong uncertainties.

    • ”sliding-effective” takes “sliding” and divides it by \(\tau\), which can be shown to provide a likelihood that is the geometrical average over shifted subsamples of the trajectory, \((s_1,\:s_{tau+1},\:...),\:(s_2,\:t_{tau+2},\:...),\) etc. This geometrical average converges to the correct likelihood in the statistical limit [1].

    • ”effective” uses an estimate of the transition counts that are statistically uncorrelated. Recommended when estimating Bayesian MSMs.

Attributes

connectivity_options

All possible connectivity modes

n_markov_states

The number of Markov states.

n_therm_states

The number of thermodynamic states.

state_counts

ndarray(n, m) The state counts histogram.

tram_input

The TRAMInput object containing the data needed for estimation.

transition_counts

The transition counts matrices.

Methods

check_against_model(model)

Check the number of thermodynamic states of the model against that of the dataset.

restrict_to_largest_connected_set([...])

Find the largest connected set of Markov states based on the connectivity settings and the input data, and restrict the input data to this connected set.

restrict_to_submodel(submodel)

Restrict the count matrices and dtrajs to the given submodel.

setflags([write])

Set writeable flags for contained arrays.

check_against_model(model)

Check the number of thermodynamic states of the model against that of the dataset. The number of thermodynamic states in the dataset have to be smaller than or equal to those of the model, otherwise the ttrajs and/or dtrajs would be out of bounds.

Parameters:

model (TRAMModel) – The model to check the data against.

Raises:

ValueError – If the data encompasses more Markov and/or thermodynamic states than the model.

restrict_to_largest_connected_set(connectivity='post_hoc_RE', connectivity_factor=1.0, progress=None)

Find the largest connected set of Markov states based on the connectivity settings and the input data, and restrict the input data to this connected set. The largest connected set is computed based on the data, and self.connectivity and self.connectivity_factor. The data is then restricted to the largest connected set and count models are recomputed.

Parameters:
  • connectivity (str, optional, default="post_hoc_RE") –

    One of None, “post_hoc_RE”, “BAR_variance”, or “summed_count_matrix”. Defines what should be considered a connected set in the joint (product) space of conformations and thermodynamic ensembles. If connectivity=None, the data is not restricted.

    • ”post_hoc_RE” : It is required that every state in the connected set can be reached by following a pathway of reversible transitions or jumping between overlapping thermodynamic states while staying in the same Markov state. A reversible transition between two Markov states (within the same thermodynamic state \(k\)) is a pair of Markov states that belong to the same strongly connected component of the count matrix (from thermodynamic state \(k\)). Two thermodynamic states \(k\) and \(l\) are defined to overlap at Markov state \(i\) if a replica exchange simulation [2] restricted to state \(i\) would show at least one transition from \(k\) to \(l\) or one transition from from \(l\) to \(k\). The expected number of replica exchanges is estimated from the simulation data. The minimal number required of replica exchanges per Markov state can be increased by decreasing connectivity_factor.

    • ”BAR_variance” : like ‘post_hoc_RE’ but with a different condition to define the thermodynamic overlap based on the variance of the BAR estimator [3]. Two thermodynamic states \(k\) and \(l\) are defined to overlap at Markov state \(i\) if the variance of the free energy difference \(\Delta f_{kl}\) computed with BAR (and restricted to conformations form Markov state \(i\)) is less or equal than one. The minimally required variance can be controlled with connectivity_factor.

    • ”summed_count_matrix” : all thermodynamic states are assumed to overlap. The connected set is then computed by summing the count matrices over all thermodynamic states and taking its largest strongly connected set. Not recommended!

  • connectivity_factor (float, optional, default=1.0) – Only needed if connectivity=”post_hoc_RE” or “BAR_variance”. Values greater than 1.0 weaken the connectivity conditions. For ‘post_hoc_RE’ this multiplies the number of hypothetically observed transitions. For ‘BAR_variance’ this scales the threshold for the minimal allowed variance of free energy differences.

  • progress (object) – Progress bar object that TRAMDataset will call to indicate progress to the user. Tested for a tqdm progress bar. The interface is checked via supports_progress_interface.

Raises:

ValueError – If the connectivity type is unknown.

restrict_to_submodel(submodel)

Restrict the count matrices and dtrajs to the given submodel. The submodel is either given in form of a list of Markov state indices, or as a TransitionCountModel. All dtrajs sample indices that do not occur in the submodel will be set to -1. The count_models are recomputed after restricting the dtrajs to the submodel.

Parameters:

submodel (TransitionCountModel or list(int) or ndarray(int)) – The TransitionCountModel, or the Markov states indices that the dataset needs to be restricted to, .

setflags(write=True)

Set writeable flags for contained arrays.

connectivity_options = ['post_hoc_RE', 'BAR_variance', 'summed_count_matrix', None]

All possible connectivity modes

property n_markov_states

The number of Markov states.

property n_therm_states

The number of thermodynamic states.

property state_counts

ndarray(n, m) The state counts histogram. state_counts[k] contains the state histogram for thermodynamic state \(k\), based on the TransitionCountModel of state \(k\). state_counts[k][i]` equals the number of samples that fall into Markov state \(i\), sampled at thermodynamic state \(k\).

The state counts for every thermodynamic state are shaped to contain all possible markov states, even the ones without counts in that thermodynamic state. This is done so that the underlying c++ estimator receives count matrices that are all the same shape, which is easier to handle (matrices are padded with zeros for all empty states that got dropped by the TransitionCountModels).

property tram_input

The TRAMInput object containing the data needed for estimation. The data is restructured to allow parallelization over Markov states. The dtrajs are only used to see which Markov state the sample biases belong to. Ordering of the data doesn’t matter (transition information is already stored in the count matrices) so we can restructure the data. After restructuring, each bias matrix in the bias_list corresponds to all sample biases for samples that fell into the Markov state of which the index corresponds to the index of the bias matrix in the list. Or: bias_list[i] contains biases for all samples that fell in Markov state i.

property transition_counts

The transition counts matrices. transition_counts[k] contains the transition counts for thermodynamic state \(k\), based on the TransitionCountModel of state \(k\). transition_counts[k][i][j] equals the number of observed transitions from Markov state \(i\) to Markov state \(j\), in thermodynamic state \(k\).

The transition counts for every thermodynamic state are shaped to contain all possible markov states, even the ones without counts in that thermodynamic state. This is done so that the underlying c++ estimator receives count matrices that are all the same shape, which is easier to handle (matrices are padded with zeros for all empty states that got dropped by the TransitionCountModels).

Getter:

the transition counts

Type:

ndarray(n, m, m)