class KMeans

class deeptime.clustering.KMeans(n_clusters: int, max_iter: int = 500, metric='euclidean', tolerance=1e-05, init_strategy: str = 'kmeans++', fixed_seed=False, n_jobs=None, initial_centers=None, progress=None)

Clusters the data in a way that minimizes the cost function

\[C(S) = \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in S_i} \left\| \mathbf{x}_j - \boldsymbol\mu_i \right\|^2 \]

where \(S_i\) are the clusters with centers of mass \(\boldsymbol\mu_i\), and \(\mathbf{x}_j\) are the data points assigned to cluster \(S_i\).
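As a minimal illustration of the cost function above, the following NumPy sketch evaluates \(C(S)\) for a given assignment of points to clusters. The helper name `kmeans_cost` and the toy data are assumptions made for this example; they are not part of deeptime.

```python
import numpy as np


def kmeans_cost(data, centers, assignments):
    """Sum of squared Euclidean distances between each point and its assigned center."""
    diffs = data - centers[assignments]
    return float(np.sum(diffs ** 2))


# Four points forming two obvious clusters.
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
assignments = np.array([0, 0, 1, 1])

# The centers of mass are the per-cluster means.
centers = np.array([data[assignments == i].mean(axis=0) for i in range(2)])

cost = kmeans_cost(data, centers, assignments)
```

Each point lies at distance 1 from its center of mass, so the four squared distances add up to a cost of 4.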

The outcome depends strongly on the initialization. Two strategies are offered: “kmeans++” and “uniform”. The latter picks initial centers uniformly at random from the provided data set. The former tries to find an initialization that covers the spatial configuration of the data set more or less uniformly. For details see [1].
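The difference between the two strategies can be sketched in plain NumPy. This is not deeptime's implementation: the function names are hypothetical, drawing without replacement in the uniform case is an assumption, and the second function follows the usual textbook description of k-means++ seeding (each new center is sampled with probability proportional to its squared distance to the nearest center chosen so far).

```python
import numpy as np


def init_uniform(data, k, rng):
    """Pick k distinct data points uniformly at random (assumed: without replacement)."""
    idx = rng.choice(len(data), size=k, replace=False)
    return data[idx]


def init_kmeans_pp(data, k, rng):
    """Textbook k-means++ seeding on squared Euclidean distances."""
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        chosen = np.asarray(centers)
        # Squared distance of every point to its nearest already-chosen center.
        d2 = ((data[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        probs = d2 / d2.sum()
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.asarray(centers)


rng = np.random.default_rng(0)
# Three well-separated blobs in two dimensions.
data = np.concatenate(
    [rng.normal(loc=c, scale=0.1, size=(50, 2)) for c in (0.0, 5.0, 10.0)]
)

centers_uniform = init_uniform(data, 3, rng)
centers_pp = init_kmeans_pp(data, 3, rng)
```

Because already-chosen points have zero distance to themselves, the k-means++ sketch never picks the same point twice, and far-away regions of the data are favored for the next center.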

Parameters:
  • n_clusters (int) – number of cluster centers.

  • max_iter (int, default=500) – maximum number of iterations before stopping.

  • metric (str, default='euclidean') – Metric to use during clustering; the default is the Euclidean metric. For a list of available metrics, see the metric registry.

  • tolerance (float, default=1e-5) –

    Stop iteration when the relative change in the cost function (inertia)

    \[C(S) = \sum_{i=1}^{k} \sum_{\mathbf x \in S_i} \left\| \mathbf x - \boldsymbol\mu_i \right\|^2 \]

    is smaller than tolerance.

  • init_strategy (str, default='kmeans++') – one of ‘kmeans++’, ‘uniform’; determines how the initial cluster centers are chosen.

  • fixed_seed (bool or int, default=False) – If True, the seed is set to 42. If False, time-based seeding is used. If an integer is given, it is used to initialize the random generator.

  • n_jobs (int or None, default=None) – Number of threads to use during clustering and assignment of data. If None, one core will be used.

  • initial_centers (None or np.ndarray[k, dim], default=None) – Used to resume the k-means iteration. Note that if this is set, init_strategy is ignored and the centers are passed directly to the k-means iteration algorithm.

  • progress (object) – Progress bar object that KMeans will call to indicate progress to the user. Tested for a tqdm progress bar. The interface is checked via supports_progress_interface.
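How max_iter, tolerance and initial_centers interact can be sketched with a plain Lloyd-style iteration in NumPy. This is an illustration under stated assumptions, not deeptime's implementation: the function `lloyd` is hypothetical, and the relative change is assumed here to be \(|C_{\text{prev}} - C| / C_{\text{prev}}\), which the text above does not spell out.

```python
import numpy as np


def lloyd(data, initial_centers, max_iter=500, tolerance=1e-5):
    """Lloyd iteration; returns (centers, cost, n_iterations, converged)."""
    centers = np.array(initial_centers, dtype=float)
    prev_cost = None
    cost = np.inf
    for iteration in range(1, max_iter + 1):
        # Assignment step: nearest center under squared Euclidean distance.
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its members.
        # Empty clusters keep their previous center.
        for i in range(len(centers)):
            members = data[labels == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
        cost = float(((data - centers[labels]) ** 2).sum())
        if prev_cost is not None:
            # Assumed definition of the relative change in the cost function.
            rel_change = abs(prev_cost - cost) / max(prev_cost, np.finfo(float).tiny)
            if rel_change < tolerance:
                return centers, cost, iteration, True
        prev_cost = cost
    return centers, cost, max_iter, False


data = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
)
# Explicit initial centers take the place of any initialization strategy.
centers, cost, n_iter, converged = lloyd(data, initial_centers=[[0.0, 0.0], [10.0, 10.0]])
```

In this sketch, passing the returned centers back in as initial centers corresponds to resuming an iteration that was stopped by max_iter.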

References

Attributes

fixed_seed

seed for random choice of initial cluster centers.

has_model

Property reporting whether this estimator contains an estimated model.

init_strategy

Strategy to get an initial guess for the centers.

initial_centers

Yields initial centers which override init_strategy.

max_iter

Maximum number of clustering iterations before stop.

metric

The metric that is used for clustering.

model

Shortcut to fetch_model().

n_clusters

The number of cluster centers to use.

n_jobs

Number of threads to use during clustering and assignment of data.

tolerance

Stopping criterion for the k-means iteration.

Methods

fetch_model()

Fetches the current model.

fit(data[, initial_centers, ...])

Perform the clustering.

fit_fetch(data, **kwargs)

Fits the internal model on data and subsequently fetches it in one call.

fit_transform(data[, fit_options, ...])

Fits a model which simultaneously functions as transformer and subsequently transforms the input data.

get_params([deep])

Get the parameters.

set_params(**params)

Set the parameters of this estimator.

transform(data, **kw)

Transforms a trajectory to a discrete trajectory by assigning each frame to its respective cluster center.

__call__(*args, **kwargs)

Call self as a function.

fetch_model() Optional[KMeansModel]

Fetches the current model. Can be None in case fit() was not called yet.

Returns:

model – the latest estimated model

Return type:

KMeansModel or None

fit(data, initial_centers=None, callback_init_centers=None, callback_loop=None, n_jobs=None)

Perform the clustering.

Parameters:
  • data (np.ndarray or list(np.ndarray)) – data to be clustered. Each ndarray should have shape (N, D), where N is the number of data points and D the dimension. For one-dimensional data, a shape of (N,) also works. If data is a list of ndarrays, they are concatenated into one ndarray.

  • initial_centers (np.ndarray or None) – Optional cluster center initialization that supersedes the estimator’s initial_centers attribute

  • callback_init_centers (function or None) – used for kmeans++ initialization to indicate progress, called once per assigned center.

  • callback_loop (function or None) – used to indicate progress on kmeans iterations, called once per iteration.

  • n_jobs (None or int) – if not None, supersedes the n_jobs attribute of the estimator instance; must be non-negative

Returns:

self – reference to self

Return type:

KMeans
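The accepted input shapes for data can be illustrated with a small NumPy helper that brings all of them into a single (N, D) array. The helper `as_fit_input` is hypothetical, and expanding one-dimensional input to a column is an assumption made for this sketch; deeptime's internal handling may differ.

```python
import numpy as np


def as_fit_input(data):
    """Return one (N, D) array from an ndarray or a list of ndarrays."""
    if isinstance(data, (list, tuple)):
        arrays = [np.asarray(a, dtype=float) for a in data]
        # Treat one-dimensional trajectories as N points in one dimension.
        arrays = [a[:, None] if a.ndim == 1 else a for a in arrays]
        return np.concatenate(arrays, axis=0)
    arr = np.asarray(data, dtype=float)
    return arr[:, None] if arr.ndim == 1 else arr


# Two trajectories of different length but equal dimension.
trajectories = [np.zeros((3, 2)), np.ones((5, 2))]
stacked = as_fit_input(trajectories)

# One-dimensional data given with shape (N,).
column = as_fit_input(np.arange(4.0))
```

Here `stacked` holds all eight frames of both trajectories in order, and `column` holds the four scalar samples as four one-dimensional points.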

fit_fetch(data, **kwargs)

Fits the internal model on data and subsequently fetches it in one call.

Parameters:
  • data (array_like) – Data that is used to fit the model.

  • **kwargs – Additional arguments to fit().

Returns:

The estimated model.

Return type:

model

fit_transform(data, fit_options=None, transform_options=None)

Fits a model which simultaneously functions as transformer and subsequently transforms the input data. The estimated model can be accessed by calling fetch_model().

Parameters:
  • data (array_like) – The input data.

  • fit_options (dict, optional, default=None) – Optional keyword arguments passed on to the fit method.

  • transform_options (dict, optional, default=None) – Optional keyword arguments passed on to the transform method.

Returns:

output – Transformed data.

Return type:

array_like

get_params(deep=False)

Get the parameters.

Returns:

params – Parameter names mapped to their values.

Return type:

mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

object
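The `<component>__<parameter>` naming convention can be illustrated with a short sketch that separates an estimator's own parameters from those addressed to a nested component. The helper `split_nested_params` and the component name `clustering` are hypothetical and only demonstrate the naming scheme, not how set_params is implemented.

```python
def split_nested_params(params):
    """Split parameters into own ones and per-component nested ones."""
    own, nested = {}, {}
    for key, value in params.items():
        # Everything before the first double underscore names the component.
        component, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"n_clusters": 5, "clustering__max_iter": 100, "clustering__tolerance": 1e-6}
)
```

In this example `own` contains `n_clusters`, while `nested` maps `clustering` to its `max_iter` and `tolerance` values.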

transform(data, **kw) ndarray

Transforms a trajectory to a discrete trajectory by assigning each frame to its respective cluster center.

Parameters:
  • data ((T, n) ndarray) – trajectory with T frames and data points in n dimensions.

  • **kw – ignored kwargs for scikit-learn compatibility

Returns:

discrete_trajectory – discrete trajectory

Return type:

(T, 1) ndarray

See also

ClusterModel.transform

transform method of cluster model, implicitly called.
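Assigning each frame to its nearest cluster center can be sketched in NumPy as follows. The function `assign_to_centers` is a hypothetical helper that assumes the Euclidean metric and returns one integer label per frame as a flat array; it is not the library's transform method.

```python
import numpy as np


def assign_to_centers(trajectory, centers):
    """Return the index of the nearest center for every frame."""
    d2 = ((trajectory[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


centers = np.array([[0.0, 0.0], [5.0, 5.0]])
trajectory = np.array([[0.1, -0.2], [4.8, 5.1], [0.3, 0.4], [6.0, 4.0]])

dtraj = assign_to_centers(trajectory, centers)
```

For this toy trajectory the frames alternate between the two centers, giving the discrete trajectory 0, 1, 0, 1.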

property fixed_seed

seed for random choice of initial cluster centers.

Fix this to get reproducible results in conjunction with n_jobs=0. The latter is needed because parallel execution reintroduces non-deterministic behaviour.
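The seeding rule described for fixed_seed (True maps to 42, False to a time-based seed, an integer is used as given) can be sketched as follows. The helper `resolve_seed` is hypothetical, and using NumPy's generator here is an assumption chosen only to show why a fixed seed gives repeatable random choices.

```python
import time

import numpy as np


def resolve_seed(fixed_seed):
    """Map a fixed_seed-style value to an integer seed."""
    # Booleans are checked first because bool is a subclass of int.
    if fixed_seed is True:
        return 42
    if fixed_seed is False:
        return int(time.time())  # time-based seed, differs between runs
    return int(fixed_seed)


# Two generators with the same fixed seed make identical choices.
rng_a = np.random.default_rng(resolve_seed(True))
rng_b = np.random.default_rng(resolve_seed(True))
picks_a = rng_a.choice(100, size=5, replace=False)
picks_b = rng_b.choice(100, size=5, replace=False)
```

With a fixed seed, `picks_a` and `picks_b` are identical, which is the property that makes the choice of initial centers reproducible.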

property has_model: bool

Property reporting whether this estimator contains an estimated model. This assumes that the model is initialized with None otherwise.

Type:

bool

property init_strategy

Strategy to get an initial guess for the centers.

Getter:

Yields the strategy, can be one of “kmeans++” or “uniform”.

Setter:

Setter for the initialization strategy that is used when no initial centers are provided.

Type:

string

property initial_centers: Optional[ndarray]

Yields initial centers which override init_strategy. Can be used to resume k-means iterations.

Getter:

The initial centers or None.

Setter:

Sets the initial centers. If not None, the array is expected to have length n_clusters.

Type:

(k, n) ndarray or None

property max_iter: int

Maximum number of clustering iterations before stop.

Getter:

Yields the maximum number of clustering iterations

Setter:

Sets the maximum number of clustering iterations.

Type:

int

property metric: str

The metric that is used for clustering.

See also

_clustering_bindings.Metric

The metric class, can be subclassed

metrics

Metrics registry which maps from metric label to actual implementation

property model

Shortcut to fetch_model().

property n_clusters: int

The number of cluster centers to use.

Getter:

Yields the number of cluster centers.

Setter:

Sets the number of cluster centers.

Type:

int

property n_jobs: int

Number of threads to use during clustering and assignment of data.

Getter:

Yields the number of threads. If -1, all available threads are used.

Setter:

Sets the number of threads to use. If -1, all available threads are used; if None, one thread is used.

Type:

int

property tolerance: float

Stopping criterion for the k-means iteration. When the relative change of the cost function between two iterations is less than the tolerance, the algorithm is considered to be converged.

Getter:

Yields the currently set tolerance.

Setter:

Sets a new tolerance.

Type:

float