class KMeans
- class deeptime.clustering.KMeans(n_clusters: int, max_iter: int = 500, metric='euclidean', tolerance=1e-05, init_strategy: str = 'kmeans++', fixed_seed=False, n_jobs=None, initial_centers=None, progress=None)
Clusters the data in a way that minimizes the cost function
\[C(S) = \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in S_i} \left\| \mathbf{x}_j - \boldsymbol\mu_i \right\|^2 \]where \(S_i\) are clusters with centers of mass \(\mu_i\) and \(\mathbf{x}_j\) data points associated to their clusters.
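As a small illustration of the cost function, the following NumPy sketch evaluates \(C(S)\) for four points split into two clusters. The helper name `kmeans_cost` and the toy data are assumptions made for this example and are not part of deeptime.

```python
import numpy as np

def kmeans_cost(data, centers, assignments):
    """Sum of squared Euclidean distances of each point to its assigned center."""
    diffs = data - centers[assignments]
    return float(np.sum(diffs ** 2))

# Four points in two clusters (hypothetical toy data).
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
assignments = np.array([0, 0, 1, 1])

# Centers of mass mu_i of each cluster S_i.
centers = np.array([data[assignments == i].mean(axis=0) for i in range(2)])

cost = kmeans_cost(data, centers, assignments)
print(cost)  # every point is at distance 1 from its center, so the cost is 4.0
```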
The outcome depends strongly on the initialization. Two strategies are offered: “kmeans++” and “uniform”. The latter picks initial centers uniformly at random from the provided data set. The former tries to find an initialization that covers the spatial configuration of the data set more or less uniformly. For details see [1].
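To make the difference between the two strategies concrete, here is a minimal NumPy sketch. It uses the textbook k-means++ seeding rule (each new center is drawn with probability proportional to the squared distance to the nearest center chosen so far) and draws the uniform centers without replacement. Both choices, along with the function names, are assumptions for illustration and may differ from what deeptime does internally.

```python
import numpy as np

def init_uniform(data, k, rng):
    """Pick k distinct data points uniformly at random."""
    idx = rng.choice(len(data), size=k, replace=False)
    return data[idx]

def init_kmeans_pp(data, k, rng):
    """Textbook k-means++ seeding: favor points far from already chosen centers."""
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        chosen = np.asarray(centers)
        # Squared distance of every point to its nearest chosen center.
        d2 = ((data[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.asarray(centers)

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))

uniform_centers = init_uniform(data, 3, rng)
pp_centers = init_kmeans_pp(data, 3, rng)
```

Because an already chosen point has zero distance to itself, the k-means++ rule never selects the same data point twice, which tends to spread the initial centers over the data.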
- Parameters:
n_clusters (int) – number of cluster centers.
max_iter (int, default=500) – maximum number of iterations before stopping.
metric (str, default='euclidean') – Metric to use during clustering; the default is the Euclidean metric. For a list of available metrics, see the metric registry.
tolerance (float, default=1e-5) – Stop iteration when the relative change in the cost function (inertia)
\[C(S) = \sum_{i=1}^{k} \sum_{\mathbf x \in S_i} \left\| \mathbf x - \boldsymbol\mu_i \right\|^2 \]
is smaller than tolerance.
init_strategy (str, default='kmeans++') – one of ‘kmeans++’, ‘uniform’; determines how the initial cluster centers are chosen.
fixed_seed (bool or int, default=False) – if True, the seed is set to 42; if False, time-based seeding is used. If an integer is given, it is used to seed the random generator.
n_jobs (int or None, default=None) – Number of threads to use during clustering and assignment of data. If None, one thread is used.
initial_centers (None or np.ndarray[k, dim], default=None) – This is used to resume the kmeans iteration. Note that if this is set, the init_strategy is ignored and the centers are passed directly to the kmeans iteration algorithm.
progress (object) – Progress bar object that KMeans will call to indicate progress to the user. Tested for a tqdm progress bar. The interface is checked via supports_progress_interface.
References
See also
Attributes
fixed_seed – Seed for random choice of initial cluster centers.
has_model – Property reporting whether this estimator contains an estimated model.
init_strategy – Strategy to get an initial guess for the centers.
initial_centers – Yields initial centers which override the init_strategy().
max_iter – Maximum number of clustering iterations before stop.
metric – The metric that is used for clustering.
model – Shortcut to fetch_model().
n_clusters – The number of cluster centers to use.
n_jobs – Number of threads to use during clustering and assignment of data.
tolerance – Stopping criterion for the k-means iteration.
Methods
fetch_model() – Fetches the current model.
fit(data[, initial_centers, ...]) – Perform the clustering.
fit_fetch(data, **kwargs) – Fits the internal model on data and subsequently fetches it in one call.
fit_transform(data[, fit_options, ...]) – Fits a model which simultaneously functions as transformer and subsequently transforms the input data.
get_params([deep]) – Get the parameters.
set_params(**params) – Set the parameters of this estimator.
transform(data, **kw) – Transforms a trajectory to a discrete trajectory by assigning each frame to its respective cluster center.
- __call__(*args, **kwargs)
Call self as a function.
- fetch_model() → Optional[KMeansModel]
Fetches the current model. Can be None if fit() has not been called yet.
- Returns:
model – the latest estimated model
- Return type:
KMeansModel or None
- fit(data, initial_centers=None, callback_init_centers=None, callback_loop=None, n_jobs=None)
Perform the clustering.
- Parameters:
data (np.ndarray or list(np.ndarray)) – data to be clustered, shape of ndarray(s) should be (N, D), where N is the number of data points, D the dimension. In case of one-dimensional data, a shape of (N,) also works. If data is in shape of a list of ndarrays, the ndarrays will be concatenated to one ndarray.
initial_centers (np.ndarray or None) – Optional cluster center initialization that supersedes the estimator’s initial_centers attribute
callback_init_centers (function or None) – used for kmeans++ initialization to indicate progress, called once per assigned center.
callback_loop (function or None) – used to indicate progress on kmeans iterations, called once per iteration.
n_jobs (None or int) – if not None, supersedes the n_jobs attribute of the estimator instance; must be non-negative
- Returns:
self – reference to self
- Return type:
- fit_fetch(data, **kwargs)
Fits the internal model on data and subsequently fetches it in one call.
- Parameters:
data (array_like) – Data that is used to fit the model.
**kwargs – Additional arguments to fit().
- Returns:
The estimated model.
- Return type:
model
- fit_transform(data, fit_options=None, transform_options=None)
Fits a model which simultaneously functions as transformer and subsequently transforms the input data. The estimated model can be accessed by calling fetch_model().
- Parameters:
data (array_like) – The input data.
fit_options (dict, optional, default=None) – Optional keyword arguments passed on to the fit method.
transform_options (dict, optional, default=None) – Optional keyword arguments passed on to the transform method.
- Returns:
output – Transformed data.
- Return type:
array_like
- get_params(deep=False)
Get the parameters.
- Returns:
params – Parameter names mapped to their values.
- Return type:
mapping of string to any
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
object
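The nested parameter naming can be illustrated with a short, library-independent sketch that splits keys on the double underscore. The helper `split_nested_params` and the component name `clustering` are hypothetical and only show the naming convention, not how set_params is implemented.

```python
def split_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' style ones."""
    own, nested = {}, {}
    for key, value in params.items():
        component, sep, sub_key = key.partition('__')
        if sep:
            # Key addresses a parameter of a nested component.
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested

own, nested = split_nested_params({
    'n_clusters': 5,
    'clustering__max_iter': 100,
    'clustering__tolerance': 1e-6,
})
print(own)     # {'n_clusters': 5}
print(nested)  # {'clustering': {'max_iter': 100, 'tolerance': 1e-06}}
```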
- transform(data, **kw) → ndarray
Transforms a trajectory to a discrete trajectory by assigning each frame to its respective cluster center.
- Parameters:
data ((T, n) ndarray) – trajectory with T frames and data points in n dimensions.
**kw – ignored kwargs for scikit-learn compatibility
- Returns:
discrete_trajectory – discrete trajectory
- Return type:
(T, 1) ndarray
See also
ClusterModel.transform
transform method of cluster model, implicitly called.
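Conceptually, the assignment step maps each frame to the index of its nearest cluster center. The NumPy sketch below shows this for the Euclidean metric. The function name `assign_to_centers` and the toy data are assumptions for illustration, and the sketch returns a flat array of length T.

```python
import numpy as np

def assign_to_centers(trajectory, centers):
    """Return, for each frame, the index of the closest center (Euclidean)."""
    # Squared distances between all frames and all centers, shape (T, k).
    d2 = ((trajectory[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)

centers = np.array([[0.0, 0.0], [5.0, 5.0]])
trajectory = np.array([[0.1, -0.2], [4.8, 5.1], [0.3, 0.4], [6.0, 4.0]])

dtraj = assign_to_centers(trajectory, centers)
print(dtraj)  # [0 1 0 1]
```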
- property fixed_seed
Seed for the random choice of initial cluster centers.
Fix this to get reproducible results in conjunction with n_jobs=0. The latter is needed because parallel execution reintroduces non-deterministic behaviour.
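The effect of a fixed seed can be sketched with Python's standard library alone: two generators seeded identically make the same random choices, so the initial centers (and hence a single-threaded run) become repeatable. The helper `pick_initial_indices` is hypothetical and does not reproduce deeptime's generator.

```python
import random

def pick_initial_indices(n_points, k, seed):
    """Choose k distinct point indices with a generator seeded by `seed`."""
    rng = random.Random(seed)
    return rng.sample(range(n_points), k)

first = pick_initial_indices(1000, 5, seed=42)
second = pick_initial_indices(1000, 5, seed=42)

# Same seed, same sequence of choices.
print(first == second)  # True
```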
- property has_model: bool
Property reporting whether this estimator contains an estimated model. This assumes that the model is initialized with None otherwise.
- Type:
bool
- property init_strategy
Strategy to get an initial guess for the centers.
- Getter:
Yields the strategy, can be one of “kmeans++” or “uniform”.
- Setter:
Setter for the initialization strategy that is used when no initial centers are provided.
- Type:
string
- property initial_centers: Optional[ndarray]
Yields initial centers which override the init_strategy(). Can be used to resume k-means iterations.
- Getter:
The initial centers or None.
- Setter:
Sets the initial centers. If not None, the array is expected to have length n_clusters.
- Type:
(k, n) ndarray or None
- property max_iter: int
Maximum number of clustering iterations before stop.
- Getter:
Yields the maximum number of clustering iterations
- Setter:
Sets the max. number of clustering iterations
- Type:
int
- property metric: str
The metric that is used for clustering.
See also
_clustering_bindings.Metric
The metric class, can be subclassed
metrics
Metrics registry which maps from metric label to actual implementation
- property model
Shortcut to fetch_model().
- property n_clusters: int
The number of cluster centers to use.
- Getter:
Yields the number of cluster centers.
- Setter:
Sets the number of cluster centers.
- Type:
int
- property n_jobs: int
Number of threads to use during clustering and assignment of data.
- Getter:
Yields the number of threads. If -1, all available threads are used.
- Setter:
Sets the number of threads to use. If -1, use all, if None, use 1.
- Type:
int
- property tolerance: float
Stopping criterion for the k-means iteration. When the relative change of the cost function between two iterations is less than the tolerance, the algorithm is considered to be converged.
- Getter:
Yields the currently set tolerance.
- Setter:
Sets a new tolerance.
- Type:
float
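The stopping criterion can be sketched in plain Python. The precise definition of "relative change" used by the library is not spelled out here, so this example assumes |C_prev - C_curr| / C_prev. The helper `has_converged` and the cost values are hypothetical.

```python
def has_converged(previous_cost, current_cost, tolerance=1e-5):
    """Assumed criterion: relative change of the cost falls below tolerance."""
    if previous_cost == 0.0:
        # A zero cost cannot be improved any further.
        return True
    return abs(previous_cost - current_cost) / previous_cost < tolerance

# Hypothetical cost values recorded after successive iterations.
costs = [100.0, 40.0, 39.0, 38.9999]

# Index of the first iteration at which the criterion is met.
stop_at = next(
    i for i in range(1, len(costs)) if has_converged(costs[i - 1], costs[i])
)
print(stop_at)  # 3
```

If the criterion is never met, the iteration would instead end once max_iter is reached.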