MiniBatchKMeans

simbsig.cluster.MiniBatchKMeans.MiniBatchKMeans(n_clusters=5, metric='euclidean', metric_params=None, feature_weights=None, max_iter=100, tol=0.01, device='cpu', mode='arrays', n_jobs=0, batch_size=None, random_state=None, init='random', alpha=0.95, verbose=True, **kwargs)

KMeans class implementing mini-batch k-means as described by Sculley [1], with batched data loading for large datasets and optional GPU-accelerated computation.

Parameters
  • n_clusters – int, default=5 The number of clusters to form.

  • metric – str or callable, default='euclidean' The distance metric used to quantify similarity between objects. Available metrics are ['euclidean', 'manhattan', 'minkowski', 'fractional', 'cosine', 'mahalanobis']. When metric='precomputed', X must be a precomputed distance matrix, which must be square during fit.

  • metric_params – dict, default=None Additional metric-specific keyword arguments.

  • feature_weights – np.array of floats, default=None Vector of user-defined weights, one per feature. Its length must equal the number of features, n_features_in. If feature_weights=None, uniform weights are applied.

  • max_iter – int, default=100 Maximum number of iterations of the KMeans algorithm. The algorithm may terminate earlier if tol is satisfied.

  • tol – float, default=0.01 Tolerance at which KMeans stops iterating. If the tolerance is not reached within max_iter iterations, the algorithm terminates anyway.

  • device – str, default='cpu' Which device to use for distance computations. Supported options are ['cpu', 'gpu'].

  • mode – str, default='arrays' Whether the input data is in memory (as lists, arrays or tensors) or on disk as hdf5 files. The latter should be favored for big datasets. Supported options are ['arrays', 'hdf5'].

  • n_jobs – int, default=0 Number of jobs active in the torch DataLoader (torch.utils.data.DataLoader).

  • batch_size – int, default=None Size of the data chunks that are processed at once for distance computations. Should be tuned to the dataset when using device='gpu'. If batch_size=None, the entire dataset is loaded and processed at once, which may raise an error when using device='gpu'.

  • random_state – int, default=None The random state for the seed of torch.

  • init – str or array-like, default='random' If 'random', cluster centers are selected uniformly at random from the training set. Alternatively, an array-like of shape (n_clusters, n_features) can be passed, which is used as the initial cluster centers.

  • verbose – bool, default=True Logging information. If True, progression updates are produced.

[1] Sculley, David. “Web-scale k-means clustering.” Proceedings of the 19th international conference on World wide web. 2010.
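
The update rule from [1] can be illustrated in a few lines of plain Python. This is a minimal sketch of Sculley's per-center learning-rate update, not SIMBSIG's implementation: the function names (squared_euclidean, nearest_center, mini_batch_kmeans_sketch) are hypothetical, only the Euclidean metric is handled, and the tol, alpha, feature_weights and device options documented above are deliberately left out.

```python
import random


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: squared_euclidean(point, centers[j]))


def mini_batch_kmeans_sketch(data, n_clusters, batch_size, max_iter, init="random", seed=0):
    """Illustrative mini-batch k-means on a list of points (hypothetical helper)."""
    rng = random.Random(seed)
    if init == "random":
        # Pick distinct training points as initial centers.
        centers = [list(p) for p in rng.sample(data, n_clusters)]
    else:
        # Use the user-supplied (n_clusters, n_features) initialization.
        centers = [list(c) for c in init]
    counts = [0] * n_clusters  # number of points each center has absorbed so far
    for _ in range(max_iter):
        batch = rng.sample(data, batch_size)
        # Cache all assignments before moving any center.
        assignments = [nearest_center(p, centers) for p in batch]
        for p, j in zip(batch, assignments):
            counts[j] += 1
            lr = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1.0 - lr) * c + lr * x for c, x in zip(centers[j], p)]
    return centers


# Two well-separated synthetic blobs around (0, 0) and (10, 10).
gen = random.Random(42)
blob_a = [[gen.uniform(-0.5, 0.5), gen.uniform(-0.5, 0.5)] for _ in range(100)]
blob_b = [[gen.uniform(9.5, 10.5), gen.uniform(9.5, 10.5)] for _ in range(100)]
centers = mini_batch_kmeans_sketch(
    blob_a + blob_b,
    n_clusters=2,
    batch_size=20,
    max_iter=50,
    init=[[0.0, 0.0], [10.0, 10.0]],
)
```

Because every update is a convex combination of the old center and a data point, each center stays inside the range spanned by the points assigned to it, so in this toy setup the two centers remain within their respective blobs.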

simbsig.cluster.MiniBatchKMeans.MiniBatchKMeans.fit(self, X, y=None)

Performs the MiniBatchKMeans algorithm with settings passed during init.

Parameters
  • X – array-like or h5py file handle. Training data of shape (n_samples, n_features), or (n_samples, n_samples) if metric='precomputed'.

  • y – Ignored. Only present by convention.

Returns

self – MiniBatchKMeans The MiniBatchKMeans object with computed cluster centers.

simbsig.cluster.MiniBatchKMeans.MiniBatchKMeans.predict(self, X)

Predicts, for data of the same dimension as the training data, the cluster center to which each point belongs.

Parameters

X – array-like or h5py file handle. Test data of shape (n_samples, n_features).

Returns

clusters – array of integers The index of the assigned cluster center for each point, of shape (n_samples,).
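
The assignment step behind predict can be sketched as a nearest-center lookup. The helper below (predict_sketch, a hypothetical name) assumes the Euclidean metric and plain Python lists; the library itself supports further metrics and input types, so treat this only as an illustration of the shape of the result.

```python
def predict_sketch(points, centers):
    """Return, for each point, the index of its nearest center (Euclidean)."""
    labels = []
    for p in points:
        # Squared distances from this point to every center.
        dists = [sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers]
        labels.append(dists.index(min(dists)))
    return labels


example_centers = [[0.0, 0.0], [10.0, 10.0]]
example_points = [[1.0, -1.0], [9.0, 11.0], [0.5, 0.5]]
labels = predict_sketch(example_points, example_centers)  # one integer per input point
```

The result has one entry per sample, matching the documented output shape (n_samples,).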

simbsig.cluster.MiniBatchKMeans.MiniBatchKMeans.fit_predict(self, X, y=None)

Performs fit (the MiniBatchKMeans algorithm with the settings passed during init) followed by predict (assigning each point to the cluster center it belongs to) on the data X.

Parameters

X – array-like or h5py file handle. Training data of shape (n_samples, n_features).

Returns

clusters – array of integers The index of the assigned cluster center for each point, of shape (n_samples,).