Utilities

Data utilities

This module contains helper functions for data loading and processing.

class cellarium.ml.utilities.data.AnnDataField(attr: str, key: list[str] | str | None = None, convert_fn: Callable[[Any], ndarray] | None = None)

Bases: object

Helper class for accessing fields of an AnnData-like object.

Example:

>>> from cellarium.ml.data import DistributedAnnDataCollection
>>> from cellarium.ml.utilities.data import AnnDataField, densify

>>> dadc = DistributedAnnDataCollection(
...     "gs://bucket-name/folder/adata{000..005}.h5ad",
...     shard_size=10_000,
...     max_cache_size=2)

>>> adata = dadc[:100]
>>> field_X = AnnDataField(attr="X", convert_fn=densify)
>>> X = field_X(adata)  # densify(adata.X)

>>> field_total_mrna_umis = AnnDataField(attr="obs", key="total_mrna_umis")
>>> total_mrna_umis = field_total_mrna_umis(adata)  # np.asarray(adata.obs["total_mrna_umis"])

Parameters:

attr (str) – The attribute of the AnnData-like object to access.
key (list[str] | str | None) – The key of the attribute to access. If None, the entire attribute is returned.
convert_fn (Callable[[Any], ndarray] | None) – A function to apply to the attribute before returning it. If None, np.asarray() is used.

cellarium.ml.utilities.data.collate_fn(batch: list[dict[str, dict[str, ndarray] | ndarray]]) → dict[str, dict[str, ndarray | Tensor] | ndarray | Tensor]

Collate function for the DataLoader. It assumes that batch is a list of dictionaries that all have the same keys. If a key ends with _g or _categories, its value is checked to be the same across all dictionaries in the batch and is then taken from the first dictionary. Otherwise, the values for that key are concatenated along the first dimension. Finally, values that are not strings are converted to torch.Tensor and returned in a dictionary.

Parameters: batch (list[dict[str, dict[str, ndarray] | ndarray]]) – List of dictionaries.
Returns: Dictionary with the same keys as the input dictionaries, but with values concatenated along the batch dimension.
Return type: dict[str, dict[str, ndarray | Tensor] | ndarray | Tensor]
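
The collation rules described above can be sketched with NumPy alone. This is an illustrative re-implementation, not the library's code: it handles only flat dictionaries, skips the final conversion to torch.Tensor, and the function name collate_sketch and the keys x_ng and var_names_g are hypothetical.

```python
import numpy as np


def collate_sketch(batch):
    """Illustrative version of the collation rules (flat dicts, no tensors)."""
    collated = {}
    for key in batch[0]:
        values = [sample[key] for sample in batch]
        if key.endswith("_g") or key.endswith("_categories"):
            # Shared metadata: must be identical in every sample; keep one copy.
            if not all(np.array_equal(values[0], v) for v in values[1:]):
                raise ValueError(f"Values for key {key!r} differ across the batch.")
            collated[key] = values[0]
        else:
            # Per-sample data: concatenate along the first (batch) dimension.
            collated[key] = np.concatenate(values, axis=0)
    return collated


# Hypothetical batch: two samples with 2 and 4 rows over the same 3 features.
batch = [
    {"x_ng": np.ones((2, 3)), "var_names_g": np.array(["a", "b", "c"])},
    {"x_ng": np.zeros((4, 3)), "var_names_g": np.array(["a", "b", "c"])},
]
out = collate_sketch(batch)
print(out["x_ng"].shape)   # (6, 3)
print(out["var_names_g"])  # ['a' 'b' 'c']
```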

cellarium.ml.utilities.data.densify(x: csr_matrix) → ndarray

Convert a sparse matrix to a dense matrix.

Parameters: x (csr_matrix) – Sparse matrix.
Returns: Dense matrix.
Return type: ndarray
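
As a rough illustration, the same conversion can be done directly in SciPy with toarray(). Whether densify is implemented exactly this way is an assumption; the snippet only shows the kind of input and output involved.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small sparse matrix with two non-zero entries.
sparse = csr_matrix(np.array([[0, 2, 0], [1, 0, 0]]))

# Standard SciPy sparse-to-dense conversion (assumed equivalent to densify).
dense = sparse.toarray()
print(type(dense).__name__, dense.shape)  # ndarray (2, 3)
```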

cellarium.ml.utilities.data.categories_to_codes(x: Series | DataFrame) → ndarray

Convert a pandas Series or DataFrame of categorical data to a numpy array of codes. The returned array is always a copy.

Parameters: x (Series | DataFrame) – Pandas Series object or a pandas DataFrame containing multiple categorical Series.
Returns: Numpy array.
Return type: ndarray

cellarium.ml.utilities.data.get_categories(x: Series) → ndarray

Get the categories of a pandas Series object.

Parameters: x (Series) – Pandas Series object.
Returns: Numpy array.
Return type: ndarray
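
The relationship between codes and categories can be shown with plain pandas. The snippet below does not call categories_to_codes or get_categories; it uses the pandas categorical accessors that such helpers would plausibly wrap, which is an assumption. The cell_type series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column, as found in adata.obs.
cell_type = pd.Series(["T cell", "B cell", "T cell"], dtype="category")

# Integer code per row; copy=True mirrors the documented copy guarantee.
codes = cell_type.cat.codes.to_numpy(copy=True)
# Unique labels, ordered so that categories[code] recovers the label.
categories = np.asarray(cell_type.cat.categories)

print(codes.tolist())              # [1, 0, 1]
print(categories.tolist())         # ['B cell', 'T cell']
print(categories[codes].tolist())  # ['T cell', 'B cell', 'T cell']
```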

Distributed utilities

This module contains helper functions for distributed training.

class cellarium.ml.utilities.distributed.GatherLayer(*args, **kwargs)

Bases: Function

Gather tensors from all processes, supporting backward propagation.

cellarium.ml.utilities.distributed.get_rank_and_num_replicas() → tuple[int, int]

This helper function returns the rank of the current process and the number of processes in the default process group. If the distributed package is not available or the default process group has not been initialized, it returns rank=0 and num_replicas=1.

Returns: Tuple of rank and num_replicas.
Return type: tuple[int, int]
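
A minimal sketch of the fallback logic described above, written against the public torch.distributed API. The function name rank_and_num_replicas_sketch is hypothetical, and the ImportError branch is an addition so that the snippet also runs where PyTorch is not installed; it is not part of the documented behavior.

```python
def rank_and_num_replicas_sketch():
    """Return (rank, num_replicas), falling back to (0, 1) when not distributed."""
    try:
        import torch.distributed as dist
    except ImportError:
        # Assumption for this sketch only: treat a missing PyTorch as non-distributed.
        return 0, 1
    if not dist.is_available() or not dist.is_initialized():
        # No distributed package or no default process group: single replica.
        return 0, 1
    return dist.get_rank(), dist.get_world_size()


rank, num_replicas = rank_and_num_replicas_sketch()
print(rank, num_replicas)  # 0 1 in a plain, non-distributed Python session
```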

cellarium.ml.utilities.distributed.get_worker_info() → tuple[int, int]

This helper function returns worker_id and num_workers. When called from the main process, it returns worker_id=0 and num_workers=1.

Returns: Tuple of worker_id and num_workers.
Return type: tuple[int, int]
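
The main-process fallback can be sketched on top of torch.utils.data.get_worker_info, which returns None outside a DataLoader worker. The name worker_info_sketch is hypothetical, and the ImportError branch exists only so the snippet runs without PyTorch.

```python
def worker_info_sketch():
    """Return (worker_id, num_workers), falling back to (0, 1) in the main process."""
    try:
        from torch.utils.data import get_worker_info
    except ImportError:
        # Assumption for this sketch only: no PyTorch means no worker processes.
        return 0, 1
    info = get_worker_info()  # None when not inside a DataLoader worker
    if info is None:
        return 0, 1
    return info.id, info.num_workers


worker_id, num_workers = worker_info_sketch()
print(worker_id, num_workers)  # 0 1 when run in the main process
```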

Testing utilities

This module contains helper functions for testing.

cellarium.ml.utilities.testing.assert_positive(name: str, number: float) → None

Assert that a number is positive.

Parameters:

name (str) – The name of the number.
number (float) – The number to check.

Raises:

ValueError – If the number is not positive.

Return type:

None

cellarium.ml.utilities.testing.assert_nonnegative(name: str, number: float) → None

Assert that a number is non-negative.

Parameters:

name (str) – The name of the number.
number (float) – The number to check.

Raises:

ValueError – If the number is negative.

Return type:

None
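
The difference between the positive and non-negative checks shows up only at zero. The sketch below mimics the documented behavior (a ValueError on failure); the function names, error messages, and argument names lr and weight_decay are illustrative and not taken from the library.

```python
def assert_positive_sketch(name: str, number: float) -> None:
    # Zero is rejected: only strictly positive values pass.
    if number <= 0:
        raise ValueError(f"`{name}` must be positive, got {number}.")


def assert_nonnegative_sketch(name: str, number: float) -> None:
    # Zero is accepted: only strictly negative values are rejected.
    if number < 0:
        raise ValueError(f"`{name}` must be non-negative, got {number}.")


assert_nonnegative_sketch("weight_decay", 0.0)  # passes silently

message = None
try:
    assert_positive_sketch("lr", 0.0)  # raises, since zero is not positive
except ValueError as err:
    message = str(err)
print(message)
```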

cellarium.ml.utilities.testing.assert_columns_and_array_lengths_equal(matrix_name: str, matrix: ndarray | Tensor, array_name: str, array: ndarray | Tensor) → None

Assert that the number of columns in a matrix matches the length of an array.

Parameters:

matrix_name (str) – The name of the matrix.
matrix (ndarray | Tensor) – The matrix.
array_name (str) – The name of the array.
array (ndarray | Tensor) – The array.

Raises:

ValueError – If the number of columns in the matrix does not match the length of the array.

Return type:

None
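
A sketch of this check, assuming the matrix is two-dimensional so that its column count is shape[1]. The function name, the error message, and the x_ng and var_names_g labels are hypothetical.

```python
import numpy as np


def assert_columns_match_sketch(matrix_name, matrix, array_name, array):
    # Compare the number of matrix columns with the array length.
    n_cols = matrix.shape[1]
    if n_cols != len(array):
        raise ValueError(
            f"`{matrix_name}` has {n_cols} columns, "
            f"but `{array_name}` has length {len(array)}."
        )


counts = np.zeros((4, 3))  # 4 rows, 3 columns
assert_columns_match_sketch("x_ng", counts, "var_names_g", np.arange(3))  # passes

mismatch_detected = False
try:
    assert_columns_match_sketch("x_ng", counts, "var_names_g", np.arange(2))
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```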

cellarium.ml.utilities.testing.assert_arrays_equal(a1_name: str, a1: ndarray, a2_name: str, a2: ndarray) → None

Assert that two arrays are equal.

Parameters:

a1_name (str) – The name of the first array.
a1 (ndarray) – The first array.
a2_name (str) – The name of the second array.
a2 (ndarray) – The second array.

Raises:

ValueError – If the arrays are not equal.

Return type:

None
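
One plausible reading of "equal" here is exact element-wise equality with matching shapes, which is what numpy.array_equal tests. The sketch below adopts that reading as an assumption; the function name and error message are illustrative.

```python
import numpy as np


def assert_arrays_equal_sketch(a1_name, a1, a2_name, a2):
    # np.array_equal is True only for identical shapes and identical elements.
    if not np.array_equal(a1, a2):
        raise ValueError(f"`{a1_name}` and `{a2_name}` must be equal.")


genes = np.array(["g1", "g2", "g3"])
assert_arrays_equal_sketch("expected", genes, "actual", genes.copy())  # passes

order_mismatch = False
try:
    # Same elements in a different order are not considered equal.
    assert_arrays_equal_sketch("expected", genes, "actual", genes[::-1])
except ValueError:
    order_mismatch = True
print(order_mismatch)  # True
```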

cellarium.ml.utilities.testing.assert_slope_equals(data: Series, slope: float, loglog: bool = False, atol: float = 0.0001) → None

Assert that the slope of a series is equal to a given value.

Parameters:

data (Series) – The pandas.Series object to check.
slope (float) – Expected slope.
loglog (bool) – Whether to use log-log scale.
atol (float) – The absolute tolerance.

Raises:

ValueError – If the slope is not equal to the given value.

Return type:

None
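
The description does not say how the slope is computed. The sketch below assumes the Series index supplies the x values, the Series values supply y, and the slope comes from a least-squares line fit, with logarithms taken of both axes when loglog is set. All of this, including the helper name fitted_slope and the example widths, is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def fitted_slope(data: pd.Series, loglog: bool = False) -> float:
    # Assumption: x comes from the index, y from the values.
    x = data.index.to_numpy(dtype=float)
    y = data.to_numpy(dtype=float)
    if loglog:
        x, y = np.log(x), np.log(y)
    # Degree-1 least-squares fit; the first coefficient is the slope.
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)


# A quantity that scales as width ** -0.5 has slope -0.5 on a log-log scale.
widths = [64, 128, 256, 512]
data = pd.Series([w ** -0.5 for w in widths], index=widths)
slope = fitted_slope(data, loglog=True)
print(abs(slope - (-0.5)) <= 1e-4)  # True
```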

cellarium.ml.utilities.testing.get_coord_data(models: dict[int, Callable[[], LightningModule]], layer_name_to_multiplier_name: dict[str, str], train_loader: DataLoader, nsteps: int, nseeds: int) → DataFrame

Get coordinate data for a model.

Parameters:

models (dict[int, Callable[[], LightningModule]]) – A dictionary mapping width to a function that returns a model.
layer_name_to_multiplier_name (dict[str, str]) – A dictionary mapping layer names to their corresponding multipliers.
train_loader (DataLoader) – The training data loader.
nsteps (int) – The number of steps to train for.
nseeds (int) – The number of seeds to use.

Returns:

A pandas.DataFrame containing the coordinate data.

Return type:

DataFrame