Utilities

Data utilities

This module contains helper functions for data loading and processing.

class cellarium.ml.utilities.data.AnnDataField(attr: str, key: str | None = None, convert_fn: Callable[[Any], ndarray] | None = None)[source]

Bases: object

Helper class for accessing fields of an AnnData-like object.

Example:

>>> from cellarium.ml.data import DistributedAnnDataCollection
>>> from cellarium.ml.utilities.data import AnnDataField, densify

>>> dadc = DistributedAnnDataCollection(
...     "gs://bucket-name/folder/adata{000..005}.h5ad",
...     shard_size=10_000,
...     max_cache_size=2)

>>> adata = dadc[:100]
>>> field_X = AnnDataField(attr="X", convert_fn=densify)
>>> X = field_X(adata)  # densify(adata.X)

>>> field_total_mrna_umis = AnnDataField(attr="obs", key="total_mrna_umis")
>>> total_mrna_umis = field_total_mrna_umis(adata)  # np.asarray(adata.obs["total_mrna_umis"])

Parameters:
  • attr (str) – The attribute of the AnnData-like object to access.

  • key (str | None) – The key of the attribute to access. If None, the entire attribute is returned.

  • convert_fn (Callable[[Any], ndarray] | None) – A function to apply to the attribute before returning it. If None, np.asarray() is used.
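
The accessor behavior described above can be sketched without cellarium installed. This is a minimal stand-in written from the parameter descriptions, not the library's implementation: `FieldSketch` and the `SimpleNamespace` object playing the role of an AnnData are hypothetical names used only for illustration.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional

import numpy as np


class FieldSketch:
    """Hypothetical stand-in mimicking the documented AnnDataField behavior."""

    def __init__(
        self,
        attr: str,
        key: Optional[str] = None,
        convert_fn: Optional[Callable[[Any], np.ndarray]] = None,
    ) -> None:
        self.attr = attr
        self.key = key
        self.convert_fn = convert_fn

    def __call__(self, adata: Any) -> np.ndarray:
        value = getattr(adata, self.attr)  # e.g. adata.obs
        if self.key is not None:
            value = value[self.key]  # e.g. adata.obs["total_mrna_umis"]
        # Fall back to np.asarray when no converter is given.
        convert = self.convert_fn if self.convert_fn is not None else np.asarray
        return convert(value)


# A plain namespace stands in for the AnnData-like object (assumption).
fake_adata = SimpleNamespace(obs={"total_mrna_umis": [10, 20, 30]})
total_umis = FieldSketch(attr="obs", key="total_mrna_umis")(fake_adata)
```

Because the field is configured once and called later, the same field object can be reused on every slice drawn from a collection.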

cellarium.ml.utilities.data.collate_fn(batch: list[dict[str, ndarray]]) → dict[str, ndarray | Tensor][source]

Collate function for the DataLoader. It assumes that the batch is a list of dictionaries that all share the same keys. If a key ends with _g or _categories, its value is checked to be identical across all dictionaries in the batch and is then taken from the first dictionary. Otherwise, the values for that key are concatenated along the first dimension. Finally, all values that are not strings are converted to torch.Tensor, and the result is returned as a dictionary.

Parameters:

batch (list[dict[str, ndarray]]) – List of dictionaries.

Returns:

Dictionary with the same keys as the input dictionaries, but with values concatenated along the batch dimension.

Return type:

dict[str, ndarray | Tensor]
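
The collation rules above can be restated as a small NumPy-only sketch. It is an illustration under assumptions, not the library's code: `collate_sketch` is a hypothetical name, the keys `x_ng` and `var_names_g` are made up for the example, and the final conversion to `torch.Tensor` is deliberately left out so that the block runs without PyTorch.

```python
import numpy as np


def collate_sketch(batch: list[dict[str, np.ndarray]]) -> dict[str, np.ndarray]:
    """Illustrative restatement of the documented rules (no tensor conversion)."""
    collated = {}
    for key in batch[0]:
        values = [item[key] for item in batch]
        if key.endswith("_g") or key.endswith("_categories"):
            # Shared metadata: must be identical in every dictionary.
            for other in values[1:]:
                if not np.array_equal(values[0], other):
                    raise ValueError(f"Values for {key!r} differ within the batch.")
            collated[key] = values[0]
        else:
            # Per-example data: concatenate along the first dimension.
            collated[key] = np.concatenate(values, axis=0)
    return collated


batch = [
    {"x_ng": np.ones((2, 3)), "var_names_g": np.array(["a", "b", "c"])},
    {"x_ng": np.zeros((1, 3)), "var_names_g": np.array(["a", "b", "c"])},
]
out = collate_sketch(batch)
```

In this sketch `out["x_ng"]` has three rows (two plus one), while `out["var_names_g"]` is the single shared array.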

cellarium.ml.utilities.data.densify(x: csr_matrix) → ndarray[source]

Convert a sparse matrix to a dense matrix.

Parameters:

x (csr_matrix) – Sparse matrix.

Returns:

Dense matrix.

Return type:

ndarray
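
For orientation, SciPy's `csr_matrix.toarray()` performs this kind of sparse-to-dense conversion. Whether `densify` calls it internally is an assumption; the snippet only shows what the input and output look like.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 2 x 2 sparse matrix with two stored entries.
sparse = csr_matrix(np.array([[0, 1], [2, 0]]))

# Dense ndarray with explicit zeros filled in.
dense = sparse.toarray()
```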

cellarium.ml.utilities.data.categories_to_codes(x: Series) → ndarray[source]

Convert a pandas Series of categorical data to a NumPy array of codes. The returned array is always a copy.

Parameters:

x (Series) – Pandas Series object.

Returns:

NumPy array of category codes.

Return type:

ndarray
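
As a rough illustration of what "codes" means here, the snippet below uses the pandas categorical accessor directly. It is a sketch of the concept, assuming the function is equivalent to copying `Series.cat.codes`; the `cell_type` data is invented.

```python
import numpy as np
import pandas as pd

cell_type = pd.Series(["B", "T", "B", "NK"], dtype="category")

# Each value is replaced by the integer position of its category.
# Categories are sorted here as ["B", "NK", "T"].
codes = np.asarray(cell_type.cat.codes).copy()
```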

cellarium.ml.utilities.data.get_categories(x: Series) → ndarray[source]

Get the categories of a pandas Series object.

Parameters:

x (Series) – Pandas Series object.

Returns:

NumPy array of categories.

Return type:

ndarray
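
The categories are the lookup table that integer codes index into. The sketch below uses the pandas accessor `Series.cat.categories` directly, on the assumption that the function returns the same values as an array; the `cell_type` data is invented.

```python
import numpy as np
import pandas as pd

cell_type = pd.Series(["B", "T", "B", "NK"], dtype="category")

# The distinct categories, in category order.
categories = np.asarray(cell_type.cat.categories)

# Indexing the categories with the codes recovers the original labels.
decoded = categories[cell_type.cat.codes.to_numpy()]
```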

Distributed utilities

This module contains helper functions for distributed training.

class cellarium.ml.utilities.distributed.GatherLayer(*args, **kwargs)[source]

Bases: Function

Gather tensors from all processes, supporting backward propagation.

cellarium.ml.utilities.distributed.get_rank_and_num_replicas() → tuple[int, int][source]

This helper function returns the rank of the current process and the number of processes in the default process group. If the distributed package is not available or the default process group has not been initialized, it returns rank=0 and num_replicas=1.

Returns:

Tuple of rank and num_replicas.

Return type:

tuple[int, int]
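
The fallback logic described above might look roughly like the following. This is a sketch built on public `torch.distributed` calls, not the library's source; `rank_and_num_replicas_sketch` is a hypothetical name, and the guarded import is there only so that the block also runs where PyTorch is not installed.

```python
try:
    import torch.distributed as dist
except ImportError:  # PyTorch is optional for this sketch
    dist = None


def rank_and_num_replicas_sketch() -> tuple[int, int]:
    # Without a usable, initialized process group, act as a single process.
    if dist is None or not dist.is_available() or not dist.is_initialized():
        return 0, 1
    return dist.get_rank(), dist.get_world_size()


rank, num_replicas = rank_and_num_replicas_sketch()
```

In a plain single-process script no process group exists, so the fallback branch is taken.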

cellarium.ml.utilities.distributed.get_worker_info() → tuple[int, int][source]

This helper function returns worker_id and num_workers. If it is running in the main process, it returns worker_id=0 and num_workers=1.

Returns:

Tuple of worker_id and num_workers.

Return type:

tuple[int, int]
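
A comparable sketch for worker information is shown below. It assumes the helper wraps `torch.utils.data.get_worker_info()`, which returns `None` outside a DataLoader worker process; `worker_info_sketch` is a hypothetical name, and the guarded import only keeps the block runnable without PyTorch.

```python
try:
    from torch.utils.data import get_worker_info as torch_get_worker_info
except ImportError:  # PyTorch is optional for this sketch
    torch_get_worker_info = None


def worker_info_sketch() -> tuple[int, int]:
    info = torch_get_worker_info() if torch_get_worker_info is not None else None
    if info is None:
        # Main process: behave like a single worker.
        return 0, 1
    return info.id, info.num_workers


worker_id, num_workers = worker_info_sketch()
```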

Testing utilities

This module contains helper functions for testing.

cellarium.ml.utilities.testing.assert_positive(name: str, number: float)[source]

Assert that a number is positive.

Parameters:
  • name (str) – The name of the number.

  • number (float) – The number to check.

Raises:

ValueError – If the number is not positive.

cellarium.ml.utilities.testing.assert_nonnegative(name: str, number: float)[source]

Assert that a number is non-negative.

Parameters:
  • name (str) – The name of the number.

  • number (float) – The number to check.

Raises:

ValueError – If the number is negative.
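
The positive and non-negative checks differ only in how they treat zero. The stand-ins below illustrate that contract; `check_positive` and `check_nonnegative` are hypothetical names, and the error message wording is an assumption rather than the library's text.

```python
def check_positive(name: str, number: float) -> None:
    if number <= 0:
        raise ValueError(f"`{name}` must be positive, got {number}.")


def check_nonnegative(name: str, number: float) -> None:
    if number < 0:
        raise ValueError(f"`{name}` must be non-negative, got {number}.")


check_positive("lr", 0.1)               # passes
check_nonnegative("weight_decay", 0.0)  # zero is allowed here

try:
    check_positive("lr", 0.0)           # zero is rejected here
    zero_rejected = False
except ValueError:
    zero_rejected = True
```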

cellarium.ml.utilities.testing.assert_columns_and_array_lengths_equal(matrix_name: str, matrix: ndarray | Tensor, array_name: str, array: ndarray | Tensor)[source]

Assert that the number of columns in a matrix matches the length of an array.

Parameters:
  • matrix_name (str) – The name of the matrix.

  • matrix (ndarray | Tensor) – The matrix.

  • array_name (str) – The name of the array.

  • array (ndarray | Tensor) – The array.

Raises:

ValueError – If the number of columns in the matrix does not match the length of the array.
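
A shape check of this kind can be sketched with NumPy alone. The function name `check_columns_match`, the example names `x_ng` and `var_names_g`, and the message text are assumptions made for illustration.

```python
import numpy as np


def check_columns_match(
    matrix_name: str, matrix: np.ndarray, array_name: str, array: np.ndarray
) -> None:
    # Compare the second dimension of the matrix with the array length.
    if matrix.shape[1] != len(array):
        raise ValueError(
            f"`{matrix_name}` has {matrix.shape[1]} columns but "
            f"`{array_name}` has length {len(array)}."
        )


x_ng = np.zeros((4, 3))
check_columns_match("x_ng", x_ng, "var_names_g", np.array(["a", "b", "c"]))

try:
    check_columns_match("x_ng", x_ng, "var_names_g", np.array(["a", "b"]))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```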

cellarium.ml.utilities.testing.assert_arrays_equal(a1_name: str, a1: ndarray, a2_name: str, a2: ndarray)[source]

Assert that two arrays are equal.

Parameters:
  • a1_name (str) – The name of the first array.

  • a1 (ndarray) – The first array.

  • a2_name (str) – The name of the second array.

  • a2 (ndarray) – The second array.

Raises:

ValueError – If the arrays are not equal.

cellarium.ml.utilities.testing.assert_slope_equals(data: Series, slope: float, loglog: bool = False, atol: float = 0.0001)[source]

Assert that the slope of a series is equal to a given value.

Parameters:
  • data (Series) – The pandas.Series object to check.

  • slope (float) – Expected slope.

  • loglog (bool) – Whether to use log-log scale.

  • atol (float) – The absolute tolerance.

Raises:

ValueError – If the slope is not equal to the given value.
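
One plausible reading of this check is a least-squares line fit of the series values against the series index, optionally after taking logarithms of both. The sketch below follows that reading using `np.polyfit`; the fitting method, the use of the index as the x-axis, and the name `slope_sketch` are all assumptions, not confirmed details of the library.

```python
import numpy as np
import pandas as pd


def slope_sketch(data: pd.Series, loglog: bool = False) -> float:
    x = data.index.to_numpy(dtype=float)  # assumption: the index is the x-axis
    y = data.to_numpy(dtype=float)
    if loglog:
        x, y = np.log(x), np.log(y)
    # Slope of the least-squares line through (x, y).
    return float(np.polyfit(x, y, deg=1)[0])


# y = x**2 is a straight line with slope 2 on a log-log scale.
data = pd.Series([1.0, 4.0, 9.0, 16.0], index=[1, 2, 3, 4])
slope = slope_sketch(data, loglog=True)
within_tolerance = abs(slope - 2.0) <= 1e-4
```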

cellarium.ml.utilities.testing.record_out_coords(records: list[dict], width: int, name: str, t: int) → Callable[[Module, Tensor, Tensor], None][source]

Returns a hook to record layer output coordinate size.

Parameters:
  • records (list[dict]) – The list of records to append to.

  • width (int) – The width of the model.

  • name (str) – The name of the layer.

  • t (int) – The time step.

Returns:

A hook to record layer output coordinate size.

Return type:

Callable[[Module, Tensor, Tensor], None]

cellarium.ml.utilities.testing.get_coord_data(models: dict[int, Callable[[], Module]], train_loader: DataLoader, loss_fn: Callable[[Tensor, Tensor], Tensor], optim_fn: type[Optimizer], lr: float, nsteps: int, nseeds: int) → DataFrame[source]

Get coordinate data for models of different widths.

Parameters:
  • models (dict[int, Callable[[], Module]]) – A dictionary mapping width to a function that returns a model.

  • train_loader (DataLoader) – The training data loader.

  • loss_fn (Callable[[Tensor, Tensor], Tensor]) – The loss function.

  • optim_fn (type[Optimizer]) – The optimizer class.

  • lr (float) – The learning rate.

  • nsteps (int) – The number of steps to train for.

  • nseeds (int) – The number of seeds to use.

Returns:

A pandas.DataFrame containing the coordinate data.

Return type:

DataFrame