Utilities

Data utilities

This module contains helper functions for data loading and processing.

class cellarium.ml.utilities.data.AnnDataField(attr: str, key: list[str] | str | None = None, convert_fn: Callable[[Any], ndarray] | None = None)[source]

Bases: object

Helper class for accessing fields of an AnnData-like object.

Example:

>>> from cellarium.ml.data import DistributedAnnDataCollection
>>> from cellarium.ml.utilities.data import AnnDataField, densify

>>> dadc = DistributedAnnDataCollection(
...     "gs://bucket-name/folder/adata{000..005}.h5ad",
...     shard_size=10_000,
...     max_cache_size=2)

>>> adata = dadc[:100]
>>> field_X = AnnDataField(attr="X", convert_fn=densify)
>>> X = field_X(adata)  # densify(adata.X)

>>> field_total_mrna_umis = AnnDataField(attr="obs", key="total_mrna_umis")
>>> total_mrna_umis = field_total_mrna_umis(adata)  # np.asarray(adata.obs["total_mrna_umis"])

Parameters:
  • attr (str) – The attribute of the AnnData-like object to access.

  • key (list[str] | str | None) – The key of the attribute to access. If None, the entire attribute is returned.

  • convert_fn (Callable[[Any], ndarray] | None) – A function to apply to the attribute before returning it. If None, np.asarray() is used.
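
The access pattern described by these parameters can be illustrated without AnnData installed. The snippet below is a minimal sketch, not the library's implementation: access_field and the SimpleNamespace stand-in are hypothetical names, and the toy object only mimics the X and obs attributes.

```python
from operator import attrgetter
from types import SimpleNamespace

import numpy as np


def access_field(adata, attr, key=None, convert_fn=None):
    """Hypothetical stand-in for calling an AnnDataField on ``adata``."""
    value = attrgetter(attr)(adata)      # e.g. adata.X or adata.obs
    if key is not None:
        value = value[key]               # e.g. adata.obs["total_mrna_umis"]
    if convert_fn is None:
        convert_fn = np.asarray          # documented default conversion
    return convert_fn(value)


# Toy AnnData-like object: 3 cells x 2 genes (assumed data, for illustration only).
fake_adata = SimpleNamespace(
    X=[[1, 0], [0, 2], [3, 4]],
    obs={"total_mrna_umis": [1, 2, 7]},
)

X = access_field(fake_adata, attr="X")
total_mrna_umis = access_field(fake_adata, attr="obs", key="total_mrna_umis")
```

With key=None the whole attribute is converted; with a key, only the selected entry is.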

cellarium.ml.utilities.data.collate_fn(batch: list[dict[str, dict[str, ndarray] | ndarray]]) dict[str, dict[str, ndarray | Tensor] | ndarray | Tensor][source]

Collate function for the DataLoader. This function assumes that the batch is a list of dictionaries that all have the same keys. If a key ends with _g or _categories, its value is checked to be identical across all dictionaries in the batch and is then taken from the first dictionary. Otherwise, the values for that key are concatenated along the first dimension. Values that are not strings are then converted to torch.Tensor, and the result is returned as a dictionary.

Parameters:

batch (list[dict[str, dict[str, ndarray] | ndarray]]) – List of dictionaries.

Returns:

Dictionary with the same keys as the input dictionaries, but with values concatenated along the batch dimension.

Return type:

dict[str, dict[str, ndarray | Tensor] | ndarray | Tensor]
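
The collation rules can be sketched with NumPy alone. This is an illustration under simplifying assumptions, not the library code: collate_sketch is a hypothetical name, nested dictionaries are not handled, and the final conversion to torch.Tensor is omitted.

```python
import numpy as np


def collate_sketch(batch):
    """Illustrative NumPy-only version of the documented collation rules."""
    collated = {}
    for key in batch[0]:
        values = [sample[key] for sample in batch]
        if key.endswith("_g") or key.endswith("_categories"):
            # Shared metadata: must be identical in every sample, keep one copy.
            if not all(np.array_equal(values[0], v) for v in values[1:]):
                raise ValueError(f"Values for {key!r} differ across the batch.")
            collated[key] = values[0]
        else:
            # Per-cell data: concatenate along the first (batch) dimension.
            collated[key] = np.concatenate(values, axis=0)
    return collated


batch = [
    {"x_ng": np.ones((2, 3)), "var_names_g": np.array(["a", "b", "c"])},
    {"x_ng": np.zeros((4, 3)), "var_names_g": np.array(["a", "b", "c"])},
]
out = collate_sketch(batch)
```

Here x_ng grows to six rows, while var_names_g is kept once because it describes the gene axis shared by every sample.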

cellarium.ml.utilities.data.densify(x: csr_matrix) ndarray[source]

Convert a sparse matrix to a dense matrix.

Parameters:

x (csr_matrix) – Sparse matrix.

Returns:

Dense matrix.

Return type:

ndarray

cellarium.ml.utilities.data.series_to_str_list(x: Series) list[str][source]

Convert a pandas Series of strings to a list of strings.

Parameters:

x (Series) – Pandas Series object.

Returns:

List of strings.

Return type:

list[str]

cellarium.ml.utilities.data.keep_sparse(x: spmatrix) spmatrix[source]

Identity function for scipy sparse matrices.

Use as convert_fn for AnnDataField when the sparse matrix should remain sparse inside the dataloader worker. A Filter cpu_transform will then filter the columns and convert the result to torch.sparse_csr_tensor, which keeps the transferred data volume small before it reaches the main process and the PCIe bus.

Parameters:

x (spmatrix) – Sparse matrix.

Returns:

The same sparse matrix, unchanged.

Return type:

spmatrix

cellarium.ml.utilities.data.to_torch_sparse_csr(x: spmatrix) Tensor[source]

Convert a scipy sparse matrix to a torch.sparse_csr_tensor (float32, CPU).

Use as convert_fn for AnnDataField when no Filter cpu_transform is in the pipeline and the full (unfiltered) gene set should still be transferred sparsely. The resulting torch.sparse_csr_tensor is placed in shared memory by dataloader workers for zero-copy transfer to the main process, then moved to GPU and densified by Densify.

Parameters:

x (spmatrix) – Sparse matrix. Converted to CSR format if not already.

Returns:

A torch.sparse_csr_tensor on CPU.

Return type:

Tensor

cellarium.ml.utilities.data.categories_to_codes(x: Series | DataFrame) ndarray[source]

Convert a pandas Series or DataFrame of categorical data to a numpy array of codes. Returned array is always a copy.

Parameters:

x (Series | DataFrame) – Pandas Series object or a pandas DataFrame containing multiple categorical Series.

Returns:

Numpy array.

Return type:

ndarray

cellarium.ml.utilities.data.categories_to_product_codes(x: Series | DataFrame) ndarray[source]

Convert a pandas Series or DataFrame of categorical data to a numpy array of codes. If the input is a DataFrame, the output is created by first combining the codes of each column into a single code representing the Cartesian product of the categories.

Parameters:

x (Series | DataFrame) – Pandas Series object or a pandas DataFrame containing multiple categorical Series.

Returns:

Numpy array.

Return type:

ndarray
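
One common way to combine per-column codes into a single code over the Cartesian product is mixed-radix encoding. The sketch below assumes that scheme purely for illustration; the numbering produced by categories_to_product_codes may differ. product_codes_sketch is a hypothetical name, and it takes plain integer codes rather than pandas objects.

```python
import numpy as np


def product_codes_sketch(codes_nc, n_categories_c):
    """Combine per-column integer codes into one code per row (mixed radix)."""
    codes_nc = np.asarray(codes_nc)
    # ravel_multi_index expects one index array per column and the size of each axis.
    return np.ravel_multi_index(tuple(codes_nc.T), dims=tuple(n_categories_c))


# Two categorical columns with 2 and 3 categories (assumed toy data).
codes_nc = [[0, 0], [0, 2], [1, 1]]
product_codes = product_codes_sketch(codes_nc, n_categories_c=(2, 3))
```

Each distinct combination of column categories maps to a distinct integer in the range 0 to 5.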

cellarium.ml.utilities.data.get_categories(x: Series) ndarray[source]

Get the categories of a pandas Series object.

Parameters:

x (Series) – Pandas Series object.

Returns:

Numpy array.

Return type:

ndarray

cellarium.ml.utilities.data.get_var_names_g_indices(input_var_names_g: ndarray, stored_var_names_g: ndarray) ndarray[source]

Return integer indices that map each gene in input_var_names_g to its position in stored_var_names_g.

This allows parametric transforms (e.g. ZScore, DivideByScale) to accept any subset or reordering of the gene space they were initialized with, by looking up the per-gene statistics for only the genes present in the current batch.

Parameters:
  • input_var_names_g (ndarray) – Gene names arriving at the transform (may be a subset or reordering of stored_var_names_g).

  • stored_var_names_g (ndarray) – The full gene-name schema the transform was initialized with.

Returns:

A 1-D integer array of length len(input_var_names_g) where element i is the index of input_var_names_g[i] in stored_var_names_g.

Raises:

ValueError – If any gene in input_var_names_g is absent from stored_var_names_g.

Return type:

ndarray
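
The lookup can be sketched with a dictionary from gene name to position. This is a minimal illustration rather than the library's implementation; var_names_indices_sketch, the gene names, and stored_mean_g are hypothetical.

```python
import numpy as np


def var_names_indices_sketch(input_var_names_g, stored_var_names_g):
    """Map each input gene name to its position in the stored schema."""
    position = {name: i for i, name in enumerate(stored_var_names_g)}
    missing = [name for name in input_var_names_g if name not in position]
    if missing:
        raise ValueError(f"Genes not found in the stored schema: {missing}")
    return np.array([position[name] for name in input_var_names_g], dtype=np.int64)


stored_var_names_g = np.array(["g0", "g1", "g2", "g3"])
stored_mean_g = np.array([0.0, 10.0, 20.0, 30.0])  # assumed per-gene statistic

input_var_names_g = np.array(["g3", "g1"])          # subset, reordered
idx = var_names_indices_sketch(input_var_names_g, stored_var_names_g)
mean_for_batch_g = stored_mean_g[idx]               # statistics aligned to the batch
```

Indexing the stored statistics with idx yields values in the same gene order as the incoming batch.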

cellarium.ml.utilities.data.get_cl_classes_from_owl(owl_uri: str) list[source]

Get CL classes from an OWL file: the ontology IDs, e.g. CL_0000123.

Parameters:

owl_uri (str) – The URI of the OWL file.

Returns:

A list of CL classes.

Return type:

list

cellarium.ml.utilities.data.get_cl_descendant_tensor_from_owl(owl_uri: str) Tensor[source]

Get a descendant tensor from an OWL file. Includes “unknown” as a new disconnected category at the end.

Parameters:

owl_uri (str) – The URI of the OWL file.

Returns:

A descendant tensor, where the entry at index (i, j) is True if i=j or cell type j is a descendant of cell type i, and False otherwise.

Return type:

Tensor
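
The (i, j) convention can be illustrated on a toy hierarchy. The sketch below uses a NumPy boolean array instead of a torch.Tensor and a hand-written parent-to-children mapping instead of an OWL file; the names are made up and are not real CL terms.

```python
import numpy as np

# Toy ontology (assumed, not real CL terms): parent -> direct children.
children = {
    "cell": ["neuron", "T cell"],
    "neuron": ["motor neuron"],
    "T cell": [],
    "motor neuron": [],
}
names = list(children) + ["unknown"]   # "unknown" appended last, disconnected
index = {name: i for i, name in enumerate(names)}


def descendants(node):
    """All strict descendants of ``node`` via depth-first traversal."""
    found, stack = set(), list(children.get(node, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.add(child)
            stack.extend(children.get(child, []))
    return found


descendant_nn = np.eye(len(names), dtype=bool)   # i == j is always True
for name in names:
    for d in descendants(name):
        # Row i is the ancestor, column j is the descendant.
        descendant_nn[index[name], index[d]] = True
```

The row and column for "unknown" contain a single True on the diagonal, since it has no ancestors or descendants.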

cellarium.ml.utilities.data.get_cl_names_from_owl(owl_uri: str) list[str][source]

Get cell type names (e.g., CL:0000123) from an OWL file, with “unknown” appended as a new disconnected category.

Parameters:

owl_uri (str) – The URI of the OWL file.

Returns:

A list of cell type names, where the name at index i corresponds to the cell type at index i in the descendant tensor.

Return type:

list[str]

cellarium.ml.utilities.data.compute_cl_distance_matrix(owl_uri: str) DataFrame[source]

Compute an all-pairs shortest-path distance matrix over the Cell Ontology (CL).

Nodes are all CL classes found in the OWL file. Distances are computed on the undirected ontology graph, so sibling cell types separated by a common parent get a finite distance (2 hops) rather than infinity.

This is a slow, one-time offline pre-computation (typically ~1 minute for the full CL ontology). Save the result to disk and pass it to the scVI model constructor via ontology_distance_matrix. Example:

df = compute_cl_distance_matrix(
    "https://github.com/obophenotype/cell-ontology/releases/download/v2024-01-04/cl.owl"
)
df.to_parquet("cl_distance_matrix.parquet")

Parameters:

owl_uri (str) – URI or local path of the CL OWL file (passed to owlready2.get_ontology(...).load()).

Returns:

A symmetric square pandas.DataFrame of float32 values with CL ID strings (e.g. "CL:0000540") as both index and columns. Diagonal entries are 0.0; disconnected pairs have inf.

Raises:

ImportError – If owlready2 or networkx is not installed. Install with pip install cellarium-ml[ontology].

Return type:

DataFrame
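
The effect of treating the ontology as an undirected graph can be seen on a toy example. The sketch below uses a plain breadth-first search over made-up nodes; it does not use owlready2 or networkx, and it is not the library's implementation.

```python
import math
from collections import deque

# Toy ontology edges (assumed, not real CL terms): child -> parent.
edges = [("neuron", "cell"), ("T cell", "cell"), ("motor neuron", "neuron")]
nodes = ["cell", "neuron", "T cell", "motor neuron", "isolated"]

# Build an undirected adjacency list so paths can go up and down the hierarchy.
neighbors = {node: set() for node in nodes}
for child, parent in edges:
    neighbors[child].add(parent)
    neighbors[parent].add(child)


def hops_from(source):
    """Breadth-first search: hop count from ``source`` to every node."""
    dist = {node: math.inf for node in nodes}
    dist[source] = 0.0
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if dist[nxt] == math.inf:   # not visited yet
                dist[nxt] = dist[current] + 1.0
                queue.append(nxt)
    return dist


distance = {node: hops_from(node) for node in nodes}
```

The siblings "neuron" and "T cell" are two hops apart through their common parent, while the node with no edges stays at infinite distance from everything else.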

Distributed utilities

This module contains helper functions for distributed training.

class cellarium.ml.utilities.distributed.GatherLayer(*args, **kwargs)[source]

Bases: Function

Gather tensors from all processes, supporting backward propagation.

cellarium.ml.utilities.distributed.get_rank_and_num_replicas() tuple[int, int][source]

Return the rank of the current process and the number of processes in the default process group. If the distributed package is not available or the default process group has not been initialized, return rank=0 and num_replicas=1.

Returns:

Tuple of rank and num_replicas.

Return type:

tuple[int, int]
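
The fallback behaviour can be sketched with the public torch.distributed API. This is an assumption about how such a helper might look, not the library source; rank_and_num_replicas_sketch is a hypothetical name, and the ImportError branch exists only so the sketch also runs where PyTorch is not installed.

```python
def rank_and_num_replicas_sketch():
    """Return (rank, num_replicas), falling back to (0, 1) outside distributed runs."""
    try:
        import torch.distributed as dist
    except ImportError:
        # Assumption for this sketch only: treat a missing torch install
        # the same way as an unavailable distributed package.
        return 0, 1
    if not dist.is_available() or not dist.is_initialized():
        return 0, 1
    return dist.get_rank(), dist.get_world_size()


rank, num_replicas = rank_and_num_replicas_sketch()
```

In a single-process script with no initialized process group, this yields rank 0 and one replica.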

cellarium.ml.utilities.distributed.get_worker_info() tuple[int, int][source]

Return worker_id and num_workers. If called from the main process, return worker_id=0 and num_workers=1.

Returns:

Tuple of worker_id and num_workers.

Return type:

tuple[int, int]
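
A comparable fallback can be written with torch.utils.data.get_worker_info, which returns None outside a dataloader worker. The sketch is illustrative only; worker_info_sketch is a hypothetical name, and the ImportError branch is there so the code also runs without PyTorch.

```python
def worker_info_sketch():
    """Return (worker_id, num_workers), falling back to (0, 1) in the main process."""
    try:
        from torch.utils.data import get_worker_info
    except ImportError:
        # Assumption for this sketch only: no torch means no worker processes.
        return 0, 1
    info = get_worker_info()   # None when called from the main process
    if info is None:
        return 0, 1
    return info.id, info.num_workers


worker_id, num_workers = worker_info_sketch()
```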

Testing utilities

This module contains helper functions for testing.

cellarium.ml.utilities.testing.assert_positive(name: str, number: float) None[source]

Assert that a number is positive.

Parameters:
  • name (str) – The name of the number.

  • number (float) – The number to check.

Raises:

ValueError – If the number is not positive.

Return type:

None

cellarium.ml.utilities.testing.assert_nonnegative(name: str, number: float) None[source]

Assert that a number is non-negative.

Parameters:
  • name (str) – The name of the number.

  • number (float) – The number to check.

Raises:

ValueError – If the number is negative.

Return type:

None

cellarium.ml.utilities.testing.assert_columns_and_array_lengths_equal(matrix_name: str, matrix: ndarray | Tensor, array_name: str, array: ndarray | Tensor) None[source]

Assert that the number of columns in a matrix matches the length of an array.

Parameters:
  • matrix_name (str) – The name of the matrix.

  • matrix (ndarray | Tensor) – The matrix.

  • array_name (str) – The name of the array.

  • array (ndarray | Tensor) – The array.

Raises:

ValueError – If the number of columns in the matrix does not match the length of the array.

Return type:

None
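
A shape check of this kind can be sketched as follows. The function name and error message are hypothetical and only illustrate the documented contract: a ValueError is raised when the column count and the array length disagree.

```python
import numpy as np


def check_columns_sketch(matrix_name, matrix, array_name, array):
    """Raise ValueError unless matrix has as many columns as array has entries."""
    n_columns = matrix.shape[1]
    if n_columns != len(array):
        raise ValueError(
            f"The number of columns in {matrix_name} ({n_columns}) must match "
            f"the length of {array_name} ({len(array)})."
        )


x_ng = np.zeros((5, 3))                      # 5 cells x 3 genes (toy data)
var_names_g = np.array(["g0", "g1", "g2"])
check_columns_sketch("x_ng", x_ng, "var_names_g", var_names_g)  # passes silently

try:
    check_columns_sketch("x_ng", x_ng, "var_names_g", var_names_g[:2])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Passing the names alongside the values lets the error message identify which inputs disagree.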

cellarium.ml.utilities.testing.assert_arrays_equal(a1_name: str, a1: ndarray, a2_name: str, a2: ndarray) None[source]

Assert that two arrays are equal.

Parameters:
  • a1_name (str) – The name of the first array.

  • a1 (ndarray) – The first array.

  • a2_name (str) – The name of the second array.

  • a2 (ndarray) – The second array.

Raises:

ValueError – If the arrays are not equal.

Return type:

None

cellarium.ml.utilities.testing.assert_slope_equals(data: Series, slope: float, loglog: bool = False, atol: float = 0.0001) None[source]

Assert that the slope of a series is equal to a given value.

Parameters:
  • data (Series) – The pandas.Series object to check.

  • slope (float) – Expected slope.

  • loglog (bool) – Whether to use log-log scale.

  • atol (float) – The absolute tolerance.

Raises:

ValueError – If the slope is not equal to the given value.

Return type:

None
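
One way to check a slope is a degree-1 least-squares fit, optionally after taking logarithms of both axes. The sketch below assumes that approach and takes x and y as separate sequences rather than a pandas.Series; check_slope_sketch and the toy data are hypothetical, and the library's fitting method may differ.

```python
import numpy as np


def check_slope_sketch(x, y, slope, loglog=False, atol=1e-4):
    """Fit a line to (x, y) and raise ValueError if its slope is off by more than atol."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if loglog:
        # A power law y = c * x**k becomes a line with slope k in log-log space.
        x, y = np.log(x), np.log(y)
    fitted_slope = np.polyfit(x, y, deg=1)[0]
    if abs(fitted_slope - slope) > atol:
        raise ValueError(f"Expected slope {slope}, fitted {fitted_slope:.6f}.")
    return fitted_slope


widths = [64, 128, 256, 512]
values = [3.0 * w ** -0.5 for w in widths]   # toy data following a power law
fitted = check_slope_sketch(widths, values, slope=-0.5, loglog=True)
```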

cellarium.ml.utilities.testing.get_coord_data(models: dict[int, Callable[[], LightningModule]], layer_name_to_multiplier_name: dict[str, str], train_loader: DataLoader, nsteps: int, nseeds: int) DataFrame[source]

Get coordinate data for a set of models of different widths.

Parameters:
  • models (dict[int, Callable[[], LightningModule]]) – A dictionary mapping width to a function that returns a model.

  • layer_name_to_multiplier_name (dict[str, str]) – A dictionary mapping layer names to their corresponding multipliers.

  • train_loader (DataLoader) – The training data loader.

  • nsteps (int) – The number of steps to train for.

  • nseeds (int) – The number of seeds to use.

Returns:

A pandas.DataFrame containing the coordinate data.

Return type:

DataFrame