Utilities
Data utilities
This module contains helper functions for data loading and processing.
- class cellarium.ml.utilities.data.AnnDataField(attr: str, key: list[str] | str | None = None, convert_fn: Callable[[Any], ndarray] | None = None)
Bases: object
Helper class for accessing fields of an AnnData-like object.
Example:
>>> from cellarium.ml.data import DistributedAnnDataCollection
>>> from cellarium.ml.utilities.data import AnnDataField, densify
>>> dadc = DistributedAnnDataCollection(
...     "gs://bucket-name/folder/adata{000..005}.h5ad",
...     shard_size=10_000,
...     max_cache_size=2)
>>> adata = dadc[:100]
>>> field_X = AnnDataField(attr="X", convert_fn=densify)
>>> X = field_X(adata)  # densify(adata.X)
>>> field_total_mrna_umis = AnnDataField(attr="obs", key="total_mrna_umis")
>>> total_mrna_umis = field_total_mrna_umis(adata)  # np.asarray(adata.obs["total_mrna_umis"])
- Parameters:
attr (str) – The attribute of the AnnData-like object to access.
key (list[str] | str | None) – The key of the attribute to access. If None, the entire attribute is returned.
convert_fn (Callable[[Any], ndarray] | None) – A function to apply to the attribute before returning it. If None, np.asarray() is used.
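The way the three parameters interact can be sketched as follows. This is a minimal illustration and not the library's implementation: `FieldSketch` and the `SimpleNamespace` stand-in for an AnnData-like object are hypothetical names, and the only behavior assumed is what the parameter descriptions above state.

```python
from types import SimpleNamespace

import numpy as np


class FieldSketch:
    """Hypothetical stand-in illustrating the attr / key / convert_fn logic."""

    def __init__(self, attr, key=None, convert_fn=None):
        self.attr = attr
        self.key = key
        self.convert_fn = convert_fn

    def __call__(self, adata):
        # Look up the attribute by name, e.g. adata.obs or adata.X.
        value = getattr(adata, self.attr)
        # Optionally select a key within the attribute.
        if self.key is not None:
            value = value[self.key]
        # Fall back to np.asarray when no conversion function is given.
        convert = self.convert_fn if self.convert_fn is not None else np.asarray
        return convert(value)


# A toy object with an `obs` mapping and an `X` matrix (not a real AnnData).
toy = SimpleNamespace(
    obs={"total_mrna_umis": [10, 20, 30]},
    X=[[1, 0], [0, 2], [3, 0]],
)

umis = FieldSketch(attr="obs", key="total_mrna_umis")(toy)
doubled_X = FieldSketch(attr="X", convert_fn=lambda x: 2 * np.asarray(x))(toy)
```

Here `umis` goes through the default `np.asarray` path, while `doubled_X` goes through the user-supplied conversion function.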
- cellarium.ml.utilities.data.collate_fn(batch: list[dict[str, dict[str, ndarray] | ndarray]]) → dict[str, dict[str, ndarray | Tensor] | ndarray | Tensor]
Collate function for the DataLoader. This function assumes that the batch is a list of dictionaries, where each dictionary has the same keys. If a key ends with _g or _categories, the value of that key is checked to be the same across all dictionaries in the batch and then taken from the first dictionary. Otherwise, the values of that key are concatenated along the first dimension. Values that are not strings are then converted to a torch.Tensor, and the result is returned as a dictionary.
- Parameters:
batch (list[dict[str, dict[str, ndarray] | ndarray]]) – List of dictionaries.
- Returns:
Dictionary with the same keys as the input dictionaries, but with values concatenated along the batch dimension.
- Return type:
dict[str, dict[str, ndarray | Tensor] | ndarray | Tensor]
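The collation rule described above can be sketched with NumPy alone. This is a simplified illustration under stated assumptions: `collate_sketch` is a hypothetical name, the key names `x_ng` and `var_names_g` are made up for the example, nested dictionaries are not handled, and the final conversion to `torch.Tensor` is omitted.

```python
import numpy as np


def collate_sketch(batch):
    """Hypothetical NumPy-only sketch of the collation rule described above."""
    keys = batch[0].keys()
    if any(item.keys() != keys for item in batch[1:]):
        raise ValueError("All dictionaries in the batch must have the same keys.")

    collated = {}
    for key in keys:
        first = batch[0][key]
        if key.endswith("_g") or key.endswith("_categories"):
            # Shared (non-batch) values must be identical across the batch.
            if not all(np.array_equal(first, item[key]) for item in batch[1:]):
                raise ValueError(f"Values for {key!r} differ across the batch.")
            collated[key] = first
        else:
            # Per-sample values are concatenated along the first dimension.
            collated[key] = np.concatenate([item[key] for item in batch], axis=0)
    return collated


# Two mini-batches with 2 and 4 rows that share the same variable names.
batch = [
    {"x_ng": np.ones((2, 3)), "var_names_g": np.array(["a", "b", "c"])},
    {"x_ng": np.zeros((4, 3)), "var_names_g": np.array(["a", "b", "c"])},
]
out = collate_sketch(batch)
```

In this sketch `out["x_ng"]` has six rows, while `out["var_names_g"]` is taken unchanged from the first dictionary.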
- cellarium.ml.utilities.data.densify(x: csr_matrix) → ndarray
Convert a sparse matrix to a dense matrix.
- Parameters:
x (csr_matrix) – Sparse matrix.
- Returns:
Dense matrix.
- Return type:
ndarray
- cellarium.ml.utilities.data.categories_to_codes(x: Series | DataFrame) → ndarray
Convert a pandas Series or DataFrame of categorical data to a numpy array of codes. Returned array is always a copy.
- Parameters:
x (Series | DataFrame) – Pandas Series object or a pandas DataFrame containing multiple categorical Series.
- Returns:
Numpy array.
- Return type:
ndarray
- cellarium.ml.utilities.data.get_categories(x: Series) → ndarray
Get the categories of a pandas Series object.
- Parameters:
x (Series) – Pandas Series object.
- Returns:
Numpy array.
- Return type:
ndarray
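The relationship between codes and categories can be illustrated with pandas' own `.cat` accessor. The snippet below does not call the cellarium helpers; it only assumes that they expose the same two pieces of information, and the `cell_type` data is invented for the example.

```python
import numpy as np
import pandas as pd

# A categorical Series; pandas sorts the inferred categories: B, NK, T.
cell_type = pd.Series(["B", "T", "B", "NK"], dtype="category")

# Integer code of each entry; np.array(...) makes an explicit copy.
codes = np.array(cell_type.cat.codes)
# The category labels that the codes index into.
categories = cell_type.cat.categories.to_numpy()

# Indexing the categories with the codes recovers the original labels.
recovered = categories[codes]
```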
Distributed utilities
This module contains helper functions for distributed training.
- class cellarium.ml.utilities.distributed.GatherLayer(*args, **kwargs)
Bases: Function
Gather tensors from all processes, supporting backward propagation.
- cellarium.ml.utilities.distributed.get_rank_and_num_replicas() → tuple[int, int]
This helper function returns the rank of the current process and the number of processes in the default process group. If the distributed package is not available or the default process group has not been initialized, it returns rank=0 and num_replicas=1.
- Returns:
Tuple of rank and num_replicas.
- Return type:
tuple[int, int]
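The fallback behavior described above might look roughly like the following. This is a sketch, not the library's code: `rank_and_replicas_sketch` is a hypothetical name, and the extra `ImportError` branch (for environments without PyTorch) is an addition made only so the example runs anywhere.

```python
def rank_and_replicas_sketch():
    """Hypothetical sketch: fall back to (0, 1) outside distributed runs."""
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed at all (assumption added for this sketch).
        return 0, 1
    if not dist.is_available() or not dist.is_initialized():
        # No distributed support, or no default process group yet.
        return 0, 1
    return dist.get_rank(), dist.get_world_size()


rank, num_replicas = rank_and_replicas_sketch()
```

In an ordinary single-process script, the call falls through to the `(0, 1)` default.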
- cellarium.ml.utilities.distributed.get_worker_info() → tuple[int, int]
This helper function returns worker_id and num_workers. If it is running in the main process, it returns worker_id=0 and num_workers=1.
- Returns:
Tuple of worker_id and num_workers.
- Return type:
tuple[int, int]
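A comparable sketch for the worker information is shown below. It assumes the helper builds on `torch.utils.data.get_worker_info()`, which returns `None` in the main process; `worker_info_sketch` is a hypothetical name, and the `ImportError` branch is added only so the example runs without PyTorch.

```python
def worker_info_sketch():
    """Hypothetical sketch: fall back to (0, 1) in the main process."""
    try:
        from torch.utils.data import get_worker_info
    except ImportError:
        # PyTorch is not installed at all (assumption added for this sketch).
        return 0, 1
    info = get_worker_info()
    if info is None:
        # Main process: no DataLoader worker is active.
        return 0, 1
    return info.id, info.num_workers


worker_id, num_workers = worker_info_sketch()
```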
Testing utilities
This module contains helper functions for testing.
- cellarium.ml.utilities.testing.assert_positive(name: str, number: float)
Assert that a number is positive.
- Parameters:
name (str) – The name of the number.
number (float) – The number to check.
- Raises:
ValueError – If the number is not positive.
- cellarium.ml.utilities.testing.assert_nonnegative(name: str, number: float)
Assert that a number is non-negative.
- Parameters:
name (str) – The name of the number.
number (float) – The number to check.
- Raises:
ValueError – If the number is negative.
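The difference between the two checks is only whether zero is accepted. A minimal sketch, using the hypothetical names `check_positive` and `check_nonnegative` and invented error messages:

```python
def check_positive(name, number):
    """Hypothetical sketch: zero and negative values are rejected."""
    if number <= 0:
        raise ValueError(f"`{name}` must be positive, got {number}.")


def check_nonnegative(name, number):
    """Hypothetical sketch: only negative values are rejected."""
    if number < 0:
        raise ValueError(f"`{name}` must be non-negative, got {number}.")


# Zero passes the non-negative check but fails the positive check.
check_nonnegative("lr", 0.0)
try:
    check_positive("lr", 0.0)
    message = None
except ValueError as err:
    message = str(err)
```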
- cellarium.ml.utilities.testing.assert_columns_and_array_lengths_equal(matrix_name: str, matrix: ndarray | Tensor, array_name: str, array: ndarray | Tensor)
Assert that the number of columns in a matrix matches the length of an array.
- Parameters:
matrix_name (str) – The name of the matrix.
matrix (ndarray | Tensor) – The matrix.
array_name (str) – The name of the array.
array (ndarray | Tensor) – The array.
- Raises:
ValueError – If the number of columns in the matrix does not match the length of the array.
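The check amounts to comparing the second dimension of the matrix with the length of the array. A minimal sketch with NumPy inputs, where `check_columns_match`, the error message, and the example names `x_ng` and `var_names_g` are all hypothetical:

```python
import numpy as np


def check_columns_match(matrix_name, matrix, array_name, array):
    """Hypothetical sketch: compare matrix.shape[1] with len(array)."""
    n_columns = matrix.shape[1]
    if n_columns != len(array):
        raise ValueError(
            f"`{matrix_name}` has {n_columns} columns but "
            f"`{array_name}` has length {len(array)}."
        )


counts = np.zeros((5, 3))

# Three columns and three names: the check passes silently.
check_columns_match("x_ng", counts, "var_names_g", np.array(["a", "b", "c"]))

# Three columns and two names: the check raises.
try:
    check_columns_match("x_ng", counts, "var_names_g", np.array(["a", "b"]))
    error_message = None
except ValueError as err:
    error_message = str(err)
```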
- cellarium.ml.utilities.testing.assert_arrays_equal(a1_name: str, a1: ndarray, a2_name: str, a2: ndarray)
Assert that two arrays are equal.
- Parameters:
a1_name (str) – The name of the first array.
a1 (ndarray) – The first array.
a2_name (str) – The name of the second array.
a2 (ndarray) – The second array.
- Raises:
ValueError – If the arrays are not equal.
- cellarium.ml.utilities.testing.assert_slope_equals(data: Series, slope: float, loglog: bool = False, atol: float = 0.0001)
Assert that the slope of a series is equal to a given value.
- Parameters:
data (Series) – The pandas.Series object to check.
slope (float) – Expected slope.
loglog (bool) – Whether to use log-log scale.
atol (float) – The absolute tolerance.
- Raises:
ValueError – If the slope is not equal to the given value.
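One plausible reading of this check is a least-squares line fit whose slope is compared with the expected value within `atol`, optionally after taking logarithms of both axes. The sketch below follows that reading in pure Python. It is an assumption rather than the library's method: `fitted_slope` and `check_slope` are hypothetical names, and plain sequences `x` and `y` stand in for the index and values of the `pandas.Series`.

```python
import math


def fitted_slope(x, y, loglog=False):
    """Least-squares slope of y against x, optionally on a log-log scale."""
    if loglog:
        x = [math.log(v) for v in x]
        y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    return covariance / variance


def check_slope(x, y, slope, loglog=False, atol=1e-4):
    """Hypothetical sketch: raise if the fitted slope is off by more than atol."""
    actual = fitted_slope(x, y, loglog=loglog)
    if abs(actual - slope) > atol:
        raise ValueError(f"Expected slope {slope}, got {actual}.")


# A quantity that scales as width ** -0.5 has log-log slope -0.5.
widths = [64, 128, 256, 512]
values = [w ** -0.5 for w in widths]
check_slope(widths, values, slope=-0.5, loglog=True)
```

The log-log option is what makes the check suitable for power-law scaling: a relation of the form y = c * x^k becomes a straight line with slope k after taking logarithms.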
- cellarium.ml.utilities.testing.record_out_coords(records: list[dict], width: int, name: str, t: int) → Callable[[Module, Tensor, Tensor], None]
Returns a hook to record layer output coordinate size.
- Parameters:
records (list[dict]) – The list of records to append to.
width (int) – The width of the model.
name (str) – The name of the layer.
t (int) – The time step.
- Returns:
A hook to record layer output coordinate size.
- Return type:
Callable[[Module, Tensor, Tensor], None]
- cellarium.ml.utilities.testing.get_coord_data(models: dict[int, Callable[[], Module]], train_loader: DataLoader, loss_fn: Callable[[Tensor, Tensor], Tensor], optim_fn: type[Optimizer], lr: float, nsteps: int, nseeds: int) → DataFrame
Get coordinate data for a model.
- Parameters:
models (dict[int, Callable[[], Module]]) – A dictionary mapping width to a function that returns a model.
train_loader (DataLoader) – The training data loader.
loss_fn (Callable[[Tensor, Tensor], Tensor]) – The loss function.
optim_fn (type[Optimizer]) – The optimizer class.
lr (float) – The learning rate.
nsteps (int) – The number of steps to train for.
nseeds (int) – The number of seeds to use.
- Returns:
A pandas.DataFrame containing the coordinate data.
- Return type:
DataFrame