Preprocessing

cellarium.ml.preprocessing.get_highly_variable_genes(gene_names: list, mean: Tensor, var: Tensor, n_top_genes: int | None = None, min_disp: float | None = 0.5, max_disp: float | None = inf, min_mean: float | None = 0.0125, max_mean: float | None = 3, n_bins: int = 20, batch_mean_bg: Tensor | None = None, batch_var_bg: Tensor | None = None, batch_ids: list[str] | None = None) → DataFrame

Annotate highly variable genes using the seurat flavor.

Replicates scanpy.pp.highly_variable_genes with flavor='seurat'. Optionally accepts per-batch statistics for batch-aware selection.
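As a rough illustration of what a Seurat-style selection computes, the sketch below derives a per-gene dispersion (variance over mean), moves to a log scale, bins genes by mean expression, and z-scores the dispersion within each bin. It is a simplified NumPy sketch, not this library's code: the function name seurat_normalized_dispersion is hypothetical, the bins are assumed to be equal-width, and bins with fewer than two genes are simply left as NaN, which may differ from how scanpy or cellarium treat them.

```python
import numpy as np


def seurat_normalized_dispersion(mean, var, n_bins=20):
    """Simplified Seurat-style dispersion normalization (illustrative only)."""
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)

    # Dispersion is the variance-to-mean ratio; guard against division by zero.
    safe_mean = np.where(mean == 0, 1e-12, mean)
    dispersion = var / safe_mean
    # Zero dispersion has no finite log, so mark it as missing.
    dispersion = np.where(dispersion == 0, np.nan, dispersion)
    log_disp = np.log(dispersion)
    log_mean = np.log1p(mean)

    # Assign each gene to one of n_bins equal-width bins of log-mean expression.
    edges = np.linspace(log_mean.min(), log_mean.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(log_mean, edges[1:-1]), 0, n_bins - 1)

    # Z-score the log-dispersion within each bin.
    norm = np.full_like(log_disp, np.nan)
    for b in range(n_bins):
        in_bin = (bin_idx == b) & ~np.isnan(log_disp)
        if in_bin.sum() < 2:
            continue  # assumption: too few genes to estimate a spread
        mu = log_disp[in_bin].mean()
        sd = log_disp[in_bin].std(ddof=1)
        if sd > 0:
            norm[in_bin] = (log_disp[in_bin] - mu) / sd
    return log_mean, log_disp, norm
```

Genes whose dispersion is high relative to other genes of similar mean expression receive a large normalized value, which is what makes the score comparable across expression levels.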

References:

  1. Highly Variable Genes from Scanpy.

Parameters:
  • gene_names (list) – Ensembl gene IDs.

  • mean (Tensor) – Overall gene expression means in count space (shape n_genes).

  • var (Tensor) – Overall gene expression variances in count space (shape n_genes).

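  • n_top_genes (int | None) – Number of highly variable genes to keep. When set, the min_disp, max_disp, min_mean and max_mean cutoffs are ignored.

  • min_disp (float | None) – Lower cutoff on the normalized dispersion. Ignored when n_top_genes is set.

  • max_disp (float | None) – Upper cutoff on the normalized dispersion. Ignored when n_top_genes is set.

  • min_mean (float | None) – Lower cutoff on the mean expression. Ignored when n_top_genes is set.

  • max_mean (float | None) – Upper cutoff on the mean expression. Ignored when n_top_genes is set.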

  • n_bins (int) – Number of bins for mean-expression binning.

  • batch_mean_bg (Tensor | None) – Per-batch means in count space of shape (n_batch, n_genes).

  • batch_var_bg (Tensor | None) – Per-batch variances in count space of shape (n_batch, n_genes).

  • batch_ids (list[str] | None) – Batch labels of length n_batch.
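The statistics described above can be prepared from a raw count matrix in several ways. The sketch below shows one option with NumPy on hypothetical toy data (the counts, the gene identifiers and the cell_batch labels are made up for illustration). It assumes the sample variance (ddof=1) is wanted, which this page does not state, and it leaves the conversion to Tensor and the call itself as a comment.

```python
import numpy as np

# Hypothetical toy data: 6 cells x 3 genes of raw counts, split over 2 batches.
counts = np.array(
    [
        [0, 2, 4],
        [1, 2, 0],
        [2, 2, 2],
        [3, 0, 6],
        [4, 0, 8],
        [5, 0, 10],
    ],
    dtype=float,
)
cell_batch = np.array(["a", "a", "a", "b", "b", "b"])
gene_names = ["gene_a", "gene_b", "gene_c"]  # placeholders for Ensembl IDs

# Overall per-gene statistics in count space, shape (n_genes,).
mean = counts.mean(axis=0)
var = counts.var(axis=0, ddof=1)  # assumption: sample variance

# Per-batch statistics, shape (n_batch, n_genes), rows ordered like batch_ids.
batch_ids = [str(b) for b in np.unique(cell_batch)]
batch_mean_bg = np.stack([counts[cell_batch == b].mean(axis=0) for b in batch_ids])
batch_var_bg = np.stack(
    [counts[cell_batch == b].var(axis=0, ddof=1) for b in batch_ids]
)

# The documented signature takes Tensors; torch.as_tensor(mean) and friends
# would be one way to convert before passing these arrays to
# get_highly_variable_genes(gene_names, mean, var, ...).
```

Keeping the rows of batch_mean_bg and batch_var_bg in the same order as batch_ids is what lets each row be matched to its batch label.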

Returns:

DataFrame indexed by gene_names with columns highly_variable, means, dispersions and dispersions_norm. Single-batch mode adds mean_bin; batch mode adds highly_variable_nbatches and highly_variable_intersection.

Return type:

DataFrame
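To illustrate how the n_top_genes argument relates to the four cutoffs, the sketch below applies both selection styles to a hypothetical frame that carries two of the documented columns. The helper flag_highly_variable is not part of this library. It assumes the rule used by scanpy's seurat flavor: strict inequalities on means and dispersions_norm, or a plain ranking by dispersions_norm when n_top_genes is given.

```python
import numpy as np
import pandas as pd


def flag_highly_variable(df, n_top_genes=None, min_disp=0.5, max_disp=np.inf,
                         min_mean=0.0125, max_mean=3.0):
    """Illustrative selection rule; not the library's code."""
    disp_norm = df["dispersions_norm"]
    if n_top_genes is not None:
        # Rank-based selection: keep the genes with the largest normalized
        # dispersions and ignore every cutoff.
        top = disp_norm.nlargest(n_top_genes).index
        return df.index.isin(top)
    # Cutoff-based selection on mean expression and normalized dispersion.
    keep = (
        (df["means"] > min_mean)
        & (df["means"] < max_mean)
        & (disp_norm > min_disp)
        & (disp_norm < max_disp)
    )
    return keep.to_numpy()


# Hypothetical result frame with made-up values.
df = pd.DataFrame(
    {"means": [0.005, 0.5, 1.2, 4.0], "dispersions_norm": [2.0, 1.5, 0.1, 3.0]},
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)
by_cutoff = flag_highly_variable(df)               # only gene_b passes
by_rank = flag_highly_variable(df, n_top_genes=2)  # gene_a and gene_d
```

The two modes can disagree: gene_a and gene_d have the largest normalized dispersions, yet both fall outside the default mean range and are rejected by the cutoffs.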