velocyto.analysis module

class velocyto.analysis.VelocytoLoom(loom_filepath: str)[source]

Bases: object

A convenient object to store the data of a velocyto loom file.

Data will be stored in memory

Examples

For usage examples consult the documentation
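
A minimal sketch of loading a loom file into memory (the path mydata.loom is hypothetical):

>>> from velocyto.analysis import VelocytoLoom
>>> vlm = VelocytoLoom("mydata.loom")  # reads S, U, A, ca, ra into memory
>>> vlm.S.shape                        # spliced molecule count matrix
>>> sorted(vlm.ca.keys())              # cell (column) attributes from the loom file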

Variables:
  • S (np.ndarray) – Expressed spliced molecules
  • U (np.ndarray) – Unspliced molecule count
  • A (np.ndarray) – Ambiguous molecule count
  • ca (dict) – Column attributes of the loom file
  • ra (dict) – Row attributes of the loom file
  • loom_filepath (str) – The original path the loom file was read from
  • initial_cell_size (int) – The sum of spliced molecules
  • initial_Ucell_size (int) – The sum of unspliced molecules
to_hdf5(filename: str, **kwargs) → None[source]

Serialize the VelocytoLoom object in its current state

Parameters:
  • filename – The name of the file that will be generated (the suffix hdf5 is suggested but not enforced)
  • **kwargs – The function accepts the arguments of dump_hdf5
Returns:

Return type:

Nothing. It saves a file that can be used to recreate the object in another session.

Note

The object can be reloaded calling load_velocyto_hdf5(filename)
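
A sketch of a save/reload round trip; the filename is arbitrary:

>>> vlm.to_hdf5("vlm_session.hdf5")
>>> # later, possibly in a new session:
>>> from velocyto.analysis import load_velocyto_hdf5
>>> vlm = load_velocyto_hdf5("vlm_session.hdf5")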

plot_fractions(save2file: str = None) → None[source]

Plots a barplot showing the abundance of spliced/unspliced molecules in the dataset

Parameters:save2file (str (default: None)) – If not None specifies the file path to which plots get saved
Returns:
Return type:Nothing, it plots a barplot
filter_cells(bool_array: numpy.ndarray) → None[source]

Filter cells using a boolean array.

Parameters:bool_array (np.ndarray of bool, one entry per cell) – array describing the cells to keep (True).
Returns:
Return type:Nothing but it removes some cells from S and U.
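
A minimal sketch that keeps cells with more than 500 spliced molecules (the threshold is arbitrary; per-cell totals are computed as self.S.sum(0), as described under normalize):

>>> mask = vlm.S.sum(0) > 500   # True for the cells to keep
>>> vlm.filter_cells(bool_array=mask)
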
set_clusters(cluster_labels: numpy.ndarray, cluster_colors_dict: Dict[str, List[float]] = None, colormap: Any = None) → None[source]

Utility function to set cluster labels, names and colormap

Parameters:
  • cluster_labels (np.ndarray) – A vector of strings containing the name of the cluster for each cell
  • cluster_colors_dict (dict[str, List[float]]) – A mapping cluster_name -> rgb_color_triplet for example “StemCell”:[0.65,0.1,0.4]
  • colormap – (optional) As an alternative to cluster_colors_dict, a colormap object (e.g. from matplotlib or a similar callable) can be passed
Returns:

Return type:

Nothing, the attributes cluster_labels, colorandum, cluster_ix, cluster_uid are created.

cluster_uid
cluster_ix
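
A minimal sketch of assigning clusters; the column-attribute key "ClusterName" and the color values are hypothetical:

>>> colors = {"StemCell": [0.65, 0.1, 0.4], "Neuron": [0.2, 0.5, 0.8]}
>>> vlm.set_clusters(vlm.ca["ClusterName"], cluster_colors_dict=colors)
>>> vlm.cluster_labels[:3]  # one label per cell
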
score_cv_vs_mean(N: int = 3000, min_expr_cells: int = 2, max_expr_avg: float = 20, min_expr_avg: int = 0, svr_gamma: float = None, winsorize: bool = False, winsor_perc: Tuple[float, float] = (1, 99.5), sort_inverse: bool = False, which: str = 'S', plot: bool = False) → numpy.ndarray[source]

Rank genes on the basis of a CV vs mean fit, using a nonparametric fit (Support Vector Regression)

Parameters:
  • N (int) – the number of genes to select
  • min_expr_cells (int, (default=2)) – minimum number of cells that express that gene for it to be considered in the fit
  • min_expr_avg (int, (default=0)) – The minimum average expression accepted before discarding the gene as not expressed
  • max_expr_avg (float, (default=20)) – The maximum average expression accepted before discarding the gene as house-keeping/outlier
  • svr_gamma (float) – the gamma hyper-parameter of the SVR
  • winsorize (bool) – Whether to winsorize the data for the cv vs mean model
  • winsor_perc (tuple, default=(1, 99.5)) – the lower and upper percentile bounds of the winsorization
  • sort_inverse (bool, (default=False)) – if True it sorts genes from less noisy to more noisy (to use for size estimation not for feature selection)
  • which (str, (default="S")) – whether to perform the procedure on the spliced ("S") or unspliced ("U") counts. "both" is NOT supported here because the two procedures would most often require different parameters (notice that the default parameters are good heuristics only for "S")
  • plot (bool, default=False) – whether to show a plot
Returns:

  • Nothing but it creates the attributes
  • cv_mean_score (np.ndarray) – How much the observed CV is higher than the one predicted by a noise model fit to the data
  • cv_mean_selected (np.ndarray bool) – the genes selected on the basis of the N parameter
  • Note – genes excluded from the fit will have in the output the same score as the lowest scoring gene in the dataset
  • To perform the filtering use the method filter_genes
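
A minimal sketch of feature selection with this method (the parameter values are illustrative):

>>> vlm.score_cv_vs_mean(N=3000, min_expr_cells=2, max_expr_avg=35, plot=True)
>>> vlm.filter_genes(by_cv_vs_mean=True)  # keep the N selected genes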

robust_size_factor(pc: float = 0.1, which: str = 'both') → None[source]

Calculates a size factor in a way similar to Anders and Huber (2010)

Parameters:
  • pc (float, default=0.1) – The pseudocount to add to the expression before taking the log for the purpose of the size factor calculation
  • which (str, default="both") – For which counts estimate the normalization size factor. It can be “both”, “S” or “U”
Returns:

  • Nothing but it creates the attributes self.size_factor and self.Usize_factor
  • The normalization is self.S / self.size_factor and is performed by calling self.normalize(relative_size=self.size_factor)

Note

Before running this method, score_cv_vs_mean needs to be run with sort_inverse=True, since only lowly variable genes are used for this size estimation

score_cluster_expression(min_avg_U: float = 0.02, min_avg_S: float = 0.08) → numpy.ndarray[source]

Prepare filtering genes on the basis of cluster-wise expression threshold

Parameters:
  • min_avg_U (float) – Include genes that have an unspliced average bigger than min_avg_U in at least one of the clusters
  • min_avg_S (float) – Include genes that have a spliced average bigger than min_avg_S in at least one of the clusters
  • Note – the two conditions are combined by the "&" (logical and) operator
Returns:

  • Nothing but it creates the attribute
  • clu_avg_selected (np.ndarray bool) – boolean array marking the genes that pass the cluster-wise thresholds
  • To perform the filtering use the method filter_genes

score_detection_levels(min_expr_counts: int = 50, min_cells_express: int = 20, min_expr_counts_U: int = 0, min_cells_express_U: int = 0) → numpy.ndarray[source]

Prepare basic filtering of genes on the basis of their detection levels

Parameters:
  • min_expr_counts (int) – The minimum number of spliced molecules detected considering all the cells
  • min_cells_express (int) – The minimum number of cells that express spliced molecules of a gene
  • min_expr_counts_U (int) – The minimum number of unspliced molecules detected considering all the cells
  • min_cells_express_U (int) – The minimum number of cells that express unspliced molecules of a gene
  • Note – the conditions are combined by the "&" (logical and) operator
Returns:

  • Nothing but an attribute self.detection_level_selected is created
  • To perform the filtering by detection levels use the method filter_genes
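
A sketch combining detection-level and cluster-wise scoring in a single filtering pass; the thresholds are illustrative, and score_cluster_expression assumes set_clusters has been called beforehand:

>>> vlm.score_detection_levels(min_expr_counts=40, min_cells_express=30)
>>> vlm.score_cluster_expression(min_avg_U=0.01, min_avg_S=0.08)
>>> vlm.filter_genes(by_detection_levels=True, by_cluster_expression=True)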

filter_genes(by_detection_levels: bool = False, by_cluster_expression: bool = False, by_cv_vs_mean: bool = False, by_custom_array: Any = None, keep_unfiltered: bool = False) → None[source]

Filter genes, taking care that all the matrices and all the connected annotations get filtered accordingly

Attributes affected: .U, .S, .ra

Parameters:
  • by_detection_levels (bool, default=False) – filter genes by the score_detection_levels result
  • by_cluster_expression (bool, default=False) – filter genes by the score_cluster_expression result
  • by_cv_vs_mean (bool, default=False) – filter genes by the score_cv_vs_mean result
  • by_custom_array (np.ndarray, default=None) – provide a boolean or index array
  • keep_unfiltered (bool, default=False) – whether to create attributes self.S_prefilter, self.U_prefilter, self.ra_prefilter, (array will be made sparse to minimize memory footprint) or just overwrite the previous arrays
Returns:

Return type:

Nothing but it updates the self.S, self.U, self.ra attributes

custom_filter_attributes(attr_names: List[str], bool_filter: numpy.ndarray) → None[source]

Filter attributes given a boolean array. The attributes named in attr_names can be dictionaries or numpy arrays

Parameters:
  • attr_names (List[str]) – a list of the attributes to be modified. They can be 1d arrays, dictionaries of 1d arrays, or ndarrays; ndarrays are filtered along axis=0, or along axis=-1 if .T is appended to the name
  • bool_filter – the boolean filter to be applied
Returns:

Return type:

Nothing, it filters the specified attributes

normalize(which: str = 'both', size: bool = True, log: bool = True, pcount: float = 1, relative_size: numpy.ndarray = None, use_S_size_for_U: bool = False, target_size: Tuple[float, float] = (None, None)) → None[source]

Normalization interface

Parameters:
  • which (either 'both', 'S', 'U', "imputed", "Sx", "Ux") – which attributes to normalize. "both" corresponds to "S" and "U"; "imputed" corresponds to "Sx" and "Ux"
  • size (bool) – perform size normalization
  • log (bool) – perform log normalization (if size==True, this comes after the size normalization)
  • pcount (float, default=1) – The extra count added before taking the log2
  • relative_size (np.ndarray, default=None) – if None, the sum of molecules per cell (self.S.sum(0)) is used; if an array is provided, it is used for the normalization
  • use_S_size_for_U (bool) – U is size normalized using the sum of molecules of S
  • target_size (float or Tuple[float, float], depending on whether the which parameter implies 1 or 2 normalizations) – the size the cells will be set to after normalization. If a tuple, the order is (S, U) or (Sx, Ux). If None, the target size is the average of the cell sizes
Returns:

  • Nothing but it creates the attributes U_norm, U_sz and S_norm, S_sz
  • or Ux_norm, Ux_sz and Sx_norm, Sx_sz
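
A minimal sketch of size and log normalization of both matrices:

>>> vlm.normalize("both", size=True, log=True)
>>> vlm.S_sz.shape    # size-normalized spliced counts
>>> vlm.S_norm.shape  # size-normalized, log2-transformed spliced counts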

perform_PCA(which: str = 'S_norm', n_components: int = None, div_by_std: bool = False) → None[source]

Perform PCA (cells as samples)

Parameters:
  • which (str, default="S_norm") – The name of the attribute to use for the calculation (e.g. S_norm or Sx_norm)
  • n_components (int, default=None) – Number of components to keep. If None all the components will be kept.
  • div_by_std (bool, default=False) – Whether to divide by the standard deviation
Returns:

  • Returns nothing but it creates the attributes
  • pca (np.ndarray) – a numpy array of shape (cells, npcs)

normalize_by_total(min_perc_U: float = 0.5, plot: bool = False, skip_low_U_pop: bool = True, same_size_UnS: bool = False) → None[source]

Normalize the cells using the (initial) total molecules as size estimate

Parameters:
  • min_perc_U (float) – the percentile to use as a minimum value allowed for the size normalization
  • plot (bool, default=False) – whether to show a diagnostic plot
  • skip_low_U_pop (bool, default=True) – the population with very low unspliced counts will not be multiplied by the scaling factor, to avoid predicting very strong velocity merely as a consequence of low detection
  • same_size_UnS (bool, default=False) – Each cell is set to have the same total number of spliced and unspliced molecules
Returns:

  • Returns nothing but it creates the attributes
  • small_U_pop (np.ndarray) – Cells with extremely low unspliced count

normalize_by_size_factor(min_perc_U: float = 0.5, plot: bool = False, skip_low_U_pop: bool = True, same_size_UnS: bool = False) → None[source]

Normalize the cells using the (initial) size_factor

Parameters:
  • min_perc_U (float) – the percentile to use as a minimum value allowed for the size normalization
  • plot (bool, default=False) – whether to show a diagnostic plot
  • skip_low_U_pop (bool, default=True) – the population with very low unspliced counts will not be multiplied by the scaling factor, to avoid predicting very strong velocity merely as a consequence of low detection
  • same_size_UnS (bool, default=False) – Each cell is set to have the same total number of spliced and unspliced molecules
Returns:

  • Returns nothing but it creates the attributes
  • small_U_pop (np.ndarray) – Cells with extremely low unspliced count

adjust_totS_totU(skip_low_U_pop: bool = True, normalize_total: bool = False, fit_with_low_U: bool = True, svr_C: float = 100, svr_gamma: float = 1e-06, plot: bool = False) → None[source]

Adjust the spliced counts on the basis of the relation between S_sz_tot and U_sz_tot

Parameters:
  • skip_low_U_pop (bool, default=True) – Do not normalize the low unspliced molecules cell population to avoid overinflated values
  • normalize_total (bool, default=False) – If this is True the function results in a normalization by median of both U and S. NOTE: Legacy compatibility, I might want to split this into a different function.
  • fit_with_low_U (bool, default=True) – Whether to consider the low_U population for the fit
  • svr_C (float) – The C parameter of scikit-learn Support Vector Regression
  • svr_gamma (float) – The gamma parameter of scikit-learn Support Vector Regression
  • plot (bool) – Whether to plot the results of the fit
Returns:

  • Nothing but it modifies the attributes
  • U_sz (np.ndarray)

normalize_median(which: str = 'imputed', skip_low_U_pop: bool = True) → None[source]

Normalize cell size to the median, for both S and U.

Parameters:
  • which (str, default="imputed") – “imputed” or “renormalized”
  • skip_low_U_pop (bool, default=True) – Whether to skip the low U population defined in normalize_by_total
Returns:

  • Nothing but it modifies the attributes
  • S_sz (np.ndarray)
  • U_sz (np.ndarray)
  • or
  • Sx_sz (np.ndarray)
  • Ux_sz (np.ndarray)

plot_pca(dim: List[int] = [0, 1, 2], elev: float = 60, azim: float = -140) → None[source]

Plot 3d PCA

knn_imputation(k: int = None, pca_space: float = True, metric: str = 'euclidean', diag: float = 1, n_pca_dims: int = None, maximum: bool = False, size_norm: bool = True, balanced: bool = False, b_sight: int = None, b_maxl: int = None, group_constraint: Union[str, numpy.ndarray] = None, n_jobs: int = 8) → None[source]

Performs k-nn smoothing of the data matrix

Parameters:
  • k (int) – number of neighbors. If None, it defaults to 0.025 * Ncells
  • pca_space (bool, default=True) – if True the knn will be performed in PCA space (pcs) otherwise it will use log2 size normalized data (S_norm)
  • metric (str) – “euclidean” or “correlation”
  • diag (int, default=1) – before smoothing, this value is substituted into the diagonal of the knn contiguity matrix, reducing the smoothing effect. E.g. if diag=8 and k=5, the value of S_i becomes (8 * S_i + sum(S_n, with n in the 5 nearest neighbors of i)) / (8 + 5)
  • maximum (bool, default=False) – If True the maximum value of the smoothing and the original matrix entry is taken.
  • n_pca_dims (int, default=None) – number of pca to use for the knn distance metric. If None all pcs will be used. (used only if pca_space == True)
  • balanced (bool) – whether to use BalancedKNN version
  • b_sight (int) – the sight parameter of BalancedKNN (used only if balanced == True)
  • b_maxl (int) – the maxl parameter of BalancedKNN (used only if balanced == True)
  • group_constraint (str or np.ndarray[int]) – currently implemented only for balanced=True. If "clusters", the clusters will be used as a constraint so that cells of different clusters cannot be neighbors. If an array of integers of shape vlm.S.shape[1], it will be interpreted as labels of the groups
  • n_jobs (int, default 8) – number of parallel jobs in knn calculation
Returns:

  • Nothing but it creates the attributes
  • knn (scipy.sparse.csr_matrix) – knn contiguity matrix
  • knn_smoothing_w (scipy.sparse.lil_matrix) – the weights used for the smoothing
  • Sx (np.ndarray) – smoothed spliced
  • Ux (np.ndarray) – smoothed unspliced
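
A sketch of the PCA plus smoothing steps, assuming normalize has already created S_norm (the default input of perform_PCA); k and the BalancedKNN parameters are illustrative:

>>> vlm.perform_PCA(n_components=25)
>>> vlm.knn_imputation(k=500, n_pca_dims=19, balanced=True,
...                    b_sight=3000, b_maxl=1500, n_jobs=8)
>>> vlm.Sx.shape  # smoothed spliced matrix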

knn_imputation_precomputed(knn_smoothing_w: scipy.sparse.lil.lil_matrix, maximum: bool = False) → None[source]

Performs k-nn imputation (like .knn_imputation()) but with a precomputed weight matrix

Parameters:
  • knn_smoothing_w (sparse.lil_matrix) – the sparse matrix to be convolved with self.S_sz and self.U_sz. This should be the result of something like: connectivity.setdiag(diagonal_value); knn_smoothing_w = connectivity_to_weights(connectivity)
  • maximum (bool, default=False) – whether to take the maximum value of the smoothing and the original matrix
Returns:

  • Nothing but it creates the attributes
  • Sx (np.ndarray) – smoothed spliced
  • Ux (np.ndarray) – smoothed unspliced

gene_knn_imputation(k: int = 15, pca_space: float = False, metric: str = 'correlation', diag: float = 1, scale_weights: bool = True, balanced: bool = True, b_sight: int = 100, b_maxl: int = 18, n_jobs: int = 8) → None[source]

Performs k-nn smoothing of the genes

Parameters:
  • k (int, default=15) – number of neighbors
  • pca_space (bool, default=False) – if True the knn will be performed in PCA space (pcs) otherwise it will use log2 size normalized data (S_norm)
  • metric (str, default="correlation") – “euclidean” or “correlation”
  • diag (int, default=1) – before smoothing, this value is substituted into the diagonal of the knn contiguity matrix, reducing the smoothing effect. E.g. if diag=8 and k=5, the value of S_i becomes (8 * S_i + sum(S_n, with n in the 5 nearest neighbors of i)) / (8 + 5)
  • scale_weights (bool, default=True) – whether to scale weights by gene total expression/yield
  • balanced (bool, default=True) – whether to use BalancedKNN version
  • b_sight (int, default=100) – the sight parameter of BalancedKNN (used only if balanced == True)
  • b_maxl (int, default=18) – the maxl parameter of BalancedKNN (used only if balanced == True)
  • n_jobs (int, default=8) – number of parallel jobs in knn calculation
Returns:

  • Nothing but it creates the attributes
  • gknn (scipy.sparse.csr_matrix) – genes knn contiguity matrix
  • gknn_smoothing_w (scipy.sparse.lil_matrix) – the weights used for the smoothing of the genes
  • Sx (np.ndarray) – smoothed spliced
  • Ux (np.ndarray) – smoothed unspliced

fit_gammas(steady_state_bool: numpy.ndarray = None, use_imputed_data: bool = True, use_size_norm: bool = True, fit_offset: bool = True, fixperc_q: bool = False, weighted: bool = True, weights: numpy.ndarray = 'maxmin_diag', limit_gamma: bool = False, maxmin_perc: List[float] = [2, 98], maxmin_weighted_pow: float = 15) → None[source]

Fit gamma using spliced and unspliced data

Parameters:
  • steady_state_bool (np.ndarray, default=None) – if a boolean array is specified, gamma is fitted using only the corresponding cells
  • use_imputed_data (bool, default=True) – use knn smoothed data
  • use_size_norm (bool, default=True) – use size normalized data for the fit
  • fit_offset (bool, default=True) – Fit with offset
  • fixperc_q (bool, default=False) – (when fit_offset==False) Whether to fix the offset to a lower percentile of the unspliced
  • weighted (bool, default=True) – use weights for the least squares fit
  • weights (string or np.ndarray, default="maxmin_diag") – the method to determine the weights of the least squares fit. “maxmin_diag”, “maxmin”, “sum”, “prod”, “maxmin_weighted” are supported if a 2d np.ndarray is provided the entry (i,j) is the weight of the cell j when fitting gamma to gene i
  • limit_gamma (bool, default=False) – whether to limit gamma when unspliced is much higher than spliced
  • maxmin_perc (List[float], default=[2,98]) – the percentile to use if weights = “maxmin” or “maxmin_diag”
Returns:

  • Nothing, it just creates the attributes
  • gammas (np.ndarray) – the vector of the gammas fit to each gene
  • q (np.ndarray) – the vector of offsets of the fit
  • R2 (np.ndarray (optional)) – The vector of squared coefficient of determination
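
A minimal sketch of the fit, assuming knn_imputation has already produced the smoothed Sx/Ux used when use_imputed_data=True:

>>> vlm.fit_gammas(fit_offset=True, weighted=True)
>>> vlm.gammas.shape  # one gamma per gene
>>> vlm.q.shape       # one offset per gene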

filter_genes_good_fit(minR: float = 0.1, min_gamma: float = 0.01) → None[source]

For backwards compatibility, a wrapper around filter_genes_by_phase_portrait

filter_genes_by_phase_portrait(minR2: float = 0.1, min_gamma: float = 0.01, minCorr: float = 0.1) → None[source]

Use the coefficient of determination to filter away genes that have an irregular/complex phase portrait

Parameters:
  • minR2 (float, default=0.1) – Filter away low coefficient of determination fits. If None this filtering will be skipped
  • min_gamma (float, default=0.01) – Filter away low gammas. If None this filtering will be skipped
  • minCorr (float, default=0.1) – Filter away genes with low spliced-unspliced correlation. If None this filtering will be skipped
Returns:

  • Nothing, but it filters out the genes that do not satisfy the conditions
  • This affects (“U”, “U_sz”, “U_norm”, “Ux”, “Ux_sz”, “Ux_norm”, “S”, “S_sz”, “S_norm”, “Sx”, “Sx_sz”, “Sx_norm”, “gammas”, “q”, “R2”)

predict_U(which_gamma: str = 'gammas', which_S: str = 'Sx_sz', which_offset: str = 'q') → None[source]

Predict U (gamma * S) given the gamma model fit

Parameters:
  • which_gamma (str, default="gammas") – name of the attribute to use as gamma
  • which_S (str, default="Sx_sz") – name of the attribute to use as S
  • which_offset (str, default="q") – name of the attribute containing the offset
Returns:

  • Nothing but it creates the attribute
  • Upred (np.ndarray) – unspliced estimated as gamma * S

calculate_velocity(kind: str = 'residual', eps: float = None) → None[source]

Calculate velocity

Parameters:
  • kind (str, default="residual") – “residual” calculates the velocity as U_measured - U_predicted
  • eps (float, default=None) – if specified, velocities with absolute value smaller than eps * range_of_U will be set to 0 if None this step will be skipped
Returns:

  • Nothing but it creates the attribute
  • velocity (np.ndarray) – U_measured - U_predicted

calculate_shift(assumption: str = 'constant_velocity', delta_t: float = 1) → None[source]

Find the change (deltaS) in gene expression for every cell

Parameters:
  • assumption (str, default="constant_velocity") – constant_velocity (described in the paper as Model I) constant_unspliced (described in the paper as Model II)
  • delta_t (float, default=1) – the time step for extrapolation
Returns:

  • Nothing, it only creates the following attributes
  • delta_S (np.ndarray) – The variation in gene expression

extrapolate_cell_at_t(delta_t: float = 1, clip: bool = True) → None[source]

Extrapolate the gene expression profile for each cell after delta_t

Parameters:
  • delta_t (float, default=1) – the time step considered for the extrapolation
  • clip (bool, default=True) – If True negative values are clipped to zero
Returns:

  • Nothing but it creates the attributes
  • Sx_sz_t (np.ndarray) – the extrapolated expression profile
  • used_delta_t (float) – stores delta_t for future usage
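
A sketch of the extrapolation chain, run after fit_gammas; the defaults are spelled out for clarity:

>>> vlm.predict_U(which_gamma="gammas", which_S="Sx_sz", which_offset="q")
>>> vlm.calculate_velocity()                 # velocity = U_measured - U_predicted
>>> vlm.calculate_shift(assumption="constant_velocity", delta_t=1)
>>> vlm.extrapolate_cell_at_t(delta_t=1)     # creates Sx_sz_t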

perform_TSNE(n_dims: int = 2, perplexity: float = 30, initial_pos: numpy.ndarray = None, theta: float = 0.5, n_pca_dim: int = None, max_iter: int = 1000) → None[source]

Perform t-SNE on the PCA using the Barnes-Hut approximation

estimate_transition_prob(hidim: str = 'Sx_sz', embed: str = 'ts', transform: str = 'sqrt', ndims: int = None, n_sight: int = None, psc: float = None, knn_random: bool = True, sampled_fraction: float = 0.3, sampling_probs: Tuple[float, float] = (0.5, 0.1), max_dist_embed: float = None, n_jobs: int = 4, threads: int = None, calculate_randomized: bool = True, random_seed: int = 15071990, **kwargs) → None[source]

Use correlation to estimate transition probabilities for every cell to its embedding neighborhood

Parameters:
  • hidim (str, default="Sx_sz") – The name of the attribute containing the high dimensional space. It will be retrieved as getattr(self, hidim). The updated vector at time t is assumed to be getattr(self, hidim + "_t"). Appending .T to the string will transpose the matrix (useful in case we want to use S or Sx)
  • embed (str, default="ts") – The name of the attribute containing the embedding. It will be retrieved as getattr(self, embed)
  • transform (str, default="sqrt") – The transformation that is applied to the high dimensional space. If None the raw data will be used
  • ndims (int, default=None) – The number of dimensions of the high dimensional space to work with. If None all will be considered. It makes sense only when using principal components
  • n_sight (int, default=None (also n_neighbors)) – The number of neighbors to take into account when performing the projection
  • psc (float, default=None) – pseudocount added in variance normalizing transform If None, 1 would be used for log, 0 otherwise
  • knn_random (bool, default=True) – whether to random sample the neighborhoods to speedup calculation
  • sampling_probs (Tuple, default=(0.5, 0.1)) –
  • max_dist_embed (float, default=None) – CURRENTLY NOT USED. The maximum distance allowed. If None it will be set to 0.25 * average_distance_two_points_taken_at_random
  • n_jobs (int, default=4) – number of jobs to calculate knn this only applies to the knn search, for the more time consuming correlation computation see threads
  • threads (int, default=None) – The number of threads used for the actual correlation computation; by default half of the total
  • calculate_randomized (bool, default=True) – Calculate the transition probabilities with randomized residuals. This can be plotted downstream as a negative control and can be used to adjust the visualization scale of the velocity field.
  • random_seed (int, default=15071990) – Random seed to make knn_random mode reproducible
calculate_embedding_shift(sigma_corr: float = 0.05, expression_scaling: bool = True, scaling_penalty: float = 1.0) → None[source]

Use the transition probability to project the velocity direction on the embedding

Parameters:
  • sigma_corr (float, default=0.05) – the kernel scaling
  • expression_scaling (bool, default=True) – rescale arrow intensity penalizing arrows that explain very small amount of expression differences
  • scaling_penalty (float, default=1) – Higher values correspond to a stronger penalty
Returns:

  • Nothing but it creates the following attributes
  • transition_prob (np.ndarray) – the transition probability calculated using the exponential kernel on the correlation coefficient
  • delta_embedding (np.ndarray) – The resulting vector
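
A sketch of projecting velocity onto a 2-d embedding, assuming perform_TSNE stores its result in the ts attribute (the default of the embed parameter); n_sight is illustrative:

>>> vlm.perform_TSNE()
>>> vlm.estimate_transition_prob(hidim="Sx_sz", embed="ts",
...                              transform="sqrt", n_sight=2000)
>>> vlm.calculate_embedding_shift(sigma_corr=0.05, expression_scaling=True)
>>> vlm.delta_embedding.shape  # one 2-d arrow per cell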

calculate_grid_arrows()[source]

Calculate the velocity using points on a regular grid and a gaussian kernel

Note: the function should also work for n-dimensional grids

Parameters:
  • embed (str, default=embedding) – The name of the attribute containing the embedding. It will be retrieved as getattr(self, embed). The difference vector is getattr(self, 'delta' + '_' + embed)
  • smooth (float, default=0.5) – Higher values correspond to taking more distant points into consideration; the standard deviation of the gaussian kernel is smooth * stepsize
  • steps (tuple) – the number of steps in the grid for each axis
  • n_neighbors – number of neighbors to use in the calculation; a bigger number should not change the results much, as long as smooth is small. Higher values correspond to slower execution time
  • n_jobs – number of processes for parallel computing
Returns:

  • Nothing but it sets the attributes
  • flow_embedding (np.ndarray) – the coordinates of the embedding
  • flow_grid (np.ndarray) – the gridpoints
  • flow (np.ndarray) – vector field coordinates
  • flow_magnitude (np.ndarray) – magnitude of each vector on the grid
  • total_p_mass (np.ndarray) – density at each point of the grid

prepare_markov(sigma_D: numpy.ndarray, sigma_W: numpy.ndarray, direction: str = 'forward', cells_ixs: numpy.ndarray = None) → None[source]

Prepare a transition probability for Markov process

Parameters:
  • sigma_D (float) – the standard deviation used on the locality-limiting component
  • sigma_W (float) – the standard deviation used on the noise component
  • direction (str, default="forward") – whether to diffuse forward or backwards
  • cells_ixs (np.ndarray, default=None) – Cells to use, if None all the cells will be considered.
Returns:

  • Nothing but it creates the following attributes
  • tr (np.ndarray) – the transition probability matrix

run_markov(starting_p: numpy.ndarray = None, n_steps: int = 2500, mode: str = 'time_evolution') → None[source]

Run a Markov process

Parameters:
  • starting_p (np.ndarray, default=None) – specifies the starting density; if None is passed, an array of 1/self.tr.shape[0] will be created
  • n_steps (int, default=2500) – Number of steps to be performed
  • mode (str, default="time_evolution") – this argument is passed to the Diffusion.diffuse call
Returns:

  • Nothing but it creates the attribute
  • diffused (np.ndarray) – The probability to be found at any of the states
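
A sketch of a diffusion run; the scalar sigma values are shown purely for illustration (the expected type and scale are data-dependent), so treat this as the shape of the call rather than recommended values:

>>> vlm.prepare_markov(sigma_D=2.0, sigma_W=1.0, direction="forward")
>>> vlm.run_markov(n_steps=2500, mode="time_evolution")
>>> vlm.diffused  # probability of being found at each state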

default_filter_and_norm(min_expr_counts: int = None, min_cells_express: int = None, N: int = None, min_avg_U: float = None, min_avg_S: float = None) → None[source]

Useful function to get started with velocyto: it performs initial filtering and feature selection. It uses some heuristics to determine the thresholds, so results might be suboptimal.

See the analysis quick start guide for further info.

Parameters:
  • min_expr_counts (int, default=None) – filtering condition: the minimum spliced counts
  • min_cells_express (int, default=None) – filtering condition: the minimum number of cells expressing the gene
  • N (int, default=None) – number of genes selected by the feature selection procedure
  • min_avg_U (float, default=None) – if clusters have been specified beforehand (using the function set_clusters), this is the minimum average of unspliced molecules per cluster
  • min_avg_S (float, default=None) – if clusters have been specified beforehand (using the function set_clusters), this is the minimum average of spliced molecules per cluster
default_fit_preparation(k: int = None, n_comps: int = None) → None[source]

Useful function to get started with velocyto: it performs PCA and kNN smoothing. It uses some heuristics to determine the parameters, so results might be suboptimal.

See the analysis quick start guide for further info.

Parameters:
  • k (int, default=None) – k in k-NearestNeighbors smoothing
  • n_comps (int, default=None) – number of components in the PCA
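
A quick-start sketch chaining the two helper functions with the downstream fit (the loom path is hypothetical):

>>> from velocyto.analysis import VelocytoLoom
>>> vlm = VelocytoLoom("mydata.loom")
>>> vlm.default_filter_and_norm()   # heuristic filtering + normalization
>>> vlm.default_fit_preparation()   # heuristic PCA + knn smoothing
>>> vlm.fit_gammas()
>>> vlm.predict_U()
>>> vlm.calculate_velocity()
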
plot_phase_portraits(genes: List[str]) → None[source]

Plot spliced-unspliced scatterplots resembling phase portraits

Parameters:genes (List[str]) – A list of gene symbols.
plot_grid_arrows(quiver_scale: Union[str, float] = 'auto', scale_type: str = 'relative', min_mass: float = 1, min_magnitude: float = None, scatter_kwargs_dict: Dict = None, plot_dots: bool = False, plot_random: bool = False, **quiver_kwargs) → None[source]

Plots vector field averaging velocity vectors on a grid

Parameters:
  • quiver_scale (float, default="auto") – Rescaling factor applied to the arrow field to enhance visibility. If "auto" the scale is selected using the randomized (negative) control (even if plot_random=False). If a float is provided, the interpretation of the value depends on the parameter scale_type, see below. NOTE: in situations where "auto" results in very small or very big velocities, pass a float to this parameter. The float will be interpreted as a scaling; importantly, both the data and the control will be scaled this way, so you can rescale the velocity arbitrarily without the risk of just observing an overfit of the noise
  • scale_type (str, default="relative") – How to interpret quiver_scale: If “relative” (default) the value will be used as a scaling factor and multiplied by the value from “auto” (it follows that quiver_scale=”auto” is equivalent to quiver_scale=1) If “absolute” the value will be passed to the matplotlib quiver function (not recommended if you are not sure what this implies)
  • min_mass (float, default=1) – the minimum density around a grid point for it to be considered and plotted
  • min_magnitude (float, default=None) – the minimum magnitude of the velocity for it to be considered and plotted
  • scatter_kwargs_dict (dict, default=None) – a dictionary of keyword arguments to pass to scatter by default the following are passed: s=20, zorder=-1, alpha=0.2, lw=0, c=self.colorandum. But they can be overridden.
  • plot_dots (bool, default=False) – whether to plot dots in correspondence of all low velocity grid points
  • plot_random (bool, default=False) – whether to plot the randomized control next to the plot
  • **quiver_kwargs (dict) – keyword arguments to pass to quiver By default the following are passed angles=’xy’, scale_units=’xy’, minlength=1.5. But they can be overridden.
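
A sketch of computing and plotting the grid-averaged field after calculate_embedding_shift; calculate_grid_arrows is assumed to accept the parameters listed in its entry above as keywords, and all values are illustrative:

>>> import matplotlib.pyplot as plt
>>> vlm.calculate_grid_arrows(smooth=0.8, steps=(40, 40), n_neighbors=300)
>>> plt.figure(figsize=(9, 9))
>>> vlm.plot_grid_arrows(quiver_scale="auto", min_mass=2, plot_random=False)
>>> plt.show()
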
plot_arrows_embedding(choice: Union[str, int] = 'auto', quiver_scale: Union[str, float] = 'auto', scale_type: str = 'relative', plot_scatter: bool = False, scatter_kwargs: Dict = {}, color_arrow: str = 'cluster', new_fig: bool = False, plot_random: bool = True, **quiver_kwargs) → None[source]

Plots velocity on the embedding cell-wise

Parameters:
  • choice (int or str, default="auto") – the number of cells to randomly pick to plot the arrows (to avoid overcrowding)
  • quiver_scale (float, default="auto") – Rescaling factor applied to the arrow field to enhance visibility. If "auto" the scale is selected using the randomized (negative) control (even if plot_random=False). If a float is provided, the interpretation of the value depends on the parameter scale_type, see below. NOTE: unlike the similar option of plot_grid_arrows, here there is no strong motivation to calculate the scale relative to the randomized control, because the randomized control does not have to have smaller velocity cell-wise: there might be, for example, scattered cells with strong velocity that correctly just average out when calculating the average velocity field
  • scale_type (str, default="relative") – How to interpret quiver_scale: If “relative” (default) the value will be used as a scaling factor and multiplied by the value from “auto” (it follows that quiver_scale=”auto” is equivalent to quiver_scale=1) If “absolute” the value will be passed to the matplotlib quiver function
  • plot_scatter (bool, default = False) – whether to plot the points
  • scatter_kwargs (Dict) – A dictionary containing all the keywords arguments to pass to matplotlib scatter by default the following are passed: c=”0.8”, alpha=0.4, s=10, edgecolor=(0, 0, 0, 1), lw=0.3. But they can be overridden.
  • color_arrow (str, default = "cluster") – the color of the arrows, if “cluster” the arrows are colored the same as the cluster
  • new_fig (bool, default=False) – whether to create a new figure
  • plot_random (bool, default=True) – whether to plot the randomized control next to the plot
  • **quiver_kwargs (dict) – keyword arguments to pass to quiver By default the following are passed angles=’xy’, scale_units=’xy’, minlength=1.5. But they can be overridden.
Returns:

Return type:

Nothing, just plots the tsne with arrows

plot_cell_transitions(cell_ix: int = 0, alpha: float = 0.1, alpha_neigh: float = 0.2, cmap_name: str = 'RdBu_r', plot_arrow: bool = True, mark_cell: bool = True, head_width: int = 3) → None[source]

Plot the probability of a cell to transition to any other cell

This function is untested

plot_velocity_as_color(gene_name: str = None, cmap: Any = <matplotlib.colors.LinearSegmentedColormap object>, gs: Any = None, which_tsne: str = 'ts', **kwargs) → None[source]

Plot velocity as color on the t-SNE embedding

Parameters:
  • gene_name (str) – The name of the gene, should be present in self.S
  • cmap (matplotlib.cm.Colormap, default=matplotlib.cm.RdBu_r) – Colormap to use; divergent ones are better, RdBu_r is the default. Notice that 0 will always be set as the center of the colormap (e.g. white in RdBu_r)
  • gs (Gridspec subplot) – Gridspec subplot to plot on.
  • which_tsne (str, default="ts") – the name of the attribute where the desired embedding is stored
  • **kwargs (dict) – other keywords arguments will be passed to the plt.scatter call
Returns:

Return type:

Nothing

plot_expression_as_color(gene_name: str = None, imputed: bool = True, cmap: Any = <matplotlib.colors.LinearSegmentedColormap object>, gs: Any = None, which_tsne: str = 'ts', **kwargs) → None[source]

Plot expression as color on the t-SNE embedding

Parameters:
  • gene_name (str) – The name of the gene, should be present in self.S
  • imputed (bool, default=True) – whether to plot the smoothed or the raw data
  • cmap (matplotlib.cm.Colormap, default=matplotlib.cm.Greens) – Colormap to use.
  • gs (Gridspec subplot) – Gridspec subplot to plot on.
  • which_tsne (str, default="ts") – the name of the attribute where the desired embedding is stored
  • **kwargs (dict) – other keywords arguments will be passed to the plt.scatter call
Returns:

Return type:

Nothing

reload_raw(substitute: bool = False) → None[source]

Reload raw data as it was before filtering steps

Parameters:substitute (bool, default=False) – if True, S, U, A, ca, ra will all be overwritten; if False, they will be loaded separately as raw_S, raw_U, raw_A, raw_ca, raw_ra
velocyto.analysis.scatter_viz(x: numpy.ndarray, y: numpy.ndarray, *args, **kwargs) → Any[source]

A scatter plot wrapper that guarantees that every point is visible in a very crowded scatterplot

Parameters:
  • x (np.ndarray) – x axis coordinates
  • y (np.ndarray) – y axis coordinates
  • *args and **kwargs – positional and keyword arguments as in matplotlib.pyplot.scatter
Returns:

Return type:

Plots the graph and returns the axes object

velocyto.analysis.ixs_thatsort_a2b(a: numpy.ndarray, b: numpy.ndarray, check_content: bool = True) → numpy.ndarray[source]

Super duper magic sauce to make the order of one list like another: returns the indices that reorder a to match the order of b
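
A minimal sketch, assuming the returned indices reorder a into the order of b:

>>> import numpy as np
>>> from velocyto.analysis import ixs_thatsort_a2b
>>> a = np.array(["g2", "g0", "g1"])
>>> b = np.array(["g0", "g1", "g2"])
>>> ixs = ixs_thatsort_a2b(a, b)
>>> a[ixs]  # now in the same order as b
array(['g0', 'g1', 'g2'], dtype='<U2')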

velocyto.analysis.colormap_fun(x: numpy.ndarray) → numpy.ndarray[source]
velocyto.analysis.numba_random_seed[source]

Same as np.random.seed but for numba

velocyto.analysis.permute_rows_nsign[source]

Permute in place the entries and randomly switch the sign for each row of a matrix independently.

velocyto.analysis.scale_to_match_median(sparse_matrix: scipy.sparse.csr.csr_matrix, genes_total: numpy.ndarray) → scipy.sparse.csr.csr_matrix[source]

Normalize contribution of different neighbor genes to match the median totals

Parameters:
  • sparse_matrix (sparse.csr_matrix) – weights matrix
  • genes_total (np.ndarray, shape=(sparse_matrix.shape[0],)) – array of the total molecules detected for each gene
Returns:

  • knn_weights (sparse.csr_matrix) – sparse_matrix after the normalization
  • Note – given how this function is used downstream, sparse_matrix could have been modified in place instead

velocyto.analysis.gaussian_kernel(X: numpy.ndarray, mu: float = 0, sigma: float = 1) → numpy.ndarray[source]

Compute a Gaussian kernel

velocyto.analysis.load_velocyto_hdf5(filename: str) → velocyto.analysis.VelocytoLoom[source]

Loads a VelocytoLoom object from an hdf5 file

Parameters:filename (str) – The name of the serialized file
Returns:
Return type:A VelocytoLoom object

Note

The hdf5 file must have been created using VelocytoLoom.to_hdf5 or the dump_hdf5 function