CLI Internals¶
The following page describes the internal API used by the Command Line Pipeline. These functions and objects are not meant for interactive usage. So this page is useful if you want to change the behavior of the molecule counting pipeline.
velocyto.counter module¶
-
class
velocyto.counter.
ExInCounter
(sampleid: str, logic: velocyto.logic.Logic, valid_bcset: Set[str] = None, umi_extension: str = 'no', onefilepercell: bool = False, dump_option: str = '0', outputfolder: str = './', loom_numeric_dtype: str = 'uint16')[source]¶ Bases:
object
Main class to do the counting of introns and exons
-
peek
(bamfile: str, lines: int = 1000) → None[source]¶ Peeks into the samfile to determine if it is a cellranger or dropseq file
-
peek_umi_only
(bamfile: str, lines: int = 30) → None[source]¶ Peeks for umi into the samfile to determine if it is a cellranger or dropseq file
-
iter_alignments
(bamfiles: Tuple[str], unique: bool = True, yield_line: bool = False) → Iterable[source]¶ Iterates over the bam/sam file and yield Read objects
Parameters: - bamfiles (Tuple[str]) – path to the bam files
- unique (bool) – yield only unique alignments
- yield_line (bool) – whether to yield the raw sam line
Returns: - yields vcy.Read for each valid line of the bam file
- or a Tuple (vcy.Read, sam_line) if
yield_line==True
- NOTE (At the file change it yields a None)
-
read_repeats
(gtf_file: str, tolerance: int = 5) → Dict[str, List[velocyto.feature.Feature]][source]¶ Read repeats and merge close ones into highly repetitive areas
Parameters: - gtf_file (str) – file to read
- tolerance (int, default=5) – if two repeats intervals to be masked are found closer than tolerance bases from each other they are fused in one bigger masked interval. Notice that in the downstream analysis only reads that are fall inside mask intervals are discarded
Returns: mask_ivls_by_chromstrand – A dictionary key: chromosome+strand value: list of features (repeat intervals) (The reference is returned but an internal attribure self.self.masked_by_chrm_strand is kept)
Return type: Dict[str, List[vcy.Feature]]
-
assign_indexes_to_genes
(features: Dict[str, velocyto.transcript_model.TranscriptModel]) → None[source]¶ Assign to each newly encoutered gene an unique index corresponding to the output matrix column ix
-
read_transcriptmodels
(gtf_file: str) → Dict[str, Dict[str, velocyto.transcript_model.TranscriptModel]][source]¶ Reads transcript models from a sorted .gtf file
Parameters: gtf_file (str) – Path to the sorted gtf file Returns: - annotations_by_chrm_strand (Dict[str, List[vcy.TrancriptModel]]) – A dictionary key: chromosome+strand value: list of trascript models (The reference is returned but an internal attribure self.annotations_by_chrm_strand is kept)
- There will exist an object vcy.Features for the same exon appearing in a different vcy.TranscriptModel. (his is desired)
-
peek_and_correct
(gtf_lines: List[str]) → List[str][source]¶ Look at the first 20 instances of a list of lines of a gtf file to dermine if exon number is specified as it should. If econ number is not contained it will infer the exon number sorting the list by lexicographic ordering tr_id, start, end
Parameters: gtf_lines – a list of the lines of a gtf file Returns: the same list or the list corrected with added a exon number (filtered to contain only exons) Return type: gtf_lines
-
mark_up_introns
(bamfile: Tuple[str], multimap: bool) → None[source]¶ Mark up introns that have reads across exon-intron junctions
Parameters: - bamfile (Tuple[str]) – path to the bam files to markup
- logic (vcy.Logic) – The logic object to use, changes in different techniques / levels of strictness NOTE: Right now it is not used
Returns: Return type: Nothing it just add to validation to the vcy.Features
Note
Situations not considered: # If an the exon is so short that is possible to get both exonA-exonB junction and exonB-intronB boundary in the same read
-
count
(bamfile: Tuple[str], multimap: bool, cell_batch_size: int = 100, molecules_report: bool = False) → Tuple[Dict[str, List[numpy.ndarray]], List[str]][source]¶ Do the counting of molecules
Parameters: - bamfile (str) – path to the bam files to markup
- cell_batch_size (int, default = 50) – it defines whether to require or not exon-intron spanning read to consider an intron valid.
Returns: Return type: dict_list_arrays, cell_bcs_order
Note
The memory footprint could be reduced allocating scipy sparse arrays
-
velocyto.transcript_model module¶
-
class
velocyto.transcript_model.
TranscriptModel
(trid: str, trname: str, geneid: str, genename: str, chromstrand: str)[source]¶ Bases:
object
A simple object representing a transcript model as a list of vcy.Feature objects
-
trid
¶
-
trname
¶
-
geneid
¶
-
genename
¶
-
chromstrand
¶
-
list_features
¶
-
start
¶ NOTE – This should be accessed only after the creation of the transcript model is finished (i.e.) after append_exon has been called to add all the exons/introns
-
end
¶ NOTE – This should be accessed only after the creation of the transcript model is finished (i.e.) after append_exon has been called to add all the exons/introns
-
append_exon
(exon_feature: velocyto.feature.Feature) → None[source]¶ Append an exon and create an intron when needed
Parameters: exon_feature (vcy.Feature) – A feature object represneting an exon to add to the transcript model.
-
chop_if_long_intron
(maxlen: int = 1000000) → None[source]¶ Modify a Transcript model choppin the 5’ region upstram of a very long intron To avoid that extremelly long intron mask the counting of interal genes
Parameters: maxlen (int, default=vcy.LONGEST_INTRON_ALLOWED) – transcript model tha contain one or more intronic interval of len == maxlen will be chopped Returns: - Nothing it will call _remove_upstream_of or _remove_downstream_of on the transcript model
- its name will be changed appending _mod to both trid and trname
-
velocyto.segment_match module¶
-
class
velocyto.segment_match.
SegmentMatch
(segment: Tuple[int, int], feature: velocyto.feature.Feature, is_spliced: bool = False)[source]¶ Bases:
object
-
segment
¶
-
feature
¶
-
is_spliced
¶
-
maps_to_intron
¶
-
maps_to_exon
¶
-
skip_makes_sense
¶ If the SKIP in the segment matches some extremity of the feature and therefore can be interpreted as a splice event
-
velocyto.feature module¶
-
class
velocyto.feature.
Feature
(start: int, end: int, kind: int, exin_no: str, transcript_model: Any = None)[source]¶ Bases:
object
A simple class representing an annotated genomic feature (e.g. exon, intron, masked repeat)
-
start
¶
-
end
¶
-
transcript_model
¶
-
kind
¶
-
exin_no
¶
-
is_validated
¶
-
is_last_3prime
¶
-
get_downstream_exon
() → Any[source]¶ To use only for introns. Returns the vcy.Feature corresponding to the neighbour exon downstream
Note
In a 15 exons transcript model: Downstream to intron10 is exon11 or the interval with index 20 if strand “+”. Downtream to intron10 is exon10 or the interval with index 10 if strand “-“
-
get_upstream_exon
() → Any[source]¶ To use only for introns. Returns the vcy.Feature corresponding to the neighbour exon downstream
Note
In a 15 exons transcript model: Upstream to intron10 is exon9 or the interval with inxex 18 if strand “+”. Upstream to intron10 is exon11 or the interval with inxex 8 if strand “-“
-
ends_upstream_of
(read: velocyto.read.Read) → bool[source]¶ - The following situation happens
- Read
*|||segment|||-?-||segment|||????????
???????|||||Ivl|||||||||*
-
doesnt_start_after
(segment: Tuple[int, int]) → bool[source]¶ One of the following situation happens
*||||||segment|||||????????
-
contains
(segment: Tuple[int, int], minimum_flanking: int = 5) → bool[source]¶ One of following situation happens
- —–||||||segment|||||—–
|||||||||||||Ivl||||||||||||||||
—–||||||segment|||||—–|||||||||||||Ivl||||||||||||||||
—–||||||segment|||||—–|||||||||||||Ivl||||||||||||||||
where — idicates the minimum flanking
-
velocyto.indexes module¶
-
class
velocyto.indexes.
TransciptsIndex
(trascript_models: List[velocyto.transcript_model.TranscriptModel])[source]¶ Bases:
object
-
transcipt_models
¶
-
tidx
¶
-
maxtidx
¶
-
scan_not_terminated
¶ Return false when all the chromosome has been scanned
-
find_overlapping_trascript_models
(read: velocyto.read.Read) → Set[velocyto.transcript_model.TranscriptModel][source]¶ Finds all the Transcript models the Read overlaps with
Parameters: read (vcy.Read) – the read object to be analyzed Returns: matched_transcripts – TranscriptModel the read is overlapping with and values the kind of overlapping it is one of vcy.MATCH_INSIDE (1), vcy.MATCH_OVER5END (2), vcy.MATCH_OVER3END (4) Return type: set of vcy.TranscriptModel
-
-
class
velocyto.indexes.
FeatureIndex
(ivls: List[velocyto.feature.Feature] = [])[source]¶ Bases:
object
Search help class used to find the intervals that a read is spanning
-
last_interval_not_reached
¶
-
has_ivls_enclosing
(read: velocyto.read.Read) → bool[source]¶ Finds out if there are intervals that are fully containing all the read segments
Parameters: read (vcy.Read) – the read object to be analyzed Returns: respones – if one has been found Return type: bool
-
mark_overlapping_ivls
(read: velocyto.read.Read) → None[source]¶ Finds the overlap between Read and Features and mark intronic features if spanned
Parameters: read (vcy.Read) – the read object to be analyzed Returns: Return type: Nothing, it marks the vcy.Feature object (is_validated = True) if there is evidence of exon-intron spanning
-
find_overlapping_ivls
(read: velocyto.read.Read) → Dict[velocyto.transcript_model.TranscriptModel, List[velocyto.segment_match.SegmentMatch]][source]¶ Finds the possible overlaps between Read and Features and return a 1 read derived mapping record
Parameters: read (vcy.Read) – the read object to be analyzed Returns: mapping_record – A record of the mappings by transcript model. Every entry contains a list of segment matches that in turn contains information on the segment and the feature Return type: Dict[vcy.TranscriptModel, List[vcy.SegmentMatch]] Note
- It is possible that a segment overalps at the same time an exon and an intron (spanning segment)
- It is not possible that a segment overalps at the same time two exons. In that case the read is splitted
into two segments and the Read attribute is_spliced == True. - Notice that the name of the function might be confousing. if there is a non valid overallapping an empty mappign record will be return - Also notice that returning an empty mapping record will cause the suppression of the counting of the molecule
-
velocyto.molitem module¶
-
velocyto.molitem.
dictionary_union
(d1: DefaultDict[Any, List], d2: DefaultDict[Any, List]) → DefaultDict[Any, List][source]¶ Set union (|) operation on default dicitonary
Parameters: - d1 (defaultdict) – First default dict
- d2 (defaultdict) – Second default dict
Returns: - A dictionary with the key the set union of the keys.
- If same key is present the entry will be combined using __add__
-
velocyto.molitem.
dictionary_intersect
(d1: DefaultDict[Any, List], d2: DefaultDict[Any, List]) → DefaultDict[Any, List][source]¶ Set intersection (&) operation on default dicitonary
Parameters: - d1 (defaultdict) – First default dict
- d2 (defaultdict) – Second default dict
Returns: - A dictionary with the key the set intersection of the keys.
- If same key is present the entry will be combined using __add__
velocyto.gene_info module¶
velocyto.read module¶
-
class
velocyto.read.
Read
(bc: str, umi: str, chrom: str, strand: str, pos: int, segments: List, clip5: Any, clip3: Any, ref_skipped: bool)[source]¶ Bases:
object
Container for reads from sam alignment file
-
bc
¶
-
umi
¶
-
chrom
¶
-
strand
¶
-
pos
¶
-
segments
¶
-
clip5
¶
-
clip3
¶
-
ref_skipped
¶
-
is_spliced
¶
-
start
¶
-
end
¶
-
span
¶
-