CLI Usage Guide

Introduction

Once velocyto is correctly installed on your machine, the velocyto command becomes available. velocyto is a command line tool with subcommands. You can list all the available commands by typing velocyto --help, which produces the following output.

Usage: velocyto COMMAND [ARGS]...

Options:
    --help  Show this message and exit.

Commands:
    extract_intervals  Transform a genome annotation .gtf file into a intervals .txt file.
    extract_repeats    Transform a repeats .gtf file into a intervals .gtf file.
    multi10x           Runs the velocity analysis on multiple Chromium samples
    run                Runs the velocity analysis outputting a loom file
    run10x             Runs the velocity analysis for a Chromium Sample

You can further query for information on each subcommand by typing velocyto COMMANDNAME --help.

Alternatively, you can visit the online API description page, which includes usage information for all the subcommands.

Preparation of the intronic/exonic intervals

Note

This step needs to be performed only once per genome annotation. Execution time is ~1h.

The first step is to build your preprocessed annotation files starting from a genome annotation (.gtf file). For example, if you use the default cellranger pipeline, the annotation comes prepackaged and can be downloaded here. The script scans the annotation and flags the cases where features overlap.

If genes.gtf is your annotation file, the typical use of the command is:

velocyto extract_intervals genes.gtf -p somepath/genomename

And the full signature is:

Usage: velocyto extract_intervals [OPTIONS] GTF_FILE

Transform a genome annotation .gtf file into a intervals .txt file required to run velocyto.

GTF_FILE: input file

Options:
    --dogenes / --no-dogenes  whether to process gene models  [default: True]
    --dotrs / --no-dotrs      whether to process transcript models  [default: True]
    -p, --outfileprefix PATH  prefix to the output files [REQUIRED]
    --help                    Show this message and exit.

--outfileprefix/-p PATH is mandatory and should be set to something like somepath/genomename.

If the command runs correctly, the file somepath/genomename_gene_ivls.txt will be produced. At this point you are ready to produce a .loom file for all your datasets.
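The naming convention above can be sketched in Python. This is a minimal illustration, not part of velocyto: the helper names extract_intervals_command and expected_intervals_file are hypothetical, and the only facts taken from this guide are the command layout and the _gene_ivls.txt suffix.

```python
from pathlib import Path
from typing import List


def extract_intervals_command(gtf_file: str, outfile_prefix: str) -> List[str]:
    """Build the argument list for `velocyto extract_intervals GTF_FILE -p PREFIX`."""
    return ["velocyto", "extract_intervals", gtf_file, "-p", outfile_prefix]


def expected_intervals_file(outfile_prefix: str) -> Path:
    """Return the file this guide says is produced: <prefix>_gene_ivls.txt."""
    return Path(outfile_prefix + "_gene_ivls.txt")


if __name__ == "__main__":
    prefix = "somepath/genomename"
    # Print the command instead of running it; velocyto may not be installed here.
    print(" ".join(extract_intervals_command("genes.gtf", prefix)))
    out = expected_intervals_file(prefix)
    print(out, "exists" if out.exists() else "has not been produced yet")
```

Checking for the expected file before launching the counting step is a cheap way to catch a mistyped prefix.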

Prepare repeats annotation

Note

This step is optional but recommended!

Download an appropriate repeat mask annotation (for example from the UCSC genome browser, remembering to select GTF as the output format) and run the following command:

velocyto extract_repeats mm10_rmsk.gtf

This will generate the file mm10_rmsk_joined.gtf. The command sorts the file and merges very close repeat intervals into bigger intervals that will be masked in the downstream pipeline.
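The idea of sorting and then merging nearby intervals can be illustrated with a short Python sketch. It is not velocyto's implementation: the function name merge_close_intervals, the tuple layout and the max_gap threshold are assumptions made for this example, and the distance velocyto actually treats as "very close" is not stated in this guide.

```python
from typing import List, Tuple

# (chromosome, strand, start, end); an illustrative layout, not velocyto's internal one
Interval = Tuple[str, str, int, int]


def merge_close_intervals(intervals: List[Interval], max_gap: int = 5) -> List[Interval]:
    """Sort by chromosome, strand and start, then merge neighbours at most max_gap apart."""
    ordered = sorted(intervals, key=lambda iv: (iv[0], iv[1], iv[2]))
    merged: List[Interval] = []
    for chrom, strand, start, end in ordered:
        if merged:
            p_chrom, p_strand, p_start, p_end = merged[-1]
            same_track = (chrom, strand) == (p_chrom, p_strand)
            if same_track and start - p_end <= max_gap:
                # Extend the previous interval instead of starting a new one.
                merged[-1] = (p_chrom, p_strand, p_start, max(p_end, end))
                continue
        merged.append((chrom, strand, start, end))
    return merged


if __name__ == "__main__":
    repeats = [
        ("chr1", "+", 203, 300),
        ("chr1", "+", 100, 200),
        ("chr1", "-", 150, 250),
        ("chr1", "+", 400, 500),
    ]
    for interval in merge_close_intervals(repeats):
        print(interval)
```

Sorting first means each interval only has to be compared with the last merged one, so the merge is a single pass.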

Run velocyto

The general-purpose command to start the read counting pipeline is velocyto run. The logic of the run subcommand is compatible with both 10X Genomics v1/v2 and InDrops 3’ chemistry. However, for data generated on the 10X Genomics platform using the cellranger pipeline, we suggest using the shortcut run10x described below.

A typical use of run is:

velocyto run -b valid_barcodes.txt -o output_path -m mm10_rmsk_joined.gtf mapped_reads.bam mm10_gene_ivls.txt

The general signature for the run subcommand is:

Usage: velocyto run [OPTIONS] BAMFILE IVLFILE

Runs the velocity analysis outputting a loom file

BAMFILE bam file with sorted reads

IVLFILE text file generated by velocyto extract_intervals

Options:
-b, --bcfile PATH         Valid barcodes file, to filter the bam. If --bcfile is not specified all the cell barcodes will be included. Cell barcodes should be specified in the bcfile as the CB tag of each read
-o, --outputfolder PATH   Output folder, does not need to exist
-d, --sampleid PATH       The sample name that will be used to retrieve informations from metadatatable
-s, --metadatatable PATH  Table containing metadata of the various samples (csv formatted, [row:samples, col:entry])
-m, --repmask PATH        .gtf file containing intervals sorted by chromosome, strand, position (e.g. by running sort -k1,1 -k7,7 -k4,4n mm10_rmsk.gtf > mm10_rmsk_sorted.gtf; velocyto extract_repeats mm10_rmsk_sorted.gtf)
-d, --debug               debug mode. It will generate .sam files of individual reads (not molecules) that are identified as exons, introns, ambiguous and chimeras
--help                    Show this message and exit.

Warning

Running velocyto run without specifying --bcfile is not recommended: it has not been appropriately tested yet.
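When velocyto run is launched from a script, the command line can be assembled programmatically. The sketch below is only an illustration: velocyto_run_command is a hypothetical helper, it uses only the -b, -o and -m options listed above, and it emits a Python warning when no barcode file is given, mirroring the caution in this guide.

```python
import warnings
from typing import List, Optional


def velocyto_run_command(
    bamfile: str,
    ivlfile: str,
    bcfile: Optional[str] = None,
    outputfolder: Optional[str] = None,
    repmask: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `velocyto run [OPTIONS] BAMFILE IVLFILE`."""
    cmd = ["velocyto", "run"]
    if bcfile is None:
        # The guide warns that running without --bcfile has not been tested properly.
        warnings.warn("no --bcfile given: all cell barcodes will be included")
    else:
        cmd += ["-b", bcfile]
    if outputfolder is not None:
        cmd += ["-o", outputfolder]
    if repmask is not None:
        cmd += ["-m", repmask]
    # Positional arguments go last, in the order shown in the usage line.
    return cmd + [bamfile, ivlfile]


if __name__ == "__main__":
    cmd = velocyto_run_command(
        "mapped_reads.bam",
        "mm10_gene_ivls.txt",
        bcfile="valid_barcodes.txt",
        outputfolder="output_path",
        repmask="mm10_rmsk_joined.gtf",
    )
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where velocyto is installed.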

The metadatatable is a csv file containing metadata for multiple samples; its content is transferred to the column attributes of the produced .loom file. It should be formatted with one row per sample and one column per metadata entry.
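As an illustration of the [row:samples, col:entry] layout, the Python sketch below builds a small table and looks up the row of one sample. Everything specific in it is hypothetical: the column names SampleID, Tissue and Age, the sample values, and the assumption that the sample identifier lives in a column named SampleID are not taken from velocyto.

```python
import csv
import io
from typing import Dict

# Hypothetical metadata table: one row per sample, one column per metadata entry.
EXAMPLE_TABLE = """SampleID,Tissue,Age
sample01,cortex,P7
sample02,hippocampus,P14
"""


def sample_metadata(table_text: str, sample_id: str, key_column: str = "SampleID") -> Dict[str, str]:
    """Return the metadata row whose key_column equals sample_id."""
    reader = csv.DictReader(io.StringIO(table_text))
    for row in reader:
        if row[key_column] == sample_id:
            return dict(row)
    raise KeyError(f"sample {sample_id!r} not found in column {key_column!r}")


if __name__ == "__main__":
    print(sample_metadata(EXAMPLE_TABLE, "sample02"))
```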

Run on a single or multiple 10X Chromium samples

velocyto supports a shortcut to run directly on one or more cellranger output folders (i.e. the folder containing the subfolders outs, outs/analysis and outs/filtered_gene_bc_matrices).

For example, to run the pipeline on the folder mypath/sample01, we would do:
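Before launching the shortcut, it can help to confirm that a folder really looks like a cellranger output folder. The Python sketch below is a hypothetical pre-flight check, not something velocyto provides: the subfolder names follow the description above, with outs/analysis assumed to be the intended name of the analysis subfolder, and the list can be overridden if your cellranger version lays things out differently.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence

# Subfolders named in this guide; adjust if your cellranger layout differs.
EXPECTED_SUBFOLDERS = ("outs", "outs/analysis", "outs/filtered_gene_bc_matrices")


def missing_subfolders(sample_folder: str, expected: Sequence[str] = EXPECTED_SUBFOLDERS) -> List[str]:
    """Return the expected subfolders that do not exist under sample_folder."""
    root = Path(sample_folder)
    return [sub for sub in expected if not (root / sub).is_dir()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that lacks the analysis subfolder.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "outs" / "filtered_gene_bc_matrices").mkdir(parents=True)
        print("missing:", missing_subfolders(tmp))
```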

velocyto run10x -m mm10_rmsk_joined.gtf mypath/sample01 mm10_gene_ivls.txt

The full signature of the command is:

Usage: velocyto run10x [OPTIONS] SAMPLEFOLDER IVLFILE

Runs the velocity analysis for a Chromium 10X Sample

10XSAMPLEFOLDER specifies the cellranger sample folder

IVLFILE text file generated by velocyto extract_intervals

Options:
-s, --metadatatable PATH  Table containing metadata of the various samples (csv formatted, [row:samples, col:entry])
-m, --repmask PATH        .gtf file containing intervals sorted by chromosome, strand, position
                            (e.g. generated by running `velocyto extract_repeats mm10_rmsk.gtf`)
-z, --introns TEXT        introns validation heuristic mode. if `strict` it will require exon-intron spanning evidence; if `permissive` it does not check for spanning
-d, --debug               debug mode. It will generate .sam files of individual reads (not molecules) that are identified as exons, introns, ambiguous and chimeras
--help                    Show this message and exit.

In addition to run10x, the command multi10x allows running many samples in parallel. For example, the following command will run all the samples present in parentfolder, processing up to 8 samples at a time.

velocyto multi10x -n 8 -l logfolder -m mm10_rmsk_joined.gtf parentfolder mm10_gene_ivls.txt

The logs of each process will be found inside logfolder.

Usage: velocyto multi10x [OPTIONS] PARENTFOLDER IVLFILE

Runs the velocity analysis on multiple Chromium samples in parallel, spawning several subprocesses

Options:
-n, --number INTEGER      Number of processes to execute
-w, --wait INTEGER        Delay in seconds between the executions of single run commands
-s, --metadatatable PATH  Table containing metadata of the various samples (csv formatted, [row:samples, col:entry])
-m, --repmask PATH        .gtf file containing intervals sorted by chromosome, strand, position
                            (e.g. generated by running `velocyto extract_repeats mm10_rmsk.gtf`)
-z, --introns TEXT        introns validation heuristic mode. if `strict` it will require exon-intron spanning evidence; if `permissive` it does not check for spanning
-l, --logfolder PATH      Folder where all the log files will be generated
-d, --debug               debug mode. It will generate .sam files of individual reads (not molecules) that are identified as exons, introns, ambiguous and chimeras
--help                    Show this message and exit.
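The pattern of running several per-sample commands with a cap on concurrency and a delay between launches can be sketched in Python as follows. This is a conceptual illustration under stated assumptions, not how multi10x is implemented: run10x_command and run_in_parallel are hypothetical helpers, and the runner is injected so that the example can be tried without velocyto installed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def run10x_command(sample_folder: str, ivlfile: str, repmask: Optional[str] = None) -> List[str]:
    """Build the argument list for `velocyto run10x [OPTIONS] SAMPLEFOLDER IVLFILE`."""
    cmd = ["velocyto", "run10x"]
    if repmask is not None:
        cmd += ["-m", repmask]
    return cmd + [sample_folder, ivlfile]


def run_in_parallel(
    commands: Sequence[List[str]],
    runner: Callable[[List[str]], T],
    n_workers: int = 8,
    wait: float = 0.0,
) -> List[T]:
    """Run each command with at most n_workers at a time, pausing `wait` seconds between submissions."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = []
        for cmd in commands:
            futures.append(pool.submit(runner, cmd))
            time.sleep(wait)
        # Results come back in submission order, not completion order.
        return [future.result() for future in futures]


if __name__ == "__main__":
    commands = [
        run10x_command(f"parentfolder/sample0{i}", "mm10_gene_ivls.txt", repmask="mm10_rmsk_joined.gtf")
        for i in (1, 2, 3)
    ]
    # A stand-in runner that only formats the command line.
    for line in run_in_parallel(commands, runner=" ".join, n_workers=2):
        print(line)
```

On a real system the runner could be something like lambda cmd: subprocess.run(cmd).returncode, with each process writing to its own log file.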

Note

Execution time is ~2h30m per sample but may vary significantly with sequencing depth and CPU power.

About the output .loom file

The main result file is a 4-layered loom file: sample_id.loom.

A valid .loom file is simply an HDF5 file that contains specific groups representing the main matrix as well as row and column attributes. Because of this, .loom files can be created and read by any language that supports HDF5.
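Because a .loom file is plain HDF5, its structure can be inspected with a generic HDF5 library. The Python sketch below assumes the third-party h5py package is installed and that the file follows the Loom layout, with the main matrix stored at /matrix and the attributes under /row_attrs and /col_attrs (additional layers, when present, under /layers). The helper names are hypothetical, and sample_id.loom stands for whatever file your run produced.

```python
from typing import Iterable, List

# Top-level entries expected in a Loom file: the main matrix plus row and column attributes.
REQUIRED_ENTRIES = ("matrix", "row_attrs", "col_attrs")


def missing_loom_entries(top_level_names: Iterable[str]) -> List[str]:
    """Return the required top-level entries that are absent from the given names."""
    present = set(top_level_names)
    return [name for name in REQUIRED_ENTRIES if name not in present]


def describe_loom(path: str) -> None:
    """Print the matrix shape, attribute names and layer names of a .loom file."""
    import h5py  # third-party; imported here so the helper above works without it

    with h5py.File(path, "r") as f:
        missing = missing_loom_entries(f.keys())
        if missing:
            raise ValueError(f"{path} does not look like a loom file, missing: {missing}")
        print("matrix shape:", f["matrix"].shape)
        print("row attributes:", sorted(f["row_attrs"].keys()))
        print("column attributes:", sorted(f["col_attrs"].keys()))
        if "layers" in f:
            print("layers:", sorted(f["layers"].keys()))


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        describe_loom(sys.argv[1])  # e.g. python describe_loom.py sample_id.loom
```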

.loom files can be easily handled using the loompy package.