CLI Usage Guide

Introduction

Once velocyto is correctly installed on your machine (see the installation tutorial), the velocyto command becomes available in the terminal. velocyto is a command line tool with subcommands. You can list all the available commands by typing velocyto --help, which produces the following output:

Usage: velocyto [OPTIONS] COMMAND [ARGS]...

        Options:
          --help  Show this message and exit.

        Commands:
          run     Runs the velocity analysis outputing a loom file
          run10x  Runs the velocity analysis for a Chromium Sample

You can further query for information on each subcommand by typing velocyto COMMANDNAME --help.

Alternatively, you can visit the online API description page, which includes usage information for all the subcommands.

Preparation

Download genome annotation file

Download a genome annotation (.gtf file), for example from GENCODE or Ensembl. If you use the cellranger pipeline, you should download the gtf that comes prepackaged with it here.

Download expressed repeats annotation

Note

This step is optional.

You might want to mask expressed repetitive elements, since those counts could constitute a confounding factor in the downstream analysis. To do so, download an appropriate expressed repeat annotation (for example from the UCSC genome browser, making sure to select GTF as the output format).

Running velocyto

The general purpose command to start the pipeline for read counting is velocyto run. The run defaults are appropriate for the analysis of both 10X Genomics v1/v2 and InDrops 3’ chemistry.

A typical use of run is:

velocyto run -b filtered_barcodes.tsv -o output_path -m repeat_msk_srt.gtf possorted_genome_bam.bam mm10_annotation.gtf

The general signature for the run subcommand is:

Usage: velocyto run [OPTIONS] BAMFILE GTFFILE

          Runs the velocity analysis outputing a loom file

          BAMFILE bam file with sorted reads

          GTFFILE genome annotation file

        Options:
          -b, --bcfile PATH               Valid barcodes file, to filter the bam. If --bcfile is not specified all the cell barcodes will be incuded.
                                          Cell barcodes should be specified in the bcfile as the `CB` tag for each read
          -o, --outputfolder PATH         Output folder, if it does not exist it will be created.
          -e, --sampleid PATH             The sample name that will be used to retrieve informations from metadatatable
          -s, --metadatatable PATH        Table containing metadata of the various samples (csv formatted, rows are samples and cols are entries)
          -m, --repmask PATH              .gtf file containing intervals to mask
          -l, --logic TEXT                The logic to use for the filtering (default: Default)
          -M, --multimap                  Use reads that did not map uniquely (default: False)
          -x, --molrep                    Outputs pickle files with containing a sample of the read mappings supporting molecule counting. (Useful for development or debugging only)
          -@, --samtools-threads INTEGER  The number of threads to use to sort the bam by cellID file using samtools
          --samtools-memory INTEGER       The number of MB used for every thread by samtools to sort the bam file
          --help                          Show this message and exit.
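
As a rough illustration of how the options above combine with the two positional arguments, here is a minimal Python sketch that assembles a velocyto run invocation as an argument list. The helper name build_run_command and its parameter names are hypothetical and not part of velocyto; only the flags shown in the help text above are used.

```python
import shlex
from typing import List, Optional


def build_run_command(
    bam: str,
    gtf: str,
    output_folder: str,
    barcodes: Optional[str] = None,
    repeat_mask: Optional[str] = None,
    samtools_threads: Optional[int] = None,
    samtools_memory_mb: Optional[int] = None,
) -> List[str]:
    """Assemble the argument list for `velocyto run` (hypothetical helper)."""
    cmd = ["velocyto", "run", "-o", output_folder]
    if barcodes is not None:
        cmd += ["-b", barcodes]  # valid barcodes file
    if repeat_mask is not None:
        cmd += ["-m", repeat_mask]  # .gtf with intervals to mask
    if samtools_threads is not None:
        cmd += ["--samtools-threads", str(samtools_threads)]
    if samtools_memory_mb is not None:
        cmd += ["--samtools-memory", str(samtools_memory_mb)]
    # Positional arguments come last: BAMFILE, then GTFFILE.
    cmd += [bam, gtf]
    return cmd


cmd = build_run_command(
    bam="possorted_genome_bam.bam",
    gtf="mm10_annotation.gtf",
    output_folder="output_path",
    barcodes="filtered_barcodes.tsv",
    repeat_mask="repeat_msk_srt.gtf",
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
# To actually launch it, one could call: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.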

Note

The input bam file needs to be sorted by position. This can be achieved by running samtools sort mybam.bam -o sorted_bam.bam. Bam files generated by cellranger are already sorted this way.

Note

Execution time is ~3h for a typical sample but may vary significantly with sequencing depth and CPU power.

Warning

Running velocyto without specifying a filtered barcode set (-b/--bcfile option) is not recommended; do it at your own risk. Without it, the counter will use all the cell barcodes it encounters, which might result in long runtimes, large memory allocations and a big output matrix.

Notes on velocyto run

As one of its first steps, velocyto run will try to create a copy of the input .bam file sorted by cell barcode. The sorted .bam file will be placed in the same directory as the original file and will be named cellsorted_[ORIGINALBAMNAME]. The sorting procedure uses samtools sort and is expected to be time consuming; because of this, the procedure is performed in parallel by default. It is possible to control this parallelization using the parameters --samtools-threads and --samtools-memory.

Note

If the file cellsorted_[ORIGINALBAMNAME] already exists, the sorting procedure will be skipped and the existing file will be used.
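
To check in advance whether the sorting step would be skipped, the naming rule described above can be reproduced in a few lines of Python. This is only a sketch of that rule; the function names are hypothetical and velocyto's internal check may differ in detail.

```python
from pathlib import Path


def cellsorted_path(bam_path: str) -> Path:
    """Return the expected location of the cell-barcode-sorted copy.

    Follows the convention described above: same directory as the input,
    file name prefixed with "cellsorted_".
    """
    bam = Path(bam_path)
    return bam.with_name("cellsorted_" + bam.name)


def sorting_would_be_skipped(bam_path: str) -> bool:
    """True if a cellsorted_ file already sits next to the input bam."""
    return cellsorted_path(bam_path).exists()


print(cellsorted_path("mypath/sample01/outs/possorted_genome_bam.bam"))
```

Note that a stale or truncated cellsorted_ file from an interrupted run would also be picked up under this rule, so it may be worth deleting such files before rerunning.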

Warning

Most of the velocyto pipeline is single threaded, so several instances can be run on the same multicore machine to process your samples in a time-effective way. However, because of the above-mentioned multithreaded call to samtools sort, running several instances of velocyto run might exhaust the memory and CPU of your system and possibly result in runtime errors. Therefore, for batch jobs we suggest first calling samtools sort -t CB -O BAM -o cellsorted_possorted_genome_bam.bam possorted_genome_bam.bam sequentially and only then running velocyto.
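
One possible way to script the sequential pre-sorting step is sketched below in Python. It builds the same samtools sort command quoted above for each input file and runs them one at a time. The helper names and the dry_run switch are hypothetical, and the second sample path is only an illustration.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def presort_command(bam_path: str) -> List[str]:
    """Build the samtools call that sorts one bam file by the CB tag."""
    bam = Path(bam_path)
    out = bam.with_name("cellsorted_" + bam.name)
    return ["samtools", "sort", "-t", "CB", "-O", "BAM", "-o", str(out), str(bam)]


def presort_sequentially(bam_paths: Iterable[str], dry_run: bool = True) -> List[List[str]]:
    """Sort the files one after another; with dry_run=True nothing is executed."""
    commands = []
    for bam_path in bam_paths:
        cmd = presort_command(bam_path)
        commands.append(cmd)
        if not dry_run:
            # check=True aborts the batch on the first failing sort.
            subprocess.run(cmd, check=True)
    return commands


planned = presort_sequentially(
    [
        "sample01/outs/possorted_genome_bam.bam",
        "sample02/outs/possorted_genome_bam.bam",  # hypothetical second sample
    ]
)
for cmd in planned:
    print(" ".join(cmd))
```

Once every cellsorted_ file exists, the velocyto instances started afterwards should skip their own sorting step and can be launched side by side.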

Run with different logics

The rules used to call spliced, unspliced and ambiguous molecules from the read mappings can be set using the --logic parameter. Each supported logic modifies the behavior of the counter and has a different sensitivity. The currently available logics are:

  • Permissive10X
  • ValidatedIntrons10X (default)
  • Stricter10X
  • ObservedSpanning10X

Despite their names (which reflect their original design for the 10X platform), the logics generalize well to similar chemistries (e.g. Drop-seq).
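
When velocyto is driven from a script, a small guard can catch a misspelled logic name before a long run starts. The sketch below hard-codes the four names listed above; the constant and function names are hypothetical, and the list would need updating if velocyto adds or renames logics.

```python
from typing import List

# Logic names as listed in this guide; not read from velocyto itself.
AVAILABLE_LOGICS = (
    "Permissive10X",
    "ValidatedIntrons10X",
    "Stricter10X",
    "ObservedSpanning10X",
)


def logic_args(logic: str) -> List[str]:
    """Return the --logic arguments, rejecting names not listed above."""
    if logic not in AVAILABLE_LOGICS:
        raise ValueError(
            f"Unknown logic {logic!r}; expected one of {', '.join(AVAILABLE_LOGICS)}"
        )
    return ["--logic", logic]


print(logic_args("Stricter10X"))
```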

Hint

Custom logics supporting peculiarities of other chemistries can be implemented simply by creating a class that inherits from Logic.

Run on a single or multiple 10X Chromium samples

velocyto supports a shortcut to run directly on one or more cellranger output folders (i.e. the folder containing the subfolders outs, outs/analysis and outs/filtered_gene_bc_matrices).

For example, to run the pipeline on the folder mypath/sample01, we would do:

velocyto run10x -m repeat_msk_srt.gtf mypath/sample01 mm10_annotation.gtf

The full signature of the command is:

Usage: velocyto run10x [OPTIONS] SAMPLEFOLDER GTFFILE

          Runs the velocity analysis for a Chromium 10X Sample

          10XSAMPLEFOLDER specifies the cellranger sample folder

          GTFFILE genome annotation file

        Options:
          -s, --metadatatable PATH        Table containing metadata of the various samples (csv fortmated rows are samples and cols are entries)
          -m, --repmask PATH              .gtf file containing intervals to mask
          -l, --logic TEXT                The logic to use for the filtering (default: Default)
          -M, --multimap                  Use reads that did not map uniquely (default: False)
          -@, --samtools-threads INTEGER  The number of threads to use to sort the bam by cellID file using samtools
          --samtools-memory INTEGER       The number of MB used for every thread by samtools to sort the bam file
          --help                          Show this message and exit.
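
Since the signature above takes a single sample folder, processing several samples means one run10x invocation per folder. The Python sketch below shows one way this could be organized, with a cap on how many instances run at once. All helper names are hypothetical, mypath/sample02 is an illustrative folder, and the example uses a print-only runner so that nothing is executed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def run10x_command(sample_folder: str, gtf: str, repeat_mask: Optional[str] = None) -> List[str]:
    """Build the argument list for one `velocyto run10x` call."""
    cmd = ["velocyto", "run10x"]
    if repeat_mask is not None:
        cmd += ["-m", repeat_mask]
    # Positional arguments: SAMPLEFOLDER, then GTFFILE.
    return cmd + [sample_folder, gtf]


def run_all(
    sample_folders: Sequence[str],
    gtf: str,
    repeat_mask: Optional[str] = None,
    max_workers: int = 2,
    runner: Optional[Callable[[List[str]], int]] = None,
) -> List[int]:
    """Run one command per folder, at most max_workers at a time.

    Returns the exit codes in the same order as sample_folders.
    """
    if runner is None:
        # Default runner really executes the command.
        runner = lambda cmd: subprocess.run(cmd).returncode
    commands = [run10x_command(folder, gtf, repeat_mask) for folder in sample_folders]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


def print_only(cmd: List[str]) -> int:
    """Dry-run runner: show the command instead of executing it."""
    print(" ".join(cmd))
    return 0


codes = run_all(
    ["mypath/sample01", "mypath/sample02"],  # sample02 is hypothetical
    "mm10_annotation.gtf",
    repeat_mask="repeat_msk_srt.gtf",
    runner=print_only,
)
```

Threads are sufficient here because each worker only waits on an external process. Keeping max_workers low limits the combined memory and CPU use of the samtools sort step that each instance may start.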

About the output .loom file

The main result file is a 4-layered loom file: sample_id.loom.

A valid .loom file is simply an HDF5 file that contains specific groups representing the main matrix as well as row and column attributes. Because of this, .loom files can be created and read by any language that supports HDF5.

.loom files can be easily handled using the loompy package.
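
As a quick sanity check on the output, the following Python sketch opens a loom file with loompy and reports its dimensions, layer names and attribute names. It assumes loompy is installed; the function name summarize_loom is hypothetical, and sample_id.loom stands in for whatever file name your run produced.

```python
from typing import Any, Dict


def summarize_loom(path: str) -> Dict[str, Any]:
    """Collect basic facts about a .loom file without loading the matrices."""
    import loompy  # third-party dependency, imported lazily

    # Open read-only so the file cannot be modified by accident.
    with loompy.connect(path, mode="r") as ds:
        return {
            "shape": ds.shape,  # (number of rows, number of columns)
            "layers": list(ds.layers.keys()),
            "row_attrs": list(ds.ra.keys()),
            "col_attrs": list(ds.ca.keys()),
        }


# Example usage (requires an existing file):
#     print(summarize_loom("sample_id.loom"))
```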

Get started with the analysis

At this point you are ready to start analyzing your .loom file. To get started, read our analysis tutorial and have a look at the notebook examples.