scRNA-seq

Last updated: 2021-07-27

Checks: 7 0

Knit directory: MAESTRO_documentation/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20201223)

The command set.seed(20201223) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 4a7b06f

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 4a7b06f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/multi-scatac/.DS_Store
    Ignored:    data/multi-scrna/

Untracked files:
    Untracked:  .RDataTmp
    Untracked:  code/snRNA_genelength.R
    Untracked:  code/snRNA_genelength_TPM.R
    Untracked:  data/allen/
    Untracked:  data/multi-scatac/bed/
    Untracked:  data/multi-scatac/bigwig/GDF7_bed_added.png
    Untracked:  output/snRNA_TPM_VS_genelength.png
    Untracked:  output/snRNA_genelengthVSLognormUMI.png
    Untracked:  output/snRNA_genelengthVSUMI.png

Unstaged changes:
    Deleted:    MultiSample_scATACseq.Rproj
    Deleted:    atac_pbmc_500_nextgem.GIGGLE/0.peaks.bed
    Deleted:    atac_pbmc_500_nextgem.GIGGLE/0.peaks.bed.gz
    Deleted:    atac_pbmc_500_nextgem.GIGGLE/0.peaks.bed.result.xls
    Modified:   code/Basic_Operations.R
    Modified:   pbmc_1k_v3_8k_res.rds
    Modified:   pbmc_1k_v3_Monocyte_filtered.pdf
    Modified:   pbmc_1k_v3_Monocyte_top.pdf

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

Introduction

In this example, we will be analyzing a scRNA-seq dataset of 1K human peripheral blood mononuclear cells (PBMCs) freely available from 10X Genomics. The raw dataset can be downloaded from the 10X Genomics website. We will show you how to run through the whole MAESTRO pipeline from the raw sequencing fastq files to the final results.

Run MAESTRO pipeline

Step 0. Download the data and prepare the environment

Please download the raw data from 10X genomics website.

$ wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar
$ tar xvf pbmc_1k_v3_fastqs.tar

Before running MAESTRO, users need to activate the MAESTRO environment.

$ conda activate MAESTRO

Step 1. Configure the MAESTRO workflow

Initialize the MAESTRO scRNA-seq workflow using MAESTRO scrna-init command. This will install a Snakefile and a config file in this directory.

$ MAESTRO scrna-init \
--platform 10x-genomics --species GRCh38 \
--cores 32 --rseqc --directory pbmc_1k_v3_fastqs \
--count-cutoff 1000 --gene-cutoff 500 --cell-cutoff 10 \
--mapindex references/Refdata_scRNA_MAESTRO_GRCh38_1.2.2/GRCh38_STAR_2.7.6a \
--whitelist references/whitelist/3M-february-2018.txt \
--barcode-start 1 --barcode-length 16 --umi-start 17 --umi-length 12 \
--lisadir references/hg38_1000_2.0.h5 \
--signature human.immune.CIBERSORT

To get a full description of command-line options, please use the command

$ MAESTRO scrna-init -h

Here we list all the arguments and their description.

Input files arguments:

Arguments	Description
`--platform`	{10x-genomics,Dropseq,Smartseq2} Platform of single cell RNA-seq. DEFAULT: 10x-genomics.
`--sample-file`	JSON file with sample file information.
`--fastq-barcode`	Specify the barcode fastq file, only for the platform of ‘Dropseq’. If there are multiple pairs of fastq, please provide a comma-separated list of barcode fastq files. For example, `--fastq-barcode test1_1.fastq,test2_1.fastq`.
`--fastq-transcript`	Specify the transcript fastq file, only for the platform of ‘Dropseq’.
`--species`	{GRCh38,GRCm38} Specify the genome assembly (GRCh38 for human and GRCm38 for mouse). DEFAULT: GRCh38.

STARsolo reads mapping arguments:

Arguments Description

--STARsolo_Features {Gene,GeneFull,Gene GeneFull,SJ,Velocyto} Parameters passed to STARsolo –soloFeatures. Specify –soloFeatures Gene for single-cell data. Specify –soloFeatures GeneFull for single-nuclei data. Specify –soloFeatures Gene GeneFull for getting both counts in exons level and exon + intron level (velocity). If multiple features are listed, e.g. Gene GeneFull SJ, only the count matrix generated by the first feature (here: Gene) will be used for downstream analysis. DEFAULT: Gene.

--STARsolo_Threads Threads for running STARsolo. DEFAULT: 12.

Arguments	Description
`--STARsolo_Features`	{Gene,GeneFull,Gene GeneFull,SJ,Velocyto} Parameters passed to STARsolo –soloFeatures. Specify –soloFeatures Gene for single-cell data. Specify –soloFeatures GeneFull for single-nuclei data. Specify –soloFeatures Gene GeneFull for getting both counts in exons level and exon + intron level (velocity). If multiple features are listed, e.g. Gene GeneFull SJ, only the count matrix generated by the first feature (here: Gene) will be used for downstream analysis. DEFAULT: Gene.
`--STARsolo_Threads`	Threads for running STARsolo. DEFAULT: 12.

Running and output arguments:

Arguments	Description
`--cores`	The number of cores to use. DEFAULT: 10.
`--rseqc`	Whether or not to run RSeQC. If set, the pipeline will include the RSeQC part and then takes a longer time. By default (not set), the pipeline will skip the RSeQC part.
`--directory`	Path to the directory where the workflow shall be initialized and results shall be stored. DEFAULT: MAESTRO.
`--mergedname`	Prefix of merged output files. DEFAULT: All_sample.
`--outprefix`	Prefix of output files. Only required for platform of ‘Smartseq2’. DEFAULT: MAESTRO.

Quality control arguments:

Arguments	Description
`--count-cutoff`	Cutoff for the number of count in each cell. DEFAULT: 1000.
`--gene-cutoff`	Cutoff for the number of genes included in each cell. DEFAULT: 500.
`--cell-cutoff`	Cutoff for the number of cells covered by each gene. DEFAULT: 10.

Reference genome arguments:

Arguments	Description
`--mapindex`	Genome index directory for STAR. Users can just download the index file for human and mouse from CistromeDB and decompress them. Then specify the index directory for STAR, for example, `--mapindex Refdata_scRNA_MAESTRO_GRCh38_1.1.0/GRCh38_STAR_2.7.3a`.
`--rsem`	The prefix of transcript references for RSEM used by rsem-prepare-reference (Only required when the platform is Smartseq2). Users can directly download the reference file for human and mouse from CistromeDB and decompress them. Then specify the prefix for RSEM, for example, `--rsem Refdata_scRNA_MAESTRO_GRCh38_1.1.0/GRCh38_RSEM_1.3.2/GRCh38`.

Barcode arguments, for platform of ‘Dropseq’ or ‘10x-genomics’:

Arguments	Description
`--whitelist`	If the platform is ‘Dropseq’ or ‘10x-genomics’, please specify the barcode library (whitelist) so that STARsolo can do the error correction and demultiplex of cell barcodes. The 10X Chromium whitelist file can be found inside the CellRanger distribution. Please make sure that the whitelist is compatible with the specific version of the 10X chemistry: V2 or V3. For example, in CellRanger 3.1.0, the V2 whitelist is ‘cellranger-3.1.0/cellranger-cs/3.1.0/lib/python/cellranger/barcodes/737K-august-2016.txt’. The V3 whitelist is ‘cellranger-3.1.0/cellranger-cs/3.1.0/lib/python/cellranger/barcodes/3M-february-2018.txt’.
`--barcode-start`	The start site of each barcode. DEFAULT: 1.
`--barcode-length`	The length of cell barcode. For 10x-genomics, the length of barcode is 16. DEFAULT: 16.
`--umi-start`	The start site of UMI. DEFAULT: 17.
`--umi-length`	The length of UMI. For 10x-genomics, the length of V2 chemistry is 10. For 10X V3 chemistry, the length is 12. DEFAULT: 10.
`--trimR1`	Whether or not to run the R1 file. If set, the pipeline will include the trim off anything after the R1 reads past barcode information. Only necessary if reads were sequenced past these barcodes, by default not set.

Regulator identification arguments:

Arguments	Description
`--lisadir`	Path to the LISA data files. Please download LISA’s required data from cistrome.org: human and mouse. The latest version of LISA2 was recently released to largely decrease processing time on multiple gene lists. Please check out for detailed information.

Cell signature arguments:

Arguments Description

--signature Cell signature file used to annotate cell types. MAESTRO provides several sets of built-in cell signatures. Users can choose from [‘human.immune.CIBERSORT’, ‘mouse.brain.ALLEN’, ‘mouse.all.facs.TabulaMuris’, ‘mouse.all.droplet.TabulaMuris’]. Custom cell signatures are also supported. In this situation, users need to provide the file location of cell signatures, and the signature file is tab-separated without header. The first column is cell type, and the second column is signature gene. DEFAULT: human.immune.CIBERSORT.

Arguments	Description
`--signature`	Cell signature file used to annotate cell types. MAESTRO provides several sets of built-in cell signatures. Users can choose from [‘human.immune.CIBERSORT’, ‘mouse.brain.ALLEN’, ‘mouse.all.facs.TabulaMuris’, ‘mouse.all.droplet.TabulaMuris’]. Custom cell signatures are also supported. In this situation, users need to provide the file location of cell signatures, and the signature file is tab-separated without header. The first column is cell type, and the second column is signature gene. DEFAULT: human.immune.CIBERSORT.

Step 2. Configure the samples.json file

In the project directory, we will next run MAESTRO samples-init subcommand to configure samples.json file. The purpose of this step is let MAESTRO read the right format and path of input data.

$ cd pbmc_1k_v3_fastqs/

$ MAESTRO samples-init --assay_type scrna --platform 10x-genomics --data_typ fastq --data_dir data/pbmc_1k_v3_fastqs/

To get a full description of command-line options, please use the command

MAESTRO samples-init -h

Input files arguments:

Arguments	Description
`--assay_type`	{scrna,scatac} choose the type of assays.
`--platform`	{10x-genomics,sci-ATAC-seq,microfluidic} choose the type of sequencing platform.
`--data_type`	{fastq,fragment,bam} choose the type of input data. When using fastq format as input, MAESTRO can only recognize the 10X-like naming: [Sample Name]S1_L00[Lane Number][Read Type]_001.fastq.gz.
`--data_dir`	Path to the input file folder.

Once configured, you will find a samples.json file in the current project directory:

$ cat samples.json
{
    "pbmc_1k_v3": {
        "I1": [
            "/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_I1_001.fastq.gz",
            "/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_I1_001.fastq.gz"
        ],
        "R1": [
            "/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz",
            "/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz"
        ],
        "R2": [
            "/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz",
            "/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz"
        ]
    }
}

Step 3. Run snakemake pipeline

Before running the workflow, please check the config.yaml and see if it is configured correctly. Once config.yaml is configured, users can use snakemake to run the workflow.

$ snakemake -np

$ nohup snakemake -j 16 > run.out &
##Instead of using nohup, submitting $ snakemake -j 16 by slurm is highly recommended.

Step 4. Understand the final output files

The whole pipeline in this example takes about 4 hours with 12 cores. If MAESTRO runs successfully, an output folder will contain several useful outputs as described below.

$ ls Result
Analysis  Benchmark  Log  QC  Report  STAR

Output files

STAR/: The STAR directory contains all the mapping and analysis files generated by STAR normal (Smartseq2) or solo mode (10x-genomics or Dropseq). For solo mode, Solo.out/Gene/ stores raw and filtered count matrix. In MAESTRO pipeline, raw count matrix is used for a further filter according to quality control arguments like --count-cutoff --gene-cutoff and --cell-cutoff.
QC/: The QC directory contains quality control analysis results of scRNA-seq data, including the filtered count matrix outprefix_filtered_gene_count.h5 and RSeQC results (if --rseqc is set).
Analysis/: The Analysis directory contains a Seurat R object, as well as clustering result, cell type annotation result and driver transcription factor identification result, which we will introduce in the step-by-step analysis.
Benchmark/: The Benchmark directory stores benchmark files for all rules in Snakefile, each of which contains a tab-separated table of run times and memory usage in MiB.
Log/: The Log directory contains the log files generated in the pipeline analysis.
Report/: Contain the {outprefix}_scRNA_report.html file summarizes all the results in an HTML based document. The summary HTML for the example can be found here.

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        knitr_1.33        magrittr_2.0.1    R6_2.5.0         
 [5] rlang_0.4.11      fansi_0.5.0       stringr_1.4.0     tools_4.1.0      
 [9] xfun_0.24         utf8_1.2.1        git2r_0.28.0      jquerylib_0.1.4  
[13] htmltools_0.5.1.1 ellipsis_0.3.2    rprojroot_2.0.2   yaml_2.2.1       
[17] digest_0.6.27     tibble_3.1.2      lifecycle_1.0.0   crayon_1.4.1     
[21] later_1.2.0       sass_0.4.0        vctrs_0.3.8       promises_1.2.0.1 
[25] fs_1.5.0          glue_1.4.2        evaluate_0.14     rmarkdown_2.9    
[29] stringi_1.6.2     bslib_0.2.5.1     compiler_4.1.0    pillar_1.6.1     
[33] jsonlite_1.7.2    httpuv_1.6.1      pkgconfig_2.0.3