Last updated: 2021-07-27
Checks: 7 0
Knit directory: MAESTRO_documentation/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20201223)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 4a7b06f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .RData
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: code/.DS_Store
Ignored: data/.DS_Store
Ignored: data/multi-scatac/.DS_Store
Ignored: data/multi-scrna/
Untracked files:
Untracked: .RDataTmp
Untracked: code/snRNA_genelength.R
Untracked: code/snRNA_genelength_TPM.R
Untracked: data/allen/
Untracked: data/multi-scatac/bed/
Untracked: data/multi-scatac/bigwig/GDF7_bed_added.png
Untracked: output/snRNA_TPM_VS_genelength.png
Untracked: output/snRNA_genelengthVSLognormUMI.png
Untracked: output/snRNA_genelengthVSUMI.png
Unstaged changes:
Deleted: MultiSample_scATACseq.Rproj
Deleted: atac_pbmc_500_nextgem.GIGGLE/0.peaks.bed
Deleted: atac_pbmc_500_nextgem.GIGGLE/0.peaks.bed.gz
Deleted: atac_pbmc_500_nextgem.GIGGLE/0.peaks.bed.result.xls
Modified: code/Basic_Operations.R
Modified: pbmc_1k_v3_8k_res.rds
Modified: pbmc_1k_v3_Monocyte_filtered.pdf
Modified: pbmc_1k_v3_Monocyte_top.pdf
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with wflow_publish()
to start tracking its development.
In this example, we will be analyzing a scRNA-seq dataset of 1K human peripheral blood mononuclear cells (PBMCs) freely available from 10X Genomics. The raw dataset can be downloaded from the 10X Genomics website. We will show you how to run through the whole MAESTRO pipeline from the raw sequencing fastq files to the final results.
Please download the raw data from 10X genomics website.
$ wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar
$ tar xvf pbmc_1k_v3_fastqs.tar
Before running MAESTRO, users need to activate the MAESTRO environment.
$ conda activate MAESTRO
Initialize the MAESTRO scRNA-seq workflow using MAESTRO scrna-init command. This will install a Snakefile and a config file in this directory.
$ MAESTRO scrna-init \
--platform 10x-genomics --species GRCh38 \
--cores 32 --rseqc --directory pbmc_1k_v3_fastqs \
--count-cutoff 1000 --gene-cutoff 500 --cell-cutoff 10 \
--mapindex references/Refdata_scRNA_MAESTRO_GRCh38_1.2.2/GRCh38_STAR_2.7.6a \
--whitelist references/whitelist/3M-february-2018.txt \
--barcode-start 1 --barcode-length 16 --umi-start 17 --umi-length 12 \
--lisadir references/hg38_1000_2.0.h5 \
--signature human.immune.CIBERSORT
To get a full description of command-line options, please use the command
$ MAESTRO scrna-init -h
Input files arguments:
Arguments | Description |
---|---|
--platform |
{10x-genomics,Dropseq,Smartseq2} Platform of single cell RNA-seq. DEFAULT: 10x-genomics. |
--sample-file |
JSON file with sample file information. |
--fastq-barcode |
Specify the barcode fastq file, only for the platform of ‘Dropseq’. If there are multiple pairs of fastq, please provide a comma-separated list of barcode fastq files. For example, --fastq-barcode test1_1.fastq,test2_1.fastq . |
--fastq-transcript |
Specify the transcript fastq file, only for the platform of ‘Dropseq’. |
--species |
{GRCh38,GRCm38} Specify the genome assembly (GRCh38 for human and GRCm38 for mouse). DEFAULT: GRCh38. |
STARsolo reads mapping arguments:
Arguments | Description |
---|---|
--STARsolo_Features |
{Gene,GeneFull,Gene GeneFull,SJ,Velocyto} Parameters passed to STARsolo –soloFeatures. Specify –soloFeatures Gene for single-cell data. Specify –soloFeatures GeneFull for single-nuclei data. Specify –soloFeatures Gene GeneFull for getting both counts in exons level and exon + intron level (velocity). If multiple features are listed, e.g. Gene GeneFull SJ, only the count matrix generated by the first feature (here: Gene) will be used for downstream analysis. DEFAULT: Gene. |
--STARsolo_Threads |
Threads for running STARsolo. DEFAULT: 12. |
Running and output arguments:
Arguments | Description |
---|---|
--cores |
The number of cores to use. DEFAULT: 10. |
--rseqc |
Whether or not to run RSeQC. If set, the pipeline will include the RSeQC part and then takes a longer time. By default (not set), the pipeline will skip the RSeQC part. |
--directory |
Path to the directory where the workflow shall be initialized and results shall be stored. DEFAULT: MAESTRO. |
--mergedname |
Prefix of merged output files. DEFAULT: All_sample. |
--outprefix |
Prefix of output files. Only required for platform of ‘Smartseq2’. DEFAULT: MAESTRO. |
Quality control arguments:
Arguments | Description |
---|---|
--count-cutoff |
Cutoff for the number of count in each cell. DEFAULT: 1000. |
--gene-cutoff |
Cutoff for the number of genes included in each cell. DEFAULT: 500. |
--cell-cutoff |
Cutoff for the number of cells covered by each gene. DEFAULT: 10. |
Reference genome arguments:
Arguments | Description |
---|---|
--mapindex |
Genome index directory for STAR. Users can just download the index file for human and mouse from CistromeDB and decompress them. Then specify the index directory for STAR, for example, --mapindex Refdata_scRNA_MAESTRO_GRCh38_1.1.0/GRCh38_STAR_2.7.3a . |
--rsem |
The prefix of transcript references for RSEM used by rsem-prepare-reference (Only required when the platform is Smartseq2). Users can directly download the reference file for human and mouse from CistromeDB and decompress them. Then specify the prefix for RSEM, for example, --rsem Refdata_scRNA_MAESTRO_GRCh38_1.1.0/GRCh38_RSEM_1.3.2/GRCh38 . |
Barcode arguments, for platform of ‘Dropseq’ or ‘10x-genomics’:
Arguments | Description |
---|---|
--whitelist |
If the platform is ‘Dropseq’ or ‘10x-genomics’, please specify the barcode library (whitelist) so that STARsolo can do the error correction and demultiplex of cell barcodes. The 10X Chromium whitelist file can be found inside the CellRanger distribution. Please make sure that the whitelist is compatible with the specific version of the 10X chemistry: V2 or V3. For example, in CellRanger 3.1.0, the V2 whitelist is ‘cellranger-3.1.0/cellranger-cs/3.1.0/lib/python/cellranger/barcodes/737K-august-2016.txt’. The V3 whitelist is ‘cellranger-3.1.0/cellranger-cs/3.1.0/lib/python/cellranger/barcodes/3M-february-2018.txt’. |
--barcode-start |
The start site of each barcode. DEFAULT: 1. |
--barcode-length |
The length of cell barcode. For 10x-genomics, the length of barcode is 16. DEFAULT: 16. |
--umi-start |
The start site of UMI. DEFAULT: 17. |
--umi-length |
The length of UMI. For 10x-genomics, the length of V2 chemistry is 10. For 10X V3 chemistry, the length is 12. DEFAULT: 10. |
--trimR1 |
Whether or not to run the R1 file. If set, the pipeline will include the trim off anything after the R1 reads past barcode information. Only necessary if reads were sequenced past these barcodes, by default not set. |
Regulator identification arguments:
Arguments | Description |
---|---|
--lisadir |
Path to the LISA data files. Please download LISA’s required data from cistrome.org: human and mouse. The latest version of LISA2 was recently released to largely decrease processing time on multiple gene lists. Please check out for detailed information. |
Cell signature arguments:
Arguments | Description |
---|---|
--signature |
Cell signature file used to annotate cell types. MAESTRO provides several sets of built-in cell signatures. Users can choose from [‘human.immune.CIBERSORT’, ‘mouse.brain.ALLEN’, ‘mouse.all.facs.TabulaMuris’, ‘mouse.all.droplet.TabulaMuris’]. Custom cell signatures are also supported. In this situation, users need to provide the file location of cell signatures, and the signature file is tab-separated without header. The first column is cell type, and the second column is signature gene. DEFAULT: human.immune.CIBERSORT. |
In the project directory, we will next run MAESTRO samples-init
subcommand to configure samples.json file. The purpose of this step is let MAESTRO read the right format and path of input data.
$ cd pbmc_1k_v3_fastqs/
$ MAESTRO samples-init --assay_type scrna --platform 10x-genomics --data_typ fastq --data_dir data/pbmc_1k_v3_fastqs/
To get a full description of command-line options, please use the command
MAESTRO samples-init -h
Input files arguments:
Arguments | Description |
---|---|
--assay_type |
{scrna,scatac} choose the type of assays. |
--platform |
{10x-genomics,sci-ATAC-seq,microfluidic} choose the type of sequencing platform. |
--data_type |
{fastq,fragment,bam} choose the type of input data. When using fastq format as input, MAESTRO can only recognize the 10X-like naming: [Sample Name]S1_L00[Lane Number][Read Type]_001.fastq.gz. |
--data_dir |
Path to the input file folder. |
Once configured, you will find a samples.json file in the current project directory:
$ cat samples.json
{
"pbmc_1k_v3": {
"I1": [
"/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_I1_001.fastq.gz",
"/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_I1_001.fastq.gz"
],
"R1": [
"/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz",
"/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz"
],
"R2": [
"/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz",
"/MAESTRO/test/tutorial/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz"
]
}
}
Before running the workflow, please check the config.yaml
and see if it is configured correctly. Once config.yaml
is configured, users can use snakemake to run the workflow.
$ snakemake -np
$ nohup snakemake -j 16 > run.out &
##Instead of using nohup, submitting $ snakemake -j 16 by slurm is highly recommended.
The whole pipeline in this example takes about 4 hours with 12 cores. If MAESTRO runs successfully, an output folder will contain several useful outputs as described below.
$ ls Result
Analysis Benchmark Log QC Report STAR
STAR
directory contains all the mapping and analysis files generated by STAR
normal (Smartseq2) or solo mode (10x-genomics or Dropseq). For solo mode, Solo.out/Gene/
stores raw and filtered count matrix. In MAESTRO pipeline, raw count matrix is used for a further filter according to quality control arguments like --count-cutoff
--gene-cutoff
and --cell-cutoff
.QC
directory contains quality control analysis results of scRNA-seq data, including the filtered count matrix outprefix_filtered_gene_count.h5
and RSeQC results (if --rseqc
is set).Analysis
directory contains a Seurat R object, as well as clustering result, cell type annotation result and driver transcription factor identification result, which we will introduce in the step-by-step analysis.Benchmark
directory stores benchmark files for all rules in Snakefile, each of which contains a tab-separated table of run times and memory usage in MiB.{outprefix}_scRNA_report.html
file summarizes all the results in an HTML based document. The summary HTML for the example can be found here.
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflowr_1.6.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 knitr_1.33 magrittr_2.0.1 R6_2.5.0
[5] rlang_0.4.11 fansi_0.5.0 stringr_1.4.0 tools_4.1.0
[9] xfun_0.24 utf8_1.2.1 git2r_0.28.0 jquerylib_0.1.4
[13] htmltools_0.5.1.1 ellipsis_0.3.2 rprojroot_2.0.2 yaml_2.2.1
[17] digest_0.6.27 tibble_3.1.2 lifecycle_1.0.0 crayon_1.4.1
[21] later_1.2.0 sass_0.4.0 vctrs_0.3.8 promises_1.2.0.1
[25] fs_1.5.0 glue_1.4.2 evaluate_0.14 rmarkdown_2.9
[29] stringi_1.6.2 bslib_0.2.5.1 compiler_4.1.0 pillar_1.6.1
[33] jsonlite_1.7.2 httpuv_1.6.1 pkgconfig_2.0.3