scATAC-seq

Last updated: 2021-07-27

Checks: 7 0

Knit directory: MAESTRO_documentation/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20201223)

The command set.seed(20201223) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 4a7b06f

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 4a7b06f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/multi-scatac/.DS_Store
    Ignored:    data/multi-scrna/

Untracked files:
    Untracked:  .RDataTmp
    Untracked:  code/snRNA_genelength.R
    Untracked:  code/snRNA_genelength_TPM.R
    Untracked:  data/allen/
    Untracked:  data/multi-scatac/bed/
    Untracked:  data/multi-scatac/bigwig/GDF7_bed_added.png
    Untracked:  output/snRNA_TPM_VS_genelength.png
    Untracked:  output/snRNA_genelengthVSLognormUMI.png
    Untracked:  output/snRNA_genelengthVSUMI.png

Unstaged changes:
    Deleted:    MultiSample_scATACseq.Rproj
    Deleted:    atac_pbmc_500_nextgem.GIGGLE/0.peaks.bed
    Deleted:    atac_pbmc_500_nextgem.GIGGLE/0.peaks.bed.gz
    Deleted:    atac_pbmc_500_nextgem.GIGGLE/0.peaks.bed.result.xls
    Modified:   code/Basic_Operations.R
    Modified:   pbmc_1k_v3_8k_res.rds
    Modified:   pbmc_1k_v3_Monocyte_filtered.pdf
    Modified:   pbmc_1k_v3_Monocyte_top.pdf

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

Introduction

In this example, we will be analyzing a scATAC-seq dataset of 500 human peripheral blood mononuclear cells (PBMCs) freely available from 10X Genomics. The raw dataset can be downloaded from the 10X Genomics website. We will show you how to run through the whole MAESTRO pipeline from the raw sequencing fastq files to the final results.

Step 0. Download the data and prepare the environment

Please download the raw data from 10X Genomics website.

$ wget https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fastqs.tar
$ tar xvf atac_pbmc_500_nextgem_fastqs.tar

Before running MAESTRO, users need to activate the MAESTRO environment.

$ conda activate MAESTRO

Step 1. Configure the MAESTRO workflow

Initialize the MAESTRO scATAC-seq workflow using MAESTRO scATAC-init command. This will install a Snakefile and a config file in this directory. Here we take the 10X PBMC data as an example. Considering MAESTRO provides built-in immune cell markers, it’s recommended to choose the RP-based cell-type annotation strategy. If the data is not immune-related, users can choose to provide their own marker gene list, or choose to annotate cell types through the peak-based method (see the following argument description for more details), or they can directly skip the annotation step by not setting --annotation.

$ MAESTRO scatac-init --input_path /MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs \
--gzip --species GRCh38 --platform 10x-genomics --format fastq --mapping chromap \
--giggleannotation /annotations/giggle.all/ \
--fasta /references/Refdata_scATAC_MAESTRO_GRCh38_1.1.0/GRCh38_genome.fa \
--index /references/chromap/GRCh38_chromap.index \
--whitelist /references/whitelist/737K-cratac-v1.txt \
--cores 16 --directory atac_pbmc_500_nextgem_chromap \
--annotation --method RP-based --signature human.immune.CIBERSORT \
--rpmodel Enhanced \
--peak_cutoff 100 --count_cutoff 1000 --frip_cutoff 0.2 --cell_cutoff 50

To get a full description of command-line options, please use the command

$ MAESTRO scatac-init -h

Here we list all the arguments and their description.

multi-sample processing parameters:

Arguments	Description
`--batch`	If set as true, peaks will be called on each sample individually first and peaks from all samples will be merged.A count matrix for each sample will be produced based on the merged peak set.
`--consensus_peaks`	When batch is TRUE, users can define whether to merge consensus peaks from each sample.If set as true, users should also set number of cutoff_samples to define consensus.
`--cutoff_samples`	Minimum number of samples to present consensus peaks. The peaks present in at least cutoff_samples will be kept.
`--bulk_peaks`	For multi-samples from the same experiment, if set as true, peaks will be called after merging all bam file.Bulk_peaks and consensus_peaks are mutually exclusive.
`--downsample`	For deeply sequenced samples, bam files can be downsampled to a certain number of reads (target_reads) to get peak set.
`--target_reads`	Number of reads to be kept in downsampling. If the sample has fewer than the target_reads, the original number of reads will be kept.

Input files arguments:

Arguments	Description
`--input_path`	Path to the input fastq files.
`--gzip`	Set as True if the input files are gzipped.
`--species`	{GRCh38,GRCm38} Specify the genome assembly (GRCh38 for human and GRCm38 for mouse). DEFAULT: GRCh38.
`--platform`	{10x-genomics,sci-ATAC-seq} Platform of single cell ATAC-seq. DEFAULT: 10x-genomics.
`--format`	{fastq,bam,fragments} The format of input files. Users can start with sequencing fastq files, bam files with CB tag or fragments.tsv.gz file generated by CellRanger ATAC. If the platform is 10x-genomics, users can start with sequencing fastq files, bam files with CB tag or fragments.tsv.gz file generated by CellRanger ATAC. If the platform is sci-ATAC-seq, users can start with sequencing fastq files, bam files with CB tag.
`--mapping`	{chromap,minimap2} Choose the alignment tool for scATAC-seq from either chromap or minimap2. DEFAULT: chromap.
`--deduplication`	{cell-level,bulk-level} deduplication level: cell-level or bulk-level.

Reference genome arguments:

Arguments	Description
`--giggleannotation`	Path of the giggle annotation file required for regulator identification. Please download the annotation file from here and decompress it..
`--fasta`	Genome fasta file for minimap2. Users can just download the fasta file for human and mouse from CistromDB and decompress them. For example, `--fasta Refdata_scATAC_MAESTR O_GRCh38_1.1.0/GRCh38_genome.fa`.
`--index`	Path of the reference index file for chromap. Users need to build the index file for the reference using command $ chromap -i -r ref.fa -o index.

Barcode library arguments, only for the platform of ‘sci-ATAC-seq’:

Arguments Description

--whitelist If the platform is ‘sci-ATAC-seq’ or ‘10x-genomics’, please specify the barcode library (whitelist) so that the pipeline can correct cell barcodes with 1 base mismatched. Otherwise, the pipeline will automatically output the barcodes with enough read count (>1000). The 10X Chromium whitelist file can be found inside the CellRanger-ATAC distribution. For example, in CellRanger-ATAC 1.1.0, the whitelist is ‘cellranger-atac-1.1.0/cellranger-atac-cs/1.1.0/lib/python/barcodes/737K-cratac-v1.txt’.

Arguments	Description
`--whitelist`	If the platform is ‘sci-ATAC-seq’ or ‘10x-genomics’, please specify the barcode library (whitelist) so that the pipeline can correct cell barcodes with 1 base mismatched. Otherwise, the pipeline will automatically output the barcodes with enough read count (>1000). The 10X Chromium whitelist file can be found inside the CellRanger-ATAC distribution. For example, in CellRanger-ATAC 1.1.0, the whitelist is ‘cellranger-atac-1.1.0/cellranger-atac-cs/1.1.0/lib/python/barcodes/737K-cratac-v1.txt’.

Output arguments:

Arguments	Description
`--cores`	The number of cores to use. DEFAULT: 8.
`--directory`	Path to the directory where the workflow shall be initialized and results shall be stored. DEFAULT: MAESTRO.

Cell-type annotation arguments:

Arguments	Description
`--annotation`	Whether or not to perform cell-type annotation. By default (not set), MAESTRO will skip the step of cell-type annotation. If set, please specify the method of cell-type annotation through `--method`.
`--method`	{RP-based,peak-based,both} Method to annotate cell types. MAESTRO provides two strategies to annotate cell types for scATAC-seq data. Users can choose from ‘RP-based’ and ‘peak-based’, or choose to run both of them. One is based on gene regulatory potential predicted by RP model. Another is based on the bulk chromatin accessibility data from Cistrome database. If ‘RP-based’ is set, MAESTRO performs the cell-type annotation using the gene regulatory potential to represent gene expression, and the logFC of gene regulatory potential between one cluster and all the other cells is used to calculate the gene signature scores. If ‘peak-based’ is set, MAESTRO utilizes GIGGLE to evaluate the enrichment of bulk chromatin accessibility peaks on cluster-specific peaks from scATAC-seq data, and then transfers the Cistrome cluster identity from the most enriched bulk chromatin accessibility data as the cell-type annotation for the scATAC-seq cluster. See the MAESTRO paper for more details. DEFAULT: RP-based.
`--signature`	Cell signature file used to annotate cell types (required when method is set as ‘RP-based’). MAESTRO provides several sets of built-in cell signatures. Users can choose from [‘human.immune.CIBERSORT’, ‘mouse.brain.ALLEN’, ‘mouse.all.facs.TabulaMuris’, ‘mouse.all.droplet.TabulaMuris’]. Custom cell signatures are also supported. In this situation, users need to provide the file location of cell signatures, and the signature file is tab-separated without header. The first column is cell type, and the second column is signature gene. DEFAULT: human.immune.CIBERSORT.

Customized peak arguments:

Arguments	Description
`--custompeak`	Whether or not to provide custom peaks. If set, users need to provide the file location of peak file through `--custompeak-file` and then MAESTRO will merge the custom peak file and the peak file called from all fragments using MACS2. By default (not set), the pipeline will use the peaks called using MACS2.
`--custompeak-file`	If `--custompeak` is set, please provide the file location of custom peak file. The peak file is BED formatted with tab-separated. The first column is the chromosome, the second is chromStart, and the third is chromEnd.
`--shortpeak`	Whether or not to call peaks from short fragments (shorter than 150bp). If set, MAESTRO will merge the peaks called from all fragments and those called from short fragments, and then use the merged peak file for further analysis. If not (by default), the pipeline will only use peaks called from all fragments.
`--clusterpeak`	Whether or not to call peaks by cluster. If set, MAESTRO will split the bam file according to the clustering result, and then call peaks for each cluster. By default (not set), MAESTRO will skip this step.

Gene score arguments:

Arguments Description

--rpmodel {Simple,Enhanced} The RP model to use to calculate gene score. For each gene, simple model sums over the impact of all regulatory elements within the up/dowm-stream of TSS. On the basis of simple model, enhanced model gives the regulatory elements within the exon region a higher weight, and also excludes the regulatory elements overlapped with another gene (the promoter and exon of a nearby gene). See the MAESTRO paper for more details. DEFAULT: Enhanced.

--genedistance Gene score decay distance, could be optional from 1kb (promoter-based regulation) to 10kb (enhancer-based regulation). DEFAULT: 10000.

Arguments	Description
`--rpmodel`	{Simple,Enhanced} The RP model to use to calculate gene score. For each gene, simple model sums over the impact of all regulatory elements within the up/dowm-stream of TSS. On the basis of simple model, enhanced model gives the regulatory elements within the exon region a higher weight, and also excludes the regulatory elements overlapped with another gene (the promoter and exon of a nearby gene). See the MAESTRO paper for more details. DEFAULT: Enhanced.
`--genedistance`	Gene score decay distance, could be optional from 1kb (promoter-based regulation) to 10kb (enhancer-based regulation). DEFAULT: 10000.

Quality control arguments:

Arguments	Description
`--peak-cutoff`	Minimum number of peaks included in each cell. DEFAULT: 100.
`--count-cutoff`	Cutoff for the number of count in each cell. DEFAULT: 1000.
`--frip-cutoff`	Cutoff for fraction of reads in promoter in each cell. DEFAULT: 0.2.
`--cell-cutoff`	Minimum number of cells covered by each peak. DEFAULT: 10.

Step 2. Configure samples.json file

In the project directory, we will next run MAESTRO samples-init subcommand to configure samples.json file. The purpose of this step is let MAESTRO read the right format and path of input data.

$ cd atac_pbmc_500_nextgem

$ MAESTRO samples-init --assay_type scatac --platform 10x-genomics --data_type fastq --data_dir /MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs/

To get a full description of command-line options, please use the command

MAESTRO samples-init -h

Input files arguments:

Arguments	Description
`--assay_type`	{scrna,scatac} choose the type of assays.
`--platform`	{10x-genomics,sci-ATAC-seq,microfluidic} choose the type of sequencing platform.
`--data_type`	{fastq,fragment,bam} choose the type of input data. When using fastq format as input, MAESTRO can only recognize the 10X-like naming: `[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz`.
`--data_dir`	the full path to the input file folder.

Once configured, you will find a samples.json file in the current project directory:

$ cat samples.json
{
    "atac_pbmc_500_nextgem": {
        "I1": [
            "/MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs/atac_pbmc_500_nextgem_S1_L001_I1_001.fastq.gz",
            "/MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs/atac_pbmc_500_nextgem_S1_L002_I1_001.fastq.gz"
        ],
        "R1": [
            "/MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs/atac_pbmc_500_nextgem_S1_L001_R1_001.fastq.gz",
            "/MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs/atac_pbmc_500_nextgem_S1_L002_R1_001.fastq.gz"
        ],
        "R2": [
            "/MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs/atac_pbmc_500_nextgem_S1_L001_R2_001.fastq.gz",
            "/MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs/atac_pbmc_500_nextgem_S1_L002_R2_001.fastq.gz"
        ],
        "R3": [
            "/MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs/atac_pbmc_500_nextgem_S1_L001_R3_001.fastq.gz",
            "/MAESTRO/test/tutorial/data/atac_pbmc_500_nextgem_fastqs/atac_pbmc_500_nextgem_S1_L002_R3_001.fastq.gz"
        ]
    }
}

Step 3. Run snakemake pipeline

Before running the workflow, please check the config.yaml and see if it is configured correctly. Once configured, users can use snakemake to run the workflow.

$ snakemake -np

$ nohup snakemake --cores 32 > atac_pbmc_500_nextgem.out &
##Instead of using nohup, submitting $ snakemake -j 16 by slurm is highly recommended.

Step 4. Understand the final output files

The whole pipeline in this example takes about 6 hours with 10 cores. Here, we assume users have run MAESTRO from fastq files successfully. An output directory is specified in the run call, and will contain several useful outputs as described below.

$ ls Result/
Analysis  Benchmark  Log  Mapping  QC  Report

Output files

Mapping/: The Mapping directory contains all the mapping results generated by chromap or minimap2. fragments_corrected_count.tsv is a BED-like tabular file, in which each line represents a unique ATAC-seq fragment captured by the assay. The 5 columns in the file are chrom, chromStart, chromEnd, barcode, and duplicateCount, respectively. All the barcodes have been corrected according to the whitelist users provide. fragments_corrected_count.tsv is used for counting peaks in the downstream analysis.
QC/: The QC directory contains quality control analysis results of scATAC-seq data, including the barcode filtering table outprefix_scATAC_validcells.txt, filtered count matrix outprefix_filtered_peak_count.h5 and other QC results. The filtered count matrix is generated according to singlecell.txt and the parameters like --count-cutoff and --frip-cutoff that users provide. singlecell.txt is a tabular file, which provides QC information associated with the fragments per barcode. The file contains three columns, representing barcode, number of fragments and number of fragments overlapping promoter regions.
Analysis/: The Analysis directory contains MACS2 peak calling result, peak count table and gene score table, as well as clustering result, cell type annotation result and driver transcription factor identification result, which we will introduce in the step-by-step analysis.
Benchmark/: The Benchmark directory stores benchmark files for all rules in Snakefile, each of which contains a tab-separated table of run times and memory usage in MiB.
Log/: The Log directory contains the log files generated in the pipeline analysis.
Report/: The Report directory contains outprefix_scATAC_report.html file which summarizes all the results in an HTML based document. The summary HTML for the example can be found here.

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        knitr_1.33        magrittr_2.0.1    R6_2.5.0         
 [5] rlang_0.4.11      fansi_0.5.0       stringr_1.4.0     tools_4.1.0      
 [9] xfun_0.24         utf8_1.2.1        git2r_0.28.0      jquerylib_0.1.4  
[13] htmltools_0.5.1.1 ellipsis_0.3.2    rprojroot_2.0.2   yaml_2.2.1       
[17] digest_0.6.27     tibble_3.1.2      lifecycle_1.0.0   crayon_1.4.1     
[21] later_1.2.0       sass_0.4.0        vctrs_0.3.8       promises_1.2.0.1 
[25] fs_1.5.0          glue_1.4.2        evaluate_0.14     rmarkdown_2.9    
[29] stringi_1.6.2     bslib_0.2.5.1     compiler_4.1.0    pillar_1.6.1     
[33] jsonlite_1.7.2    httpuv_1.6.1      pkgconfig_2.0.3