Setting up

Hopefully, everyone can log into their (or shared) rugen node. Make a directory for this workshop.

mkdir STAR_tutorial
cd STAR_tutorial

Installing

STAR

Precompiled binaries are easiest.

wget https://github.com/alexdobin/STAR/blob/master/bin/Linux_x86_64_static/STAR

wget https://github.com/alexdobin/STAR/blob/master/bin/MacOSX_x86_64/STAR

Samtools

http://www.htslib.org/download/

Bedtools

https://bedtools.readthedocs.io/en/latest/content/installation.html

IGVTools

https://software.broadinstitute.org/software/igv/igvtools

GATK

https://software.broadinstitute.org/gatk/

See if you can get version 3 and 4

Get data

You will need a genome file (FASTA) and its corresponding annotation file (GTF). There are many locations and ways to download these files. GENCODE (https://www.gencodegenes.org/human/) has data for humans and mouse only, while ENSEMBL (http://useast.ensembl.org/index.html) and NCBI (https://www.ncbi.nlm.nih.gov/search/all/?term=human) will have other species. Note, these are also specific to the latest release and will change over time.

GRChXX: genome reference version
pXX: patch version
GencodeXX: gene annotation version

Genome (FASTA) and annotation (GTF) file

mkdir GRCh38_Gencode31
cd GRCh38_Gencode31
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh38.p12.genome.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz
gunzip *

Reads/experiment

I have two experiments but they are a little large and we wont have time to wait for the alignment results during this workshop. But I’ve made smaller buk files to play with, and the you can use one of the single-cell paired runs to test. Inputs (and outputs) can be found in this folder on tyrone:

ls /data/sballouz
ase_run  bulk_run  single_cell_run

ls /data/sballouz/bulk_run
fastq  fastqc_out  STAR_out  xcgd_carrier_1_pbmc_S4_R1_001_sub.fastq  xcgd_carrier_1_pbmc_S4_R2_001_sub.fastq

ls /data/sballouz/single_cell_run
SCGC-GILL-JG-03_S1_L001_I1_001.fastq.gz  SCGC-GILL-JG-03_S1_L003_R1_001.fastq.gz
SCGC-GILL-JG-03_S1_L001_R1_001.fastq.gz  SCGC-GILL-JG-03_S1_L003_R2_001.fastq.gz
SCGC-GILL-JG-03_S1_L001_R2_001.fastq.gz  SCGC-GILL-JG-03_S1_L004_I1_001.fastq.gz
SCGC-GILL-JG-03_S1_L002_I1_001.fastq.gz  SCGC-GILL-JG-03_S1_L004_R1_001.fastq.gz
SCGC-GILL-JG-03_S1_L002_R1_001.fastq.gz  SCGC-GILL-JG-03_S1_L004_R2_001.fastq.gz
SCGC-GILL-JG-03_S1_L002_R2_001.fastq.gz  STAR_out_old
SCGC-GILL-JG-03_S1_L003_I1_001.fastq.gz  STAR_out_updated

ls /data/sballouz/ase_run
chr_cut  SRR598009  SRR598009.Aligned.sortedByCoord.out.bam  SRR598009.Aligned.sortedByCoord.out.bam.bai

Feel free to copy or link to these directories.

Prepare genome

Run from the same directory or from the genome location. Specify output directories. Note, depending where you run this, you might need to modify RAM and number of threads. Overhang is default. Many other options can be tweaked and depend on the species and application. You can only use the same STAR version for mapping that generated the genome!

/home/sballouz/tools/STAR-2.7.3a/STAR   \
--runThreadN 10       \
--runMode genomeGenerate  \
--genomeDir GRCh38_Gencode31   \
--genomeFastaFiles  GRCh38.p12.genome.fa  \
--sjdbGTFfile gencode.v31.annotation.gtf  \
--sjdbOverhang 100                 \
--limitGenomeGenerateRAM 40048000000

Alignment

Bulk RNA-seq experiment

/home/sballouz/tools/STAR-2.7.3a/STAR  \
--genomeDir /data/genomes/GRCh38_Gencode31/ \
--readFilesIn /data/sballouz/bulk_run/xcgd_carrier_1_pbmc_S4_R1_001.fastq  /data/sballouz/bulk_run/xcgd_carrier_1_pbmc_S4_R2_001.fastq \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--twopassMode Basic \
--twopass1readsN -1 \
--runThreadN 10

/home/sballouz/tools/STAR-2.7.3a/STAR  \
--genomeDir /data/genomes/GRCh38_Gencode31/ \
--readFilesIn /data/sballouz/bulk_run/xcgd_carrier_1_pbmc_S4_R1_001_sub.fastq  /data/sballouz/bulk_run/xcgd_carrier_1_pbmc_S4_R2_001_sub.fastq \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--runThreadN 10

Single-cell RNA-seq experiment

STAR \
--runThreadN 20 \
--soloType Droplet \
--genomeDir /data/genomes/GRCh38_Gencode31/ \
--outSAMtype BAM Unsorted \
--quantMode GeneCounts \
--soloCBwhitelist /data/genomes/cellranger_barcodes/3M-february-2018.txt \
--readFilesIn SCGC-GILL-JG-03_S1_L001_R2_001.fastq.gz,SCGC-GILL-JG-03_S1_L002_R2_001.fastq.gz,SCGC-GILL-JG-03_S1_L003_R2_001.fastq.gz,SCGC-GILL-JG-03_S1_L004_R2_001.fastq.gz SCGC-GILL-JG-03_S1_L001_R1_001.fastq.gz,SCGC-GILL-JG-03_S1_L002_R1_001.fastq.gz,SCGC-GILL-JG-03_S1_L003_R1_001.fastq.gz,SCGC-GILL-JG-03_S1_L004_R1_001.fastq.gz \
--readFilesCommand zcat \
--soloCellFilter  CellRanger2.2 8000 0.99 10 \
--soloFeatures Gene Velocyto \
--soloBarcodeReadLength 28

Inspecting output

Bulk

Start R, and then read in the bulk sample output.

files = "STAR_out"
genecounts = "ReadsPerGene.out.tab"
splicejunctions = "SJ.out.tab"
logname = "Log.final.out"
load("/data/genomes/GRCh38_Gencode31/gene_annotations_v31.Rdata") # see other tutorial on how to make this
dir = "."

Ns = list()
i = 1

for( n in files ){
  N = list()
  filedir = paste(dir, n, sep="/")
  countfile = paste(filedir, genecounts, sep=".")
  logfile = paste(filedir, logname, sep=".")
  if( file.exists(countfile) ) {
    print(countfile)
    
    counts =  read.table(countfile)
    log1 =read.table(logfile, sep="\t", nrows=6)
    log2 =read.table(logfile, sep="\t", skip=8, nrows=14)
    log3 =read.table(logfile, sep="\t", skip=23, nrows=4)
    log4 =read.table(logfile, sep="\t", skip=28, nrows=3)

    N$mapinfo      = rbind(log1,log2,log3,log4)
    N$unmapped     = counts[1,]
    N$multimapping = counts[2,]
    N$noFeature    = counts[3,]
    N$ambiguous    = counts[4,]
    N$length       = dim(counts)[1]-4
    N$genes        = counts[(1:N$length)+4,1]
    N$counts1      = counts[(1:N$length)+4,2]
    N$counts2      = counts[(1:N$length)+4,3]
    N$counts3      = counts[(1:N$length)+4,4]
 } else {
   # Stranded or unstranded? Spot check this before running the code. Should use the last column if stranded. Otherwise first/max. 
   N$counts3 = rep(0, length(attr$ensemblID ) )
 }
 if( i > 1  ){
   counts_exp = cbind(counts_exp, N$counts3)
 } else {
   counts_exp = N$counts3
 }
 Ns[[i]] = N
 print(i)
 i = i + 1
}
rownames(counts_exp) = N$genes  # should be the same for all samples if run on the same genome 
# rownames(counts_exp) = attr$ensemblID # ignore for now
colnames(counts_exp) = files
save(Ns, counts_exp, file=paste(dir, "counts.Rdata", sep=""))

Single-cell

If reading in the raw files, you will need to run the droplet processing pipeline first.

library(DropletUtils)
library(dplyr)
library(ggplot2)

# Load STAR solo output (reads in matrix.mtx file)
sce <- read10xCounts(data.dir = "Solo.out/")

# Rank barcodes to find knee and inflection points -> where UMI counts shift suggesting empty drops/ambient RNA
br.out <- barcodeRanks(counts(sce))
br.out.df <- as.data.frame(br.out)
br.out.df$barcode <- colData(sce)$Barcode
br.out.df %>% filter(rank <= 10) %>% arrange(rank)
x_knee <- br.out.df %>% filter(total > br.out$knee) %>% arrange(total) %>% select(rank) %>% head(1)
x_inflection <- br.out.df %>% filter(total > br.out$inflection) %>% arrange(total) %>% select(rank) %>% head(1)
padding <- length(br.out$rank) / 10
  

# Bend the knee!  
ggplot(br.out.df, aes(x = rank, y = total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        title = element_text(size = 16)) +
  geom_hline(yintercept = br.out$knee, linetype = 2, colour = "dodgerblue") +
  geom_hline(yintercept = br.out$inflection, linetype = 2, colour = "forestgreen") +
  labs(x = "Rank", y = "Total", title = "Barcode Rank vs Total UMI") +
  annotate("text", label = paste0("Knee (", x_knee, ")"), x = x_knee$rank + padding, y = br.out$knee, size = 5) +
  annotate("text", label = paste0("Inflection (", x_inflection, ")"), x = x_inflection$rank + padding, y = br.out$inflection, size = 5)


# Find the empty drops 
e.out <- emptyDrops(counts(sce))
 
# use FDR threshold of 0.01
is.cell <- e.out$FDR <= 0.01
 
# replace all NA's with FALSE
is.cell.no.na <- is.cell
is.cell.no.na[is.na(is.cell)] <- FALSE
sum(is.cell.no.na)
 

# Plot 
ggplot(br.out.df[is.cell.no.na,], aes(x = rank, y = total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        title = element_text(size = 16)) +
  labs(x = "Rank", y = "Total", title = "Cell barcodes that differ from ambient RNA")

# 
sce <- sce[,is.cell.no.na]
save(sce, file="sce.Rdata")

Or if filtered, read that file in instead, and carry on the single-cell analyses.

The STARs align