Thursday, February 24, 2011

Usage examples from the wild:


============================================================
Example 1
Summary: Compute the "spanning" or "physical" coverage of my paired-end
dataset on the reference genome. I will use hg18.
Contributor: Aaron Quinlan
Date: 06-July-2010
Tools used: samtools, bamToBed, genomeCoverageBed
============================================================
Workflow:
# 1. Sort the BAM files by read ID so that step 2 can correctly compute the "span" of the
# paired-end fragment.
samtools sort -n -m 50000000000 sample.sorted.dup.bam sample.namesorted.dup

# 2. Create BED from the spanning cooridnates and sort them by chrom for genomeCoverageBed.
# NOTE: Using -bedpe format from bamToBed so that I have the starts/ends for each mate.
# NOTE: Only computing coverage based upon "properly-paired" reads. (i.e., -f 0x2)
samtools view -uf 0x2 sample.namesorted.dup.bam | bamToBed -i stdin -bedpe | awk '{OFS="\t"; print $1,$2,$6,$7,$8}' | sort -k 1,1 > sample.namesorted.dup.bam.bed

# 3. Compute the ___spanning___ coverage histogram for the sample
# NOTE: We are creating a histogram where we lump all positions with depth > 60 into a single bin (-max 60)
genomeCoverageBed -i sample.namesorted.dup.bam.bed -g human.hg18.genome -max 60 > sample.spanning.cov




============================================================
Example 2
Summary: Convert a BAM into a bedgraph or a bigWig.
Contributor: Davide Cittaro
Date: 08-July-2010
Tools used: bamToBed, slopBed, genomeCoverageBed, samtools, wigToBigWig
============================================================
Workflow:
# 1. Remove duplicates from the BAM file. I'm using samtools, Picard may be also used. I'm assuming it's a single read file
samtools rmdup -s sample.bam sample.nodup.bam

# 2. Get chromosome sizes. You may get this from UCSC or from BAM headers
samtools view -H sample.nodup.bam | awk '/@SQ/ {OFS="\t"; gsub("SN:", "", $2); gsub("LN:", "", $3); print $2, $3}' > chromsizes.tab

# 3a. Convert the BAM file into a gzipped bedgraph
# NOTE: You should define the extension size (ssize) as the average library size minus the read length. There's no need to be precise here, if you have library at 200 bp and 36 bp reads, you can set ssize in a range between 160-180 bp
bamTobed -i sample.nodup.bam | slopBed -i stdin -g chromsizes.tab -s -l 0 -r ${ssize} | genomeCoverageBed -g chromsizes.tab -i stdin -bg | gzip -c - > sample.bdg.gz

# 3b. Convert the BAM file into a bigWig file
bamTobed -i sample.nodup.bam | slopBed -i stdin -g chromsizes.tab -s -l 0 -r ${ssize} | genomeCoverageBed -g chromsizes.tab -i stdin -bg | wigToBigWig stdin chromsizes.tab sample.bw



============================================================
Example 3
Summary: A slightly different way (than ex. 2) to create BigWig files from BAM
Contributor: Assaf Gordon
Date: 09-July-2010
Tools used: samtools, genomeCoverageBed, bedGraphToBigWig
More Information : http://cancan.cshl.edu/labmembers/gordon/files/viz2.pdf
============================================================
Workflow:
# 1. Convert SAM to BAM
samtools view -S -b -o sample.bam sample.sam

# 2. Sort the BAM file
samtools sort sample.bam sample.sorted

# 3. Create BedGraph coverage file
genomeCoverageBed -bg -ibam sample.sorted.bam -g chromsizes.txt > sample.bedgraph

# 4. Convert the BedGraph file to BigWig
bedGraphToBigWig sample.bedgraph chromsizes.txt sample.bw


============================================================
Example 4
Summary: Create two coverage tracks - one per strand
Contributor: Assaf Gordon
Date: 09-July-2010
Tools used: samtools, genomeCoverageBed, bedGraphToBigWig, awk
More Information : http://cancan.cshl.edu/labmembers/gordon/files/viz2.pdf
============================================================
Workflow:
# 1. Convert SAM to BAM
samtools view -S -b -o sample.bam sample.sam

# 2. Sort the BAM file
samtools sort sample.bam sample.sorted

# 3. Create BedGraph coverage file - for positive strand
genomeCoverageBed -bg -ibam sample.sorted.bam -g chromsizes.txt -strand + > sample.pos.bedgraph

# 4. Create BedGraph coverage file - for negative strand
# use AWK to negate the coverage values
genomeCoverageBed -bg -ibam sample.sorted.bam -g chromsizes.txt -strand - | awk -v OFS="\t" '{ $4 = - $4 ; print $0 }' > sample.neg.bedgraph

# 5. Create BigWig tracks from both bedgraph files
bedGraphToBigWig sample.pos.bedgraph chromsizes.txt sample.pos.bw
bedGraphToBigWig sample.pos.bedgraph chromsizes.txt sample.neg.bw

# 6. In the UCSC Genome Browser tracks, add both tracks, with different color values:
track name="Positive Strand" color=0,0,255 type=bigWig bigDataUrl=http://myserver.edu/SAMPLE.POS.BW
track name="Negative Strand" color=255,0,0 type=bigWig bigDataUrl=http://myserver.edu/SAMPLE.NEG.BW






============================================================
Example 5
Summary: Create two coverage tracks - for exons only
Contributor: Assaf Gordon
Date: 09-July-2010
Tools used: samtools, genomeCoverageBed, bedGraphToBigWig
More Information : http://cancan.cshl.edu/labmembers/gordon/files/viz2.pdf
============================================================
Workflow:
# 1. Convert SAM to BAM
samtools view -S -b -o sample.bam sample.sam

# 2. Sort the BAM file
samtools sort sample.bam sample.sorted

# 3. Create BedGraph coverage file - calculating the coverage of exons only
genomeCoverageBed -bg -ibam sample.sorted.bam -g chromsizes.txt -split > sample.bedgraph

# 4. Create BigWig tracks from the bedgraph files
bedGraphToBigWig sample.bedgraph chromsizes.txt sample.bw






=============================================================
Example 6
Summary: Find all genomic regions where all samples have >=X coverage
Contributor: Aaron Quinlan
Date: 18-July-2010
Tools used: samtools, intersectBed, genomeCoverageBed, mergeBed, awk
=============================================================
Workflow:
##############################################################
# NOTE: The example below is for three samples in the interest
# of clarity/brevity. It could be extended to many more
# samples with a BASH or Perl/Python script.
##############################################################

# 1. Let's assume you have three human samples, 1, 2, and 3.
# Let's also assume that you have aligned the sequences for
# these samples and they are now in BAM format.
ls
1.bam 2.bam 3.bam

# 1a. These BAM files contain both concordant and discordant reads.
# Also, they contain alignments with _all_ mapping qualities.
# Let's cull the files to solely those alignments that are
# "properly-paired" (0x2) and have a mapping quality >=10.
samtools view -q 10 -f 0x2 -b 1.bam > 1.clean.bam
samtools view -q 10 -f 0x2 -b 2.bam > 2.clean.bam
samtools view -q 10 -f 0x2 -b 3.bam > 3.clean.bam

# 2. First create a BEDGRAPH of the coverage for each sample.
genomeCoverageBed -ibam 1.clean.bam -g human.hg19.genome -bga > 1.clean.bam.bedg
genomeCoverageBed -ibam 2.clean.bam -g human.hg19.genome -bga > 2.clean.bam.bedg
genomeCoverageBed -ibam 3.clean.bam -g human.hg19.genome -bga > 3.clean.bam.bedg

# 3. Now merge all intervals with sufficient depth (say >=6X).
# This will create a BED file.
awk '$4 >= 6' 1.clean.bam.bedg | mergeBed -i stdin > 1.clean.bam.gte6.bed
awk '$4 >= 6' 2.clean.bam.bedg | mergeBed -i stdin > 2.clean.bam.gte6.bed
awk '$4 >= 6' 3.clean.bam.bedg | mergeBed -i stdin > 3.clean.bam.gte6.bed

# 4. Intersect them all to get the set of intervals having depth >= 6 in _all_ 3.
intersectBed -a 1.clean.bam.gte6.bed -b 2.clean.bam.gte6.bed | \
intersectBed -a stdin -b 3.clean.bam.gte6.bed > all.clean.bam.gte6.bed

# 5. What is the total number of base pairs having >=6X in _all_ 3 samples?
awk '{sum += $3-$2} END{print sum}' all.clean.bam.gte6.bed

# 6. One could now filter variant calls against this BED file to restrict
# calls to regions that you believe to be of sufficient quality, etc.