Pascal MARTIN
23 juin 2016
NGS = High-throughput sequencing
library(Biostrings)
Containers:
- XString – BString, DNAString, RNAString, AAString
- XStringSet – multiple sequences
- XStringViews
“Masked” sequences are also supported (see ?masks)
Manipulation:
- [[ and [ for subsetting
- subseq
- toString
- Views
- … see Biostrings cheat sheet
dm3_upstream_filepath <- system.file("extdata","dm3_upstream2000.fa.gz",
package="Biostrings")
dm3_upstream <- readDNAStringSet(dm3_upstream_filepath)
dm3_upstream
## A DNAStringSet instance of length 26454
## width seq names
## [1] 2000 GTTGGTGGCCCACCAGTGC...GTTTACCGGTTGCACGGT NM_078863_up_2000...
## [2] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201794_up_2...
## [3] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201795_up_2...
## [4] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201796_up_2...
## [5] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201797_up_2...
## ... ... ...
## [26450] 2000 ATTTACAAGACTAATAAAG...ATTAAATTTCAATAAAAC NM_001111010_up_2...
## [26451] 2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001015258_up_2...
## [26452] 2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001110997_up_2...
## [26453] 2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001276245_up_2...
## [26454] 2000 CGTATGTATTAGTTAACTC...AAGTGTAAGAACAAATTG NM_001015497_up_2...
package Rsamtools and function FaFile to access large fasta files
dm3_upstream[[5]]
## 2000-letter "DNAString" instance
## seq: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAG...ATTAATCGATAGATACGGAAAGTCATCCTCGAT
toString(dm3_upstream[[5]][2:30])
## [1] "TATTTATGTAGGCGCCCGTTCCCGCAGCC"
Views(dm3_upstream[[5]],start=c(1,11),end=c(10,25))
## Views on a 2000-letter DNAString subject
## subject: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAA...TAATCGATAGATACGGAAAGTCATCCTCGAT
## views:
## start end width
## [1] 1 10 10 [TTATTTATGT]
## [2] 11 25 15 [AGGCGCCCGTTCCCG]
alphabetFrequency(dm3_upstream[1:2],baseOnly=TRUE,as.prob=TRUE)
## A C G T other
## [1,] 0.323 0.191 0.1875 0.2985 0
## [2,] 0.300 0.207 0.2230 0.2700 0
library(BSgenome.Dmelanogaster.UCSC.dm3)
names(Dmelanogaster)[1:5]
## [1] "chr2L" "chr2R" "chr3L" "chr3R" "chr4"
Dmelanogaster$chr2L
## 23011544-letter "DNAString" instance
## seq: CGACAATGCACGACAGAGGAAGCAGAACAGATAT...TATTTGCAAATTTTGATGAACCCCCCTTTCAAA
For a “masked” version of the genome, see:
library(BSgenome.Dmelanogaster.UCSC.dm3.masked)
For adding SNPs info see:
library(BSgenome)
?available.SNPs
matchPattern("KATCGATA",dm3_upstream[[592]],fixed=F)
## Views on a 2000-letter DNAString subject
## subject: TCCCAAATTAACTAATGGCTTTTCACGCAGAT...GCCTCACTTTTGTCCACATCTTTTGAAAGGC
## views:
## start end width
## [1] 72 79 8 [GATCGATA]
## [2] 512 519 8 [TATCGATA]
## [3] 518 525 8 [TATCGATA]
K is G or T, see IUPAC code
Other functions to search for patterns: matchProbePair, findPalindromes, …
vmatchPattern('TATCGATA',Dmelanogaster)
Probabilistic description of short sequences largely used for TF binding sites
EcRMotif <- MotifDb::query(MotifDb,"EcR")[[1]]
seqLogo representation (reverse complement):
EcRHits <- matchPWM(EcRMotif,Dmelanogaster$chr4)
length(EcRHits)
## [1] 1799
EcRHits[1:2]
## Views on a 1351857-letter DNAString subject
## subject: GAATTCGCGTCCGCTTACCCATGTGCCTGTGG...TAAAAGCAGCCGTCGATTTGAGATATATGAA
## views:
## start end width
## [1] 1135 1142 8 [ATGTCCTT]
## [2] 1630 1637 8 [TTCACCTT]
Minus strand: use the reverseComplement of PWM
Other packages to work with motifs:
- MotifDb
- seqLogo
- PWMEnrich
- TFBSTools
- rGADEM
- BCRANK
- MotIV
- …
For database queries (+ other tools for AA sequences):
seqinr
Other functions:
pairwiseAlignment('CTTGCAGTGGTGTATTCATAC',dm3_upstream[[1]],type='global-local')
## Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: [1] CTTGCAGTGG-TGTATTCATAC
## subject: [713] CTTGCAGTGGGTGTAT--ATAC
## score: 5.653368
\(Q = -10*{\log_{10}(P)}\) <=> \(P = 10^{-\frac{Q}{10}}\)
The ShortRead package (M. Morgan et al. 2009)
library(ShortRead)
Import a fastq file with 20K reads:
fq1_path <- system.file(package="ShortRead","extdata","E-MTAB-1147",
"ERR127302_1_subset.fastq.gz")
myFastq <- readFastq(fq1_path)
Explore with:
myFastq
## class: ShortReadQ
## length: 20000 reads; width: 72 cycles
myFastq[1:5]
## class: ShortReadQ
## length: 5 reads; width: 72 cycles
head(sread(myFastq),3)
## A DNAStringSet instance of length 3
## width seq
## [1] 72 GTCTGCTGTATCTGTGTCGGCTGTCTCGCGG...CAATGAAGGCCTGGAATGTCACTACCCCCAG
## [2] 72 CTAGGGCAATCTTTGCAGCAATGAATGCCAA...GTGGCTTTTGAGGCCAGAGCAGACCTTCGGG
## [3] 72 TGGGCTGTTCCTTCTCACTGTGGCCTGACTA...GCATTAAGAAAGAGTCACGTTTCCCAAGTCT
head(quality(myFastq),3)
## class: FastqQuality
## quality:
## A BStringSet instance of length 3
## width seq
## [1] 72 HHHHHHHHHHHHHHHHHHHHEBDBB?B:BBG...FEFBDBD@DDECEE3>:?;@@@>?=BAB?##
## [2] 72 IIIIHIIIGIIIIIIIHIIIIEGBGHIIIIH...IHIIIHIIIIIGIIIEGIIGBGE@DDGGGIG
## [3] 72 GGHBHGBGGGHHHHDHHHHHHHHHFGHHHHH...HHHHHHGHFHHHHHHHHHHHHHH8AGDGGG>
head(id(myFastq),3)
## A BStringSet instance of length 3
## width seq
## [1] 53 ERR127302.8493430 HWI-EAS350_0441:1:34:16191:2123#0/1
## [2] 53 ERR127302.21406531 HWI-EAS350_0441:1:88:9330:2587#0/1
## [3] 55 ERR127302.22173106 HWI-EAS350_0441:1:91:10434:14757#0/1
encoding(quality(myFastq))
## ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## : ; < = > ? @ A B C D E F G H I J
## 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
alphabet(sread(myFastq))
## [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
## [18] "."
Functions FastqSampler and FastqStreamer
Count the reads in a fastq file:
nr_myFastq <- 0
strm <- FastqStreamer(fq1_path,1000)
repeat {
## Get FASTQ chunk:
fq <- yield(strm)
if (length(fq) == 0)
break
## Do something on the chunk:
nr_myFastq <- nr_myFastq+length(fq)
}
close(strm) #close the connection
nr_myFastq
## [1] 20000
fastqc from Babraham Institute
Several tools available in R/BioC: ShortRead qa/qa2 functions, qrqc, seqTools, Rqc
Run library(Rqc)
on a fastq file:
rqcResultSet <- rqcQA(fq1_path,sample=T)
rqcCycleQualityPlot(rqcResultSet)
rqcCycleBaseCallsLinePlot(rqcResultSet)
Define some filters:
max1N <- nFilter(threshold=1L) #No 'Ns' in the reads
goodq <- srFilter(function(x){apply(as(quality(x),"matrix"),
1,median,na.rm=T)>=30},
name="MedianQualityAbove30")
myFilter <- compose(max1N,goodq) #combine filters
Create a function to filter and trim the reads:
FilterAndTrim <- function(fl,destination=sprintf("%s_filtered",fl))
{
stream <- FastqStreamer(fl) ## open input stream
on.exit(close(stream))
repeat {
###get fastq chunk
fq <- yield(stream)
if (length(fq)==0)
break
###TRIM first 4 and last 2 bases
fq <- narrow(fq,start=5,end=70)
###FILTER
fq <- fq[myFilter(fq)]
###write filtered fastq
writeFastq(fq, destination, mode="a")
}
}
Apply the function:
FilterAndTrim(fqFiles[1],
destination=file.path(getwd(),"FilteredFastq.fq"))
R packages: Rbowtie, QuasR (Gaidatzis et al. 2015), Rsubread (Liao, Smyth, and Shi 2013; Liao, Smyth, and Shi 2014)
Mapping quality scores (MAQ aligner in Li, Ruan, and Durbin 2008):
- base qualities (Phred scores)
- mismatches/indels
- repetitions
- paired reads
See also Wikipedia’s list of alignment tools
SAM format specifications
Useful tools: samtools, Picard tools
library(Rsamtools);library(GenomicAlignments)
library(pasillaBamSubset)
sr <- untreated1_chr4() #single-end
pr <- untreated3_chr4() #paired-end
Use samtools to index the file
indexBam(sr_bamFile)
Define what to import
which=GRanges(seqnames="chr4",
ranges=IRanges(c(75000,1190000),
c(85000,1203000)),
strand="*")
what = c("rname","strand","pos","qwidth","seq")
flag=scanBamFlag(isDuplicate=FALSE)
param=ScanBamParam(which=which,what=what,flag=flag)
import single-end reads
srbam <- readGAlignments(sr,param=param)
srbam[1:2]
## GAlignments object with 2 alignments and 5 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] chr4 + 21M13615N50M55N4M 75 72990 86734
## [2] chr4 - 4M13615N50M55N21M 75 73007 86751
## width njunc | rname strand pos qwidth
## <integer> <integer> | <factor> <factor> <integer> <integer>
## [1] 13745 2 | chr4 + 72990 75
## [2] 13745 2 | chr4 - 73007 75
## seq
## <DNAStringSet>
## [1] AAAAACTGCA...CGTAGCCACA
## [2] ATACCTGTGA...TGGACGGCTG
## -------
## seqinfo: 8 sequences from an unspecified genome
import paired-end reads
prbam <- readGAlignmentPairs(pr)
prbam[1:2]
## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns:
## seqnames strand : ranges -- ranges
## <Rle> <Rle> : <IRanges> -- <IRanges>
## [1] chr4 + : [169, 205] -- [ 326, 362]
## [2] chr4 + : [943, 979] -- [1086, 1122]
## -------
## seqinfo: 8 sequences from an unspecified genome
Packages IRanges & GenomicRanges (Lawrence et al. 2013)
See also the Introduction by Martin Morgan, 2014 and 2015
A simple IRanges:
(eg = IRanges(start = c(1, 10, 20),
end = c(4, 10, 19),
names = c("A", "B", "C")))
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## A 1 4 4
## B 10 10 1
## C 20 19 0
A bigger IRanges:
set.seed(123) #For reproducibility
start = floor(runif(10000, 1, 1000))
width = floor(runif(10000, 0, 100))
ir = IRanges(start, width=width)
ir
## IRanges object with 10000 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 288 318 31
## [2] 788 819 32
## [3] 409 495 87
## [4] 883 914 32
## [5] 940 951 12
## ... ... ... ...
## [9996] 466 492 27
## [9997] 899 983 85
## [9998] 114 124 11
## [9999] 571 595 25
## [10000] 900 965 66
Vector-like behavior: length, [
Accessors: start, end, width, names
length(ir)
## [1] 10000
width(ir[1:4])
## [1] 31 32 87 32
names(eg)
## [1] "A" "B" "C"
mid(ir[1:4])
## [1] 303 803 452 898
irl <- split(ir,width(ir))
irl[[1]][1:3]
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 321 320 0
## [2] 600 599 0
## [3] 184 183 0
length(irl)
## [1] 100
head(elementNROWS(irl))
## 0 1 2 3 4 5
## 96 83 108 95 84 110
List = list with all elements of the same type
start(irl)[1:2]
## IntegerList of length 2
## [["0"]] 321 600 184 297 276 816 87 729 ... 858 52 85 188 308 289 936 669
## [["1"]] 915 576 706 235 678 647 451 138 ... 638 66 979 740 300 869 433 645
log(start(irl)[1:2])
## NumericList of length 2
## [["0"]] 5.77144112313002 6.39692965521615 ... 6.50578406012823
## [["1"]] 6.81892406527552 6.35610766069589 ... 6.46925031679577
Package GenomicRanges:
library(GenomicRanges)
A typical use case:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
(gr <- exons(txdb))
## GRanges object with 76920 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr2L [7529, 8116] + | 1
## [2] chr2L [8193, 8589] + | 2
## [3] chr2L [8193, 9484] + | 3
## ... ... ... ... . ...
## [76918] chrUextra [523024, 523048] - | 76918
## [76919] chrUextra [523024, 523086] - | 76919
## [76920] chrUextra [523060, 523086] - | 76920
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
Operations on GRanges are generally seqnames- and strand-aware (see argument ignore.strand)
(grl <- exonsBy(txdb,by="gene"))
## GRangesList object of length 15682:
## $FBgn0000003
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr3R [2648220, 2648518] + | 45123 <NA>
##
## $FBgn0000008
## GRanges object with 13 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## [1] chr2R [18024494, 18024531] + | 20314 <NA>
## [2] chr2R [18024496, 18024713] + | 20315 <NA>
## [3] chr2R [18024938, 18025756] + | 20316 <NA>
## ... ... ... ... . ... ...
## [11] chr2R [18059821, 18059938] + | 20328 <NA>
## [12] chr2R [18060002, 18060339] + | 20329 <NA>
## [13] chr2R [18060002, 18060346] + | 20330 <NA>
##
## ...
## <15680 more elements>
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
See ?‘intra-range-methods’
Also reflect, narrow and threebands, restrict and trim
See ?‘inter-range-methods’
See ?‘setops-methods’
See also ?‘nearest-methods’ including nearest, precede, follow and distance,
Q: Number of TSS located at >500bp from another gene?
Get all genes and transcrits:
Dmg <- genes(txdb)
Dmt <- transcriptsBy(txdb,by="gene")
Get all TSS:
Dm_tss <- unlist(reduce(promoters(Dmt,up=0,down=1),min.gap=0L))
Proportion of TSS overlapping with more than 1 gene +/- 500bp:
mean(countOverlaps(Dm_tss,Dmg+500) > 1) #!strand-aware
## [1] 0.2509943
mean(countOverlaps(Dm_tss,Dmg+500,ignore.strand=T) > 1)
## [1] 0.5167952
Obtaining the overlaps:
fov <- findOverlaps(Dm_tss,Dmg+500,ignore.strand=T) ; fov[1:3]
## Hits object with 3 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 1
## [2] 1 1383
## [3] 2 2
## -------
## queryLength: 20869 / subjectLength: 15682
Two genes on opposite strands that are overlapping:
Dmg[subjectHits(fov)[queryHits(fov)==1]]
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## FBgn0000003 chr3R [2648220, 2648518] + | FBgn0000003
## FBgn0011904 chr3R [2648685, 2648757] - | FBgn0011904
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
Q: How many reads in srbam overlap with gene FBgn0002521?
length(subsetByOverlaps(srbam,Dmg["FBgn0002521"]))
## [1] 358
Q: How many reads in srbam overlap with exons of FBgn0002521?
length(srbam[overlapsAny(srbam,grl[["FBgn0002521"]])])
## [1] 346
Reads mapping on exons:
ctex <- summarizeOverlaps(features = grl[seqnames(Dmg)=="chr4"],
reads = srbam,
mode = Union)
head(assays(ctex)$counts)
## reads
## FBgn0002521 346
## FBgn0004607 0
## FBgn0004859 4
## FBgn0005558 0
## FBgn0005561 0
## FBgn0005666 0
(Huber et al. 2015)
srbam <- readGAlignments(sr)
(covr <- coverage(srbam))
## RleList of length 8
## $chr2L
## integer-Rle of length 23011544 with 1 run
## Lengths: 23011544
## Values : 0
##
## $chr2R
## integer-Rle of length 21146708 with 1 run
## Lengths: 21146708
## Values : 0
##
## $chr3L
## integer-Rle of length 24543557 with 1 run
## Lengths: 24543557
## Values : 0
##
## $chr3R
## integer-Rle of length 27905053 with 1 run
## Lengths: 27905053
## Values : 0
##
## $chr4
## integer-Rle of length 1351857 with 122061 runs
## Lengths: 891 27 5 12 13 45 ... 3 106 75 1600 75 1659
## Values : 0 1 2 3 4 5 ... 6 0 1 0 1 0
##
## ...
## <3 more elements>
Genes on chromosome 4:
gn4 <- Dmg[seqnames(Dmg)=="chr4"]
Extract gene level profiles:
profgn4 <- covr[gn4]
profgn4[strand(gn4)=="-"] <- lapply(profgn4[strand(gn4)=="-"],rev)
names(profgn4) <- names(gn4) ; profgn4[1:2]
## RleList of length 2
## $FBgn0002521
## integer-Rle of length 9178 with 825 runs
## Lengths: 86 5 26 1 10 2 13 13 5 5 ... 29 1 1 1 1 1 11 3 4 13
## Values : 0 1 2 4 5 6 7 8 9 8 ... 8 7 6 7 8 7 6 5 6 5
##
## $FBgn0004607
## integer-Rle of length 37983 with 41 runs
## Lengths: 11762 75 1336 15 60 ... 94 75 1986 75 829
## Values : 0 1 0 1 2 ... 0 1 0 1 0
Extract the first 1Kb as a matrix:
profgn4 <- profgn4[elementNROWS(profgn4)>=1000]
profgn4 <- as(lapply(profgn4,window,1,1000),"RleList")
mat1kb <- matrix(as.numeric(unlist(profgn4, use.names=F)),
nrow=length(profgn4), byrow=T,
dimnames=list(names(profgn4),NULL))
mat1kb <- mat1kb[rowSums(mat1kb)>0,]
Plot the average profile:
df1Kb <- data.frame(Coordinate=1:1000,
Coverage=apply(mat1kb,2,mean,na.rm=T,trim=0.03))
ggplot(df1Kb,aes(x=Coordinate,y=Coverage))+
geom_line()
getSeq(Dmelanogaster,gn4[1:2])
## A DNAStringSet instance of length 2
## width seq names
## [1] 9178 ATCGAATACCCATGCCAAACA...ATAAAAGTACGTTAACAGCA FBgn0002521
## [2] 37983 CAGCTCAGTCGAAAAAAAACG...AACGTACATTTATACGTCCT FBgn0004607
Views(Dmelanogaster,gn4[1:2])
## BSgenomeViews object with 2 views and 1 metadata column:
## seqnames ranges strand dna
## <Rle> <IRanges> <Rle> <DNAStringSet>
## FBgn0002521 chr4 [1193094, 1202271] - [ATCGAATACC...GTTAACAGCA]
## FBgn0004607 chr4 [ 522436, 560418] + [CAGCTCAGTC...TATACGTCCT]
## | gene_id
## | <character>
## FBgn0002521 | FBgn0002521
## FBgn0004607 | FBgn0004607
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
select(org.Dm.eg.db,
keys=c('FBgn0015664','FBgn0015602'),keytype="FLYBASE",
columns=c('SYMBOL','UNIGENE','ENTREZID','FLYBASECG'))
## FLYBASE SYMBOL UNIGENE ENTREZID FLYBASECG
## 1 FBgn0015664 Dref Dm.7169 34328 CG5838
## 2 FBgn0015602 BEAF-32 Dm.9114 36645 CG10159
library(AnnotationHub)
hub <- AnnotationHub()
length(hub) # >43500 datasets
unique(hub$dataprovider)
head(unique(hub$species))
head(unique(ah$rdataclass))
Thank you for your attention!
Davis, S., and P. S. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–7.
Dobin, A., C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras. 2013. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21.
Durinck, S., Y. Moreau, A. Kasprzyk, S. Davis, B. De Moor, A. Brazma, and W. Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21 (16): 3439–40.
Durinck, S., P. T. Spellman, E. Birney, and W. Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor Package BiomaRt.” Nat Protoc 4 (8): 1184–91.
Gaidatzis, D., A. Lerch, F. Hahne, and M. B. Stadler. 2015. “QuasR: Quantification and Annotation of Short Reads in R.” Bioinformatics 31 (7): 1130–2.
Gentleman, Robert C, Vincent J. Carey, Douglas M. Bates, and others. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biology 5: R80. http://genomebiology.com/2004/5/10/R80.
Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.
Kim, D., B. Langmead, and S. L. Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4): 357–60.
Kim, D., G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg. 2013. “TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions.” Genome Biol. 14 (4): R36.
Langmead, B., and S. L. Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nat. Methods 9 (4): 357–59.
Langmead, B., C. Trapnell, M. Pop, and S. L. Salzberg. 2009. “Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome.” Genome Biol. 10 (3): R25.
Lawrence, M., and M. Morgan. 2014. “Scalable Genomics with R and Bioconductor.” Statistical Science 29 (2): 214–26.
Lawrence, M., R. Gentleman, and V. Carey. 2009. “Rtracklayer: An R Package for Interfacing with Genome Browsers.” Bioinformatics 25 (14): 1841–2.
Lawrence, M., W. Huber, H. Pages, P. Aboyoun, M. Carlson, R. Gentleman, M. T. Morgan, and V. J. Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8): e1003118.
Li, H., and R. Durbin. 2009. “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform.” Bioinformatics 25 (14): 1754–60.
———. 2010. “Fast and Accurate Long-Read Alignment with Burrows-Wheeler Transform.” Bioinformatics 26 (5): 589–95.
Li, H., J. Ruan, and R. Durbin. 2008. “Mapping Short DNA Sequencing Reads and Calling Variants Using Mapping Quality Scores.” Genome Res. 18 (11): 1851–8.
Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread Aligner: Fast, Accurate and Scalable Read Mapping by Seed-and-Vote.” Nucleic Acids Res. 41 (10): e108.
———. 2014. “FeatureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.
Marco-Sola, S., M. Sammeth, R. Guigo, and P. Ribeca. 2012. “The GEM Mapper: Fast, Accurate and Versatile Alignment by Filtration.” Nat. Methods 9 (12): 1185–8.
Morgan, M., S. Anders, M. Lawrence, P. Aboyoun, H. Pages, and R. Gentleman. 2009. “ShortRead: A Bioconductor Package for Input, Quality Assessment and Exploration of High-Throughput Sequence Data.” Bioinformatics 25 (19): 2607–8.
Pyl, P. T., J. Gehring, B. Fischer, and W. Huber. 2014. “H5vc: Scalable Nucleotide Tallies with HDF5.” Bioinformatics 30 (10): 1464–6.
Reuter, J. A., D. V. Spacek, and M. P. Snyder. 2015. “High-Throughput Sequencing Technologies.” Mol. Cell 58 (4): 586–97.
Rivera, C. M., and B. Ren. 2013. “Mapping Human Epigenomes.” Cell 155 (1): 39–55.
Shendure, J., and E. Lieberman Aiden. 2012. “The Expanding Scope of DNA Sequencing.” Nat. Biotechnol. 30 (11): 1084–94.
Tippmann, S. 2015. “Programming Tools: Adventures with R.” Nature 517 (7532): 109–10.
Trapnell, C., L. Pachter, and S. L. Salzberg. 2009. “TopHat: Discovering Splice Junctions with RNA-Seq.” Bioinformatics 25 (9): 1105–11.
Wang, K., D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, et al. 2010. “MapSplice: Accurate Mapping of RNA-Seq Reads for Splice Junction Discovery.” Nucleic Acids Res. 38 (18): e178.
Zhu, Y., R. M. Stephens, P. S. Meltzer, and S. R. Davis. 2013. “SRAdb: Query and Use Public Next-Generation Sequencing Data from Within R.” BMC Bioinformatics 14: 19.