Core Bioconductor packages for NGS data analysis

Pascal MARTIN

23 juin 2016

Outline

  1. What is NGS? What’s in BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. I/E and visualization

What is NGS?

What is NGS?

NGS = High-throughput sequencing

What is NGS?

NGS applications

Soon et al., Mol Syst Biol, 2013

NGS applications

The Bioconductor project

Bioconductor and the NGS workflow

Bioconductor and the NGS workflow

A flood of specialized packages

Outline

  1. What is NGS? What’s in BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. I/E and visualization

The fasta format

Biostrings containers and accessors

library(Biostrings)

Containers:
- XString – BString, DNAString, RNAString, AAString
- XStringSet – multiple sequences
- XStringViews
“Masked” sequences are also supported (see ?masks)

Manipulation:
- [[ and [ for subsetting
- subseq
- toString
- Views
- … see Biostrings cheat sheet

Importing sequences from a fasta file:

dm3_upstream_filepath <- system.file("extdata","dm3_upstream2000.fa.gz",
                                    package="Biostrings")
dm3_upstream <- readDNAStringSet(dm3_upstream_filepath)
dm3_upstream
##   A DNAStringSet instance of length 26454
##         width seq                                      names               
##     [1]  2000 GTTGGTGGCCCACCAGTGC...GTTTACCGGTTGCACGGT NM_078863_up_2000...
##     [2]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201794_up_2...
##     [3]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201795_up_2...
##     [4]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201796_up_2...
##     [5]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201797_up_2...
##     ...   ... ...
## [26450]  2000 ATTTACAAGACTAATAAAG...ATTAAATTTCAATAAAAC NM_001111010_up_2...
## [26451]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001015258_up_2...
## [26452]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001110997_up_2...
## [26453]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001276245_up_2...
## [26454]  2000 CGTATGTATTAGTTAACTC...AAGTGTAAGAACAAATTG NM_001015497_up_2...

package Rsamtools and function FaFile to access large fasta files

dm3_upstream[[5]]
##   2000-letter "DNAString" instance
## seq: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAG...ATTAATCGATAGATACGGAAAGTCATCCTCGAT
toString(dm3_upstream[[5]][2:30])
## [1] "TATTTATGTAGGCGCCCGTTCCCGCAGCC"
Views(dm3_upstream[[5]],start=c(1,11),end=c(10,25))
##   Views on a 2000-letter DNAString subject
## subject: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAA...TAATCGATAGATACGGAAAGTCATCCTCGAT
## views:
##     start end width
## [1]     1  10    10 [TTATTTATGT]
## [2]    11  25    15 [AGGCGCCCGTTCCCG]
alphabetFrequency(dm3_upstream[1:2],baseOnly=TRUE,as.prob=TRUE)
##          A     C      G      T other
## [1,] 0.323 0.191 0.1875 0.2985     0
## [2,] 0.300 0.207 0.2230 0.2700     0

Whole genome sequences in BSgenome packages:

library(BSgenome.Dmelanogaster.UCSC.dm3)
names(Dmelanogaster)[1:5]
## [1] "chr2L" "chr2R" "chr3L" "chr3R" "chr4"
Dmelanogaster$chr2L
##   23011544-letter "DNAString" instance
## seq: CGACAATGCACGACAGAGGAAGCAGAACAGATAT...TATTTGCAAATTTTGATGAACCCCCCTTTCAAA

For a “masked” version of the genome, see:

library(BSgenome.Dmelanogaster.UCSC.dm3.masked)

For adding SNPs info see:

library(BSgenome)
?available.SNPs

Pattern matching

matchPattern("KATCGATA",dm3_upstream[[592]],fixed=F)
##   Views on a 2000-letter DNAString subject
## subject: TCCCAAATTAACTAATGGCTTTTCACGCAGAT...GCCTCACTTTTGTCCACATCTTTTGAAAGGC
## views:
##     start end width
## [1]    72  79     8 [GATCGATA]
## [2]   512 519     8 [TATCGATA]
## [3]   518 525     8 [TATCGATA]

K is G or T, see IUPAC code

Other functions to search for patterns: matchProbePair, findPalindromes, …

vmatchPattern('TATCGATA',Dmelanogaster)

Position weight matrix (PWM)

Probabilistic description of short sequences largely used for TF binding sites

EcRMotif <- MotifDb::query(MotifDb,"EcR")[[1]]

seqLogo representation (reverse complement):

EcRHits <- matchPWM(EcRMotif,Dmelanogaster$chr4)
length(EcRHits)
## [1] 1799
EcRHits[1:2]
##   Views on a 1351857-letter DNAString subject
## subject: GAATTCGCGTCCGCTTACCCATGTGCCTGTGG...TAAAAGCAGCCGTCGATTTGAGATATATGAA
## views:
##     start  end width
## [1]  1135 1142     8 [ATGTCCTT]
## [2]  1630 1637     8 [TTCACCTT]

Minus strand: use the reverseComplement of PWM

More sequence tools:

Other packages to work with motifs:
- MotifDb
- seqLogo
- PWMEnrich
- TFBSTools
- rGADEM
- BCRANK
- MotIV
- …

For database queries (+ other tools for AA sequences):
seqinr

Other functions:

pairwiseAlignment('CTTGCAGTGGTGTATTCATAC',dm3_upstream[[1]],type='global-local')
## Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern:   [1] CTTGCAGTGG-TGTATTCATAC 
## subject: [713] CTTGCAGTGGGTGTAT--ATAC 
## score: 5.653368

Fastq format

\(Q = -10*{\log_{10}(P)}\) <=> \(P = 10^{-\frac{Q}{10}}\)

Working with fastq files

The ShortRead package (M. Morgan et al. 2009)

library(ShortRead)

Import a fastq file with 20K reads:

fq1_path <- system.file(package="ShortRead","extdata","E-MTAB-1147",
                        "ERR127302_1_subset.fastq.gz")
myFastq <- readFastq(fq1_path)

Explore with:

myFastq
## class: ShortReadQ
## length: 20000 reads; width: 72 cycles
myFastq[1:5]
## class: ShortReadQ
## length: 5 reads; width: 72 cycles
head(sread(myFastq),3)
##   A DNAStringSet instance of length 3
##     width seq
## [1]    72 GTCTGCTGTATCTGTGTCGGCTGTCTCGCGG...CAATGAAGGCCTGGAATGTCACTACCCCCAG
## [2]    72 CTAGGGCAATCTTTGCAGCAATGAATGCCAA...GTGGCTTTTGAGGCCAGAGCAGACCTTCGGG
## [3]    72 TGGGCTGTTCCTTCTCACTGTGGCCTGACTA...GCATTAAGAAAGAGTCACGTTTCCCAAGTCT
head(quality(myFastq),3)
## class: FastqQuality
## quality:
##   A BStringSet instance of length 3
##     width seq
## [1]    72 HHHHHHHHHHHHHHHHHHHHEBDBB?B:BBG...FEFBDBD@DDECEE3>:?;@@@>?=BAB?##
## [2]    72 IIIIHIIIGIIIIIIIHIIIIEGBGHIIIIH...IHIIIHIIIIIGIIIEGIIGBGE@DDGGGIG
## [3]    72 GGHBHGBGGGHHHHDHHHHHHHHHFGHHHHH...HHHHHHGHFHHHHHHHHHHHHHH8AGDGGG>
head(id(myFastq),3)
##   A BStringSet instance of length 3
##     width seq
## [1]    53 ERR127302.8493430 HWI-EAS350_0441:1:34:16191:2123#0/1
## [2]    53 ERR127302.21406531 HWI-EAS350_0441:1:88:9330:2587#0/1
## [3]    55 ERR127302.22173106 HWI-EAS350_0441:1:91:10434:14757#0/1
encoding(quality(myFastq))
##  !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /  0  1  2  3  4  5  6  7  8  9 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
##  :  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J 
## 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
alphabet(sread(myFastq))
##  [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
## [18] "."

Large fastq files

Functions FastqSampler and FastqStreamer
Count the reads in a fastq file:

nr_myFastq <- 0
strm <- FastqStreamer(fq1_path,1000)
repeat {
 ## Get FASTQ chunk:
  fq <- yield(strm)
  if (length(fq) == 0)
   break
 ## Do something on the chunk:
  nr_myFastq <- nr_myFastq+length(fq)
}
close(strm) #close the connection
nr_myFastq
## [1] 20000

QC on fastq files

fastqc from Babraham Institute

Several tools available in R/BioC: ShortRead qa/qa2 functions, qrqc, seqTools, Rqc

Run library(Rqc) on a fastq file:

rqcResultSet <- rqcQA(fq1_path,sample=T)
rqcCycleQualityPlot(rqcResultSet)

rqcCycleBaseCallsLinePlot(rqcResultSet)

Filtering a fastq file

Define some filters:

max1N <- nFilter(threshold=1L) #No 'Ns' in the reads
goodq <- srFilter(function(x){apply(as(quality(x),"matrix"),
                            1,median,na.rm=T)>=30},
                 name="MedianQualityAbove30")
myFilter <- compose(max1N,goodq) #combine filters

Create a function to filter and trim the reads:

FilterAndTrim <- function(fl,destination=sprintf("%s_filtered",fl))
{
  stream <- FastqStreamer(fl) ## open input stream
  on.exit(close(stream))
  repeat {
    ###get fastq chunk  
    fq <- yield(stream)
    if (length(fq)==0)
      break
    ###TRIM first 4 and last 2 bases
    fq <- narrow(fq,start=5,end=70)
    ###FILTER
    fq <- fq[myFilter(fq)]
    ###write filtered fastq
    writeFastq(fq, destination, mode="a")
  }
}

Apply the function:

FilterAndTrim(fqFiles[1],
              destination=file.path(getwd(),"FilteredFastq.fq"))

Outline

  1. What is NGS? What’s in BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. I/E and visualization

Alignment of NGS reads

R packages: Rbowtie, QuasR (Gaidatzis et al. 2015), Rsubread (Liao, Smyth, and Shi 2013; Liao, Smyth, and Shi 2014)

Mapping quality scores (MAQ aligner in Li, Ruan, and Durbin 2008):
- base qualities (Phred scores)
- mismatches/indels
- repetitions
- paired reads

See also Wikipedia’s list of alignment tools

SAM/BAM format

SAM format specifications
Useful tools: samtools, Picard tools

BAM file import

library(Rsamtools);library(GenomicAlignments)
library(pasillaBamSubset)
sr <- untreated1_chr4() #single-end
pr <- untreated3_chr4() #paired-end

Use samtools to index the file

indexBam(sr_bamFile)

Define what to import

which=GRanges(seqnames="chr4",
              ranges=IRanges(c(75000,1190000),
                             c(85000,1203000)),
              strand="*")
what = c("rname","strand","pos","qwidth","seq")
flag=scanBamFlag(isDuplicate=FALSE)
param=ScanBamParam(which=which,what=what,flag=flag)

import single-end reads

srbam <- readGAlignments(sr,param=param)
srbam[1:2]
## GAlignments object with 2 alignments and 5 metadata columns:
##       seqnames strand             cigar    qwidth     start       end
##          <Rle>  <Rle>       <character> <integer> <integer> <integer>
##   [1]     chr4      + 21M13615N50M55N4M        75     72990     86734
##   [2]     chr4      - 4M13615N50M55N21M        75     73007     86751
##           width     njunc |    rname   strand       pos    qwidth
##       <integer> <integer> | <factor> <factor> <integer> <integer>
##   [1]     13745         2 |     chr4        +     72990        75
##   [2]     13745         2 |     chr4        -     73007        75
##                           seq
##                <DNAStringSet>
##   [1] AAAAACTGCA...CGTAGCCACA
##   [2] ATACCTGTGA...TGGACGGCTG
##   -------
##   seqinfo: 8 sequences from an unspecified genome

import paired-end reads

prbam <- readGAlignmentPairs(pr)
prbam[1:2]
## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns:
##       seqnames strand :     ranges --       ranges
##          <Rle>  <Rle> :  <IRanges> --    <IRanges>
##   [1]     chr4      + : [169, 205] -- [ 326,  362]
##   [2]     chr4      + : [943, 979] -- [1086, 1122]
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Outline

  1. What is NGS? What’s in BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. I/E and visualization

Working with genomic ranges

Packages IRanges & GenomicRanges (Lawrence et al. 2013)
See also the Introduction by Martin Morgan, 2014 and 2015

IRanges Definition

A simple IRanges:

(eg = IRanges(start = c(1, 10, 20),
              end = c(4, 10, 19),
              names = c("A", "B", "C")))
## IRanges object with 3 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   A         1         4         4
##   B        10        10         1
##   C        20        19         0

A bigger IRanges:

set.seed(123) #For reproducibility
start = floor(runif(10000, 1, 1000))
width = floor(runif(10000, 0, 100))
ir = IRanges(start, width=width)
ir
## IRanges object with 10000 ranges and 0 metadata columns:
##               start       end     width
##           <integer> <integer> <integer>
##       [1]       288       318        31
##       [2]       788       819        32
##       [3]       409       495        87
##       [4]       883       914        32
##       [5]       940       951        12
##       ...       ...       ...       ...
##    [9996]       466       492        27
##    [9997]       899       983        85
##    [9998]       114       124        11
##    [9999]       571       595        25
##   [10000]       900       965        66

Vector-like behavior: length, [
Accessors: start, end, width, names

length(ir)
## [1] 10000
width(ir[1:4])
## [1] 31 32 87 32
names(eg)
## [1] "A" "B" "C"
mid(ir[1:4])
## [1] 303 803 452 898

IRangesList

irl  <- split(ir,width(ir))
irl[[1]][1:3]
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       321       320         0
##   [2]       600       599         0
##   [3]       184       183         0
length(irl)
## [1] 100
head(elementNROWS(irl))
##   0   1   2   3   4   5 
##  96  83 108  95  84 110

List objects

List = list with all elements of the same type

start(irl)[1:2]
## IntegerList of length 2
## [["0"]] 321 600 184 297 276 816 87 729 ... 858 52 85 188 308 289 936 669
## [["1"]] 915 576 706 235 678 647 451 138 ... 638 66 979 740 300 869 433 645
log(start(irl)[1:2])
## NumericList of length 2
## [["0"]] 5.77144112313002 6.39692965521615 ... 6.50578406012823
## [["1"]] 6.81892406527552 6.35610766069589 ... 6.46925031679577

Genomic ranges

Package GenomicRanges:

library(GenomicRanges)

GRanges

A typical use case:

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
(gr <- exons(txdb))
## GRanges object with 76920 ranges and 1 metadata column:
##            seqnames           ranges strand |   exon_id
##               <Rle>        <IRanges>  <Rle> | <integer>
##       [1]     chr2L     [7529, 8116]      + |         1
##       [2]     chr2L     [8193, 8589]      + |         2
##       [3]     chr2L     [8193, 9484]      + |         3
##       ...       ...              ...    ... .       ...
##   [76918] chrUextra [523024, 523048]      - |     76918
##   [76919] chrUextra [523024, 523086]      - |     76919
##   [76920] chrUextra [523060, 523086]      - |     76920
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

GRanges


Operations on GRanges are generally seqnames- and strand-aware (see argument ignore.strand)

GRangesList

(grl <- exonsBy(txdb,by="gene"))
## GRangesList object of length 15682:
## $FBgn0000003 
## GRanges object with 1 range and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr3R [2648220, 2648518]      + |     45123        <NA>
## 
## $FBgn0000008 
## GRanges object with 13 ranges and 2 metadata columns:
##        seqnames               ranges strand | exon_id exon_name
##    [1]    chr2R [18024494, 18024531]      + |   20314      <NA>
##    [2]    chr2R [18024496, 18024713]      + |   20315      <NA>
##    [3]    chr2R [18024938, 18025756]      + |   20316      <NA>
##    ...      ...                  ...    ... .     ...       ...
##   [11]    chr2R [18059821, 18059938]      + |   20328      <NA>
##   [12]    chr2R [18060002, 18060339]      + |   20329      <NA>
##   [13]    chr2R [18060002, 18060346]      + |   20330      <NA>
## 
## ...
## <15680 more elements>
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome

Intra-ranges operations

See ?‘intra-range-methods’


Also reflect, narrow and threebands, restrict and trim

Inter-ranges operations

See ?‘inter-range-methods’

Set operations - Nearest methods

See ?‘setops-methods’


See also ?‘nearest-methods’ including nearest, precede, follow and distance,

Between-range operations / Overlaps

Q: Number of TSS located at >500bp from another gene?

Get all genes and transcrits:

Dmg <- genes(txdb) 
Dmt <- transcriptsBy(txdb,by="gene")

Get all TSS:

Dm_tss <- unlist(reduce(promoters(Dmt,up=0,down=1),min.gap=0L))

Proportion of TSS overlapping with more than 1 gene +/- 500bp:

mean(countOverlaps(Dm_tss,Dmg+500) > 1) #!strand-aware
## [1] 0.2509943
mean(countOverlaps(Dm_tss,Dmg+500,ignore.strand=T) > 1)
## [1] 0.5167952

Overlaps

Obtaining the overlaps:

fov <- findOverlaps(Dm_tss,Dmg+500,ignore.strand=T) ; fov[1:3]
## Hits object with 3 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1           1
##   [2]         1        1383
##   [3]         2           2
##   -------
##   queryLength: 20869 / subjectLength: 15682

Two genes on opposite strands that are overlapping:

Dmg[subjectHits(fov)[queryHits(fov)==1]]
## GRanges object with 2 ranges and 1 metadata column:
##               seqnames             ranges strand |     gene_id
##                  <Rle>          <IRanges>  <Rle> | <character>
##   FBgn0000003    chr3R [2648220, 2648518]      + | FBgn0000003
##   FBgn0011904    chr3R [2648685, 2648757]      - | FBgn0011904
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

Q: How many reads in srbam overlap with gene FBgn0002521?

length(subsetByOverlaps(srbam,Dmg["FBgn0002521"]))
## [1] 358

Q: How many reads in srbam overlap with exons of FBgn0002521?

length(srbam[overlapsAny(srbam,grl[["FBgn0002521"]])])
## [1] 346

Counting reads mapping on features

Reads mapping on exons:

ctex <- summarizeOverlaps(features = grl[seqnames(Dmg)=="chr4"],
                             reads = srbam,
                              mode = Union)

head(assays(ctex)$counts)
##             reads
## FBgn0002521   346
## FBgn0004607     0
## FBgn0004859     4
## FBgn0005558     0
## FBgn0005561     0
## FBgn0005666     0

SummarizedExperiment

(Huber et al. 2015)

Rle

srbam <- readGAlignments(sr)
(covr <- coverage(srbam))
## RleList of length 8
## $chr2L
## integer-Rle of length 23011544 with 1 run
##   Lengths: 23011544
##   Values :        0
## 
## $chr2R
## integer-Rle of length 21146708 with 1 run
##   Lengths: 21146708
##   Values :        0
## 
## $chr3L
## integer-Rle of length 24543557 with 1 run
##   Lengths: 24543557
##   Values :        0
## 
## $chr3R
## integer-Rle of length 27905053 with 1 run
##   Lengths: 27905053
##   Values :        0
## 
## $chr4
## integer-Rle of length 1351857 with 122061 runs
##   Lengths:  891   27    5   12   13   45 ...    3  106   75 1600   75 1659
##   Values :    0    1    2    3    4    5 ...    6    0    1    0    1    0
## 
## ...
## <3 more elements>

Average profile on gene bodies

Genes on chromosome 4:

gn4 <- Dmg[seqnames(Dmg)=="chr4"]

Extract gene level profiles:

profgn4 <- covr[gn4]
profgn4[strand(gn4)=="-"] <- lapply(profgn4[strand(gn4)=="-"],rev)
names(profgn4) <- names(gn4) ; profgn4[1:2]
## RleList of length 2
## $FBgn0002521
## integer-Rle of length 9178 with 825 runs
##   Lengths: 86  5 26  1 10  2 13 13  5  5 ... 29  1  1  1  1  1 11  3  4 13
##   Values :  0  1  2  4  5  6  7  8  9  8 ...  8  7  6  7  8  7  6  5  6  5
## 
## $FBgn0004607
## integer-Rle of length 37983 with 41 runs
##   Lengths: 11762    75  1336    15    60 ...    94    75  1986    75   829
##   Values :     0     1     0     1     2 ...     0     1     0     1     0

Extract the first 1Kb as a matrix:

profgn4 <- profgn4[elementNROWS(profgn4)>=1000]
profgn4 <- as(lapply(profgn4,window,1,1000),"RleList")
mat1kb <- matrix(as.numeric(unlist(profgn4, use.names=F)),
                 nrow=length(profgn4), byrow=T,
                 dimnames=list(names(profgn4),NULL))
mat1kb <- mat1kb[rowSums(mat1kb)>0,]

Plot the average profile:

df1Kb <- data.frame(Coordinate=1:1000,
                    Coverage=apply(mat1kb,2,mean,na.rm=T,trim=0.03))
ggplot(df1Kb,aes(x=Coordinate,y=Coverage))+
  geom_line()

Extract sequences in a BSgenome using a GRanges

getSeq(Dmelanogaster,gn4[1:2])
##   A DNAStringSet instance of length 2
##     width seq                                          names               
## [1]  9178 ATCGAATACCCATGCCAAACA...ATAAAAGTACGTTAACAGCA FBgn0002521
## [2] 37983 CAGCTCAGTCGAAAAAAAACG...AACGTACATTTATACGTCCT FBgn0004607
Views(Dmelanogaster,gn4[1:2])
## BSgenomeViews object with 2 views and 1 metadata column:
##               seqnames             ranges strand                       dna
##                  <Rle>          <IRanges>  <Rle>            <DNAStringSet>
##   FBgn0002521     chr4 [1193094, 1202271]      - [ATCGAATACC...GTTAACAGCA]
##   FBgn0004607     chr4 [ 522436,  560418]      + [CAGCTCAGTC...TATACGTCCT]
##               |     gene_id
##               | <character>
##   FBgn0002521 | FBgn0002521
##   FBgn0004607 | FBgn0004607
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

Outline

  1. What is NGS? What’s in BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. I/E and visualization

Annotations

select(org.Dm.eg.db,
       keys=c('FBgn0015664','FBgn0015602'),keytype="FLYBASE",
       columns=c('SYMBOL','UNIGENE','ENTREZID','FLYBASECG'))
##       FLYBASE  SYMBOL UNIGENE ENTREZID FLYBASECG
## 1 FBgn0015664    Dref Dm.7169    34328    CG5838
## 2 FBgn0015602 BEAF-32 Dm.9114    36645   CG10159

AnnotationHub

library(AnnotationHub)
hub <- AnnotationHub()
length(hub) # >43500 datasets
unique(hub$dataprovider)
head(unique(hub$species))
head(unique(ah$rdataclass))

Outline

  1. What is NGS? What’s in BioC for NGS?
  2. Working with sequences
  3. Working with aligned reads
  4. Working with ranges
  5. Annotations
  6. I/E and visualization

Import/Export

Visualization

Conclusions



To go beyond



Major Bioconductor contributors







Thank you for your attention!

References

Davis, S., and P. S. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–7.

Dobin, A., C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras. 2013. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21.

Durinck, S., Y. Moreau, A. Kasprzyk, S. Davis, B. De Moor, A. Brazma, and W. Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21 (16): 3439–40.

Durinck, S., P. T. Spellman, E. Birney, and W. Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor Package BiomaRt.” Nat Protoc 4 (8): 1184–91.

Gaidatzis, D., A. Lerch, F. Hahne, and M. B. Stadler. 2015. “QuasR: Quantification and Annotation of Short Reads in R.” Bioinformatics 31 (7): 1130–2.

Gentleman, Robert C, Vincent J. Carey, Douglas M. Bates, and others. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biology 5: R80. http://genomebiology.com/2004/5/10/R80.

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.

Kim, D., B. Langmead, and S. L. Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4): 357–60.

Kim, D., G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg. 2013. “TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions.” Genome Biol. 14 (4): R36.

Langmead, B., and S. L. Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nat. Methods 9 (4): 357–59.

Langmead, B., C. Trapnell, M. Pop, and S. L. Salzberg. 2009. “Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome.” Genome Biol. 10 (3): R25.

Lawrence, M., and M. Morgan. 2014. “Scalable Genomics with R and Bioconductor.” Statistical Science 29 (2): 214–26.

Lawrence, M., R. Gentleman, and V. Carey. 2009. “Rtracklayer: An R Package for Interfacing with Genome Browsers.” Bioinformatics 25 (14): 1841–2.

Lawrence, M., W. Huber, H. Pages, P. Aboyoun, M. Carlson, R. Gentleman, M. T. Morgan, and V. J. Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8): e1003118.

Li, H., and R. Durbin. 2009. “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform.” Bioinformatics 25 (14): 1754–60.

———. 2010. “Fast and Accurate Long-Read Alignment with Burrows-Wheeler Transform.” Bioinformatics 26 (5): 589–95.

Li, H., J. Ruan, and R. Durbin. 2008. “Mapping Short DNA Sequencing Reads and Calling Variants Using Mapping Quality Scores.” Genome Res. 18 (11): 1851–8.

Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread Aligner: Fast, Accurate and Scalable Read Mapping by Seed-and-Vote.” Nucleic Acids Res. 41 (10): e108.

———. 2014. “FeatureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.

Marco-Sola, S., M. Sammeth, R. Guigo, and P. Ribeca. 2012. “The GEM Mapper: Fast, Accurate and Versatile Alignment by Filtration.” Nat. Methods 9 (12): 1185–8.

Morgan, M., S. Anders, M. Lawrence, P. Aboyoun, H. Pages, and R. Gentleman. 2009. “ShortRead: A Bioconductor Package for Input, Quality Assessment and Exploration of High-Throughput Sequence Data.” Bioinformatics 25 (19): 2607–8.

Pyl, P. T., J. Gehring, B. Fischer, and W. Huber. 2014. “H5vc: Scalable Nucleotide Tallies with HDF5.” Bioinformatics 30 (10): 1464–6.

Reuter, J. A., D. V. Spacek, and M. P. Snyder. 2015. “High-Throughput Sequencing Technologies.” Mol. Cell 58 (4): 586–97.

Rivera, C. M., and B. Ren. 2013. “Mapping Human Epigenomes.” Cell 155 (1): 39–55.

Shendure, J., and E. Lieberman Aiden. 2012. “The Expanding Scope of DNA Sequencing.” Nat. Biotechnol. 30 (11): 1084–94.

Tippmann, S. 2015. “Programming Tools: Adventures with R.” Nature 517 (7532): 109–10.

Trapnell, C., L. Pachter, and S. L. Salzberg. 2009. “TopHat: Discovering Splice Junctions with RNA-Seq.” Bioinformatics 25 (9): 1105–11.

Wang, K., D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, et al. 2010. “MapSplice: Accurate Mapping of RNA-Seq Reads for Splice Junction Discovery.” Nucleic Acids Res. 38 (18): e178.

Zhu, Y., R. M. Stephens, P. S. Meltzer, and S. R. Davis. 2013. “SRAdb: Query and Use Public Next-Generation Sequencing Data from Within R.” BMC Bioinformatics 14: 19.