Importing sequences from a fasta file:

dm3_upstream_filepath <- system.file("extdata","dm3_upstream2000.fa.gz",
                                    package="Biostrings")
dm3_upstream <- readDNAStringSet(dm3_upstream_filepath)
dm3_upstream

##   A DNAStringSet instance of length 26454
##         width seq                                      names               
##     [1]  2000 GTTGGTGGCCCACCAGTGC...GTTTACCGGTTGCACGGT NM_078863_up_2000...
##     [2]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201794_up_2...
##     [3]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201795_up_2...
##     [4]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201796_up_2...
##     [5]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201797_up_2...
##     ...   ... ...
## [26450]  2000 ATTTACAAGACTAATAAAG...ATTAAATTTCAATAAAAC NM_001111010_up_2...
## [26451]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001015258_up_2...
## [26452]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001110997_up_2...
## [26453]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001276245_up_2...
## [26454]  2000 CGTATGTATTAGTTAACTC...AAGTGTAAGAACAAATTG NM_001015497_up_2...

package Rsamtools and function FaFile to access large fasta files

dm3_upstream[[5]]

##   2000-letter "DNAString" instance
## seq: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAAAG...ATTAATCGATAGATACGGAAAGTCATCCTCGAT

toString(dm3_upstream[[5]][2:30])

## [1] "TATTTATGTAGGCGCCCGTTCCCGCAGCC"

Views(dm3_upstream[[5]],start=c(1,11),end=c(10,25))

##   Views on a 2000-letter DNAString subject
## subject: TTATTTATGTAGGCGCCCGTTCCCGCAGCCAA...TAATCGATAGATACGGAAAGTCATCCTCGAT
## views:
##     start end width
## [1]     1  10    10 [TTATTTATGT]
## [2]    11  25    15 [AGGCGCCCGTTCCCG]

alphabetFrequency(dm3_upstream[1:2],baseOnly=TRUE,as.prob=TRUE)

##          A     C      G      T other
## [1,] 0.323 0.191 0.1875 0.2985     0
## [2,] 0.300 0.207 0.2230 0.2700     0

Whole genome sequences in BSgenome packages:

library(BSgenome.Dmelanogaster.UCSC.dm3)

names(Dmelanogaster)[1:5]

## [1] "chr2L" "chr2R" "chr3L" "chr3R" "chr4"

Dmelanogaster$chr2L

##   23011544-letter "DNAString" instance
## seq: CGACAATGCACGACAGAGGAAGCAGAACAGATAT...TATTTGCAAATTTTGATGAACCCCCCTTTCAAA

For a “masked” version of the genome, see:

library(BSgenome.Dmelanogaster.UCSC.dm3.masked)

For adding SNPs info see:

library(BSgenome)
?available.SNPs

Pattern matching

matchPattern("KATCGATA",dm3_upstream[[592]],fixed=F)

##   Views on a 2000-letter DNAString subject
## subject: TCCCAAATTAACTAATGGCTTTTCACGCAGAT...GCCTCACTTTTGTCCACATCTTTTGAAAGGC
## views:
##     start end width
## [1]    72  79     8 [GATCGATA]
## [2]   512 519     8 [TATCGATA]
## [3]   518 525     8 [TATCGATA]

K is G or T, see IUPAC code

Other functions to search for patterns: matchProbePair, findPalindromes, …

vmatchPattern('TATCGATA',Dmelanogaster)

Position weight matrix (PWM)

Probabilistic description of short sequences largely used for TF binding sites

EcRMotif <- MotifDb::query(MotifDb,"EcR")[[1]]

seqLogo representation (reverse complement):

EcRHits <- matchPWM(EcRMotif,Dmelanogaster$chr4)
length(EcRHits)

## [1] 1799

EcRHits[1:2]

##   Views on a 1351857-letter DNAString subject
## subject: GAATTCGCGTCCGCTTACCCATGTGCCTGTGG...TAAAAGCAGCCGTCGATTTGAGATATATGAA
## views:
##     start  end width
## [1]  1135 1142     8 [ATGTCCTT]
## [2]  1630 1637     8 [TTCACCTT]

Minus strand: use the reverseComplement of PWM

More sequence tools:

Other packages to work with motifs:
- MotifDb
- seqLogo
- PWMEnrich
- TFBSTools
- rGADEM
- BCRANK
- MotIV
- …

For database queries (+ other tools for AA sequences):
seqinr

Other functions:

pairwiseAlignment('CTTGCAGTGGTGTATTCATAC',dm3_upstream[[1]],type='global-local')

## Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern:   [1] CTTGCAGTGG-TGTATTCATAC 
## subject: [713] CTTGCAGTGGGTGTAT--ATAC 
## score: 5.653368

Fastq format

Extension: .fq; .fastq
https://en.wikipedia.org/wiki/FASTQ_format

\(Q = -10*{\log_{10}(P)}\) <=> \(P = 10^{-\frac{Q}{10}}\)

Working with fastq files

The ShortRead package (M. Morgan et al. 2009)

library(ShortRead)

Import a fastq file with 20K reads:

fq1_path <- system.file(package="ShortRead","extdata","E-MTAB-1147",
                        "ERR127302_1_subset.fastq.gz")
myFastq <- readFastq(fq1_path)

Explore with:

myFastq

## class: ShortReadQ
## length: 20000 reads; width: 72 cycles

myFastq[1:5]

## class: ShortReadQ
## length: 5 reads; width: 72 cycles

head(sread(myFastq),3)

##   A DNAStringSet instance of length 3
##     width seq
## [1]    72 GTCTGCTGTATCTGTGTCGGCTGTCTCGCGG...CAATGAAGGCCTGGAATGTCACTACCCCCAG
## [2]    72 CTAGGGCAATCTTTGCAGCAATGAATGCCAA...GTGGCTTTTGAGGCCAGAGCAGACCTTCGGG
## [3]    72 TGGGCTGTTCCTTCTCACTGTGGCCTGACTA...GCATTAAGAAAGAGTCACGTTTCCCAAGTCT

head(quality(myFastq),3)

## class: FastqQuality
## quality:
##   A BStringSet instance of length 3
##     width seq
## [1]    72 HHHHHHHHHHHHHHHHHHHHEBDBB?B:BBG...FEFBDBD@DDECEE3>:?;@@@>?=BAB?##
## [2]    72 IIIIHIIIGIIIIIIIHIIIIEGBGHIIIIH...IHIIIHIIIIIGIIIEGIIGBGE@DDGGGIG
## [3]    72 GGHBHGBGGGHHHHDHHHHHHHHHFGHHHHH...HHHHHHGHFHHHHHHHHHHHHHH8AGDGGG>

head(id(myFastq),3)

##   A BStringSet instance of length 3
##     width seq
## [1]    53 ERR127302.8493430 HWI-EAS350_0441:1:34:16191:2123#0/1
## [2]    53 ERR127302.21406531 HWI-EAS350_0441:1:88:9330:2587#0/1
## [3]    55 ERR127302.22173106 HWI-EAS350_0441:1:91:10434:14757#0/1

encoding(quality(myFastq))

##  !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /  0  1  2  3  4  5  6  7  8  9 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
##  :  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J 
## 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41

alphabet(sread(myFastq))

##  [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
## [18] "."

Large fastq files

Functions FastqSampler and FastqStreamer
Count the reads in a fastq file:

nr_myFastq <- 0
strm <- FastqStreamer(fq1_path,1000)
repeat {
 ## Get FASTQ chunk:
  fq <- yield(strm)
  if (length(fq) == 0)
   break
 ## Do something on the chunk:
  nr_myFastq <- nr_myFastq+length(fq)
}
close(strm) #close the connection
nr_myFastq

## [1] 20000

QC on fastq files

fastqc from Babraham Institute

Several tools available in R/BioC: ShortRead qa/qa2 functions, qrqc, seqTools, Rqc

Run library(Rqc) on a fastq file:

rqcResultSet <- rqcQA(fq1_path,sample=T)

rqcCycleQualityPlot(rqcResultSet)

rqcCycleBaseCallsLinePlot(rqcResultSet)

Filtering a fastq file

Define some filters:

max1N <- nFilter(threshold=1L) #No 'Ns' in the reads
goodq <- srFilter(function(x){apply(as(quality(x),"matrix"),
                            1,median,na.rm=T)>=30},
                 name="MedianQualityAbove30")
myFilter <- compose(max1N,goodq) #combine filters

Create a function to filter and trim the reads:

FilterAndTrim <- function(fl,destination=sprintf("%s_filtered",fl))
{
  stream <- FastqStreamer(fl) ## open input stream
  on.exit(close(stream))
  repeat {
    ###get fastq chunk  
    fq <- yield(stream)
    if (length(fq)==0)
      break
    ###TRIM first 4 and last 2 bases
    fq <- narrow(fq,start=5,end=70)
    ###FILTER
    fq <- fq[myFilter(fq)]
    ###write filtered fastq
    writeFastq(fq, destination, mode="a")
  }
}

Apply the function:

FilterAndTrim(fqFiles[1],
              destination=file.path(getwd(),"FilteredFastq.fq"))

Outline

What is NGS? What’s in BioC for NGS?
Working with sequences
Working with aligned reads
Working with ranges
Annotations
I/E and visualization

Alignment of NGS reads

R packages: Rbowtie, QuasR (Gaidatzis et al. 2015), Rsubread (Liao, Smyth, and Shi 2013; Liao, Smyth, and Shi 2014)

Mapping quality scores (MAQ aligner in Li, Ruan, and Durbin 2008):
- base qualities (Phred scores)
- mismatches/indels
- repetitions
- paired reads

Aligning unspliced reads:
- Bowtie (Langmead et al. 2009) / Bowtie2 (Langmead and Salzberg 2012); BWA (Li and Durbin 2009; Li and Durbin 2010); GEM mapper (Marco-Sola et al. 2012)
Aligning spliced reads
- TopHat2 (Trapnell, Pachter, and Salzberg 2009; Kim et al. 2013) / HISAT2 (Kim, Langmead, and Salzberg 2015); MapSplice (Wang et al. 2010); STAR (Dobin et al. 2013)

SAM/BAM format

SAM format specifications
Useful tools: samtools, Picard tools

BAM file import

library(Rsamtools);library(GenomicAlignments)
library(pasillaBamSubset)
sr <- untreated1_chr4() #single-end
pr <- untreated3_chr4() #paired-end

Use samtools to index the file

indexBam(sr_bamFile)

Define what to import

which=GRanges(seqnames="chr4",
              ranges=IRanges(c(75000,1190000),
                             c(85000,1203000)),
              strand="*")
what = c("rname","strand","pos","qwidth","seq")
flag=scanBamFlag(isDuplicate=FALSE)
param=ScanBamParam(which=which,what=what,flag=flag)

import single-end reads

srbam <- readGAlignments(sr,param=param)
srbam[1:2]

## GAlignments object with 2 alignments and 5 metadata columns:
##       seqnames strand             cigar    qwidth     start       end
##          <Rle>  <Rle>       <character> <integer> <integer> <integer>
##   [1]     chr4      + 21M13615N50M55N4M        75     72990     86734
##   [2]     chr4      - 4M13615N50M55N21M        75     73007     86751
##           width     njunc |    rname   strand       pos    qwidth
##       <integer> <integer> | <factor> <factor> <integer> <integer>
##   [1]     13745         2 |     chr4        +     72990        75
##   [2]     13745         2 |     chr4        -     73007        75
##                           seq
##                <DNAStringSet>
##   [1] AAAAACTGCA...CGTAGCCACA
##   [2] ATACCTGTGA...TGGACGGCTG
##   -------
##   seqinfo: 8 sequences from an unspecified genome

import paired-end reads

prbam <- readGAlignmentPairs(pr)
prbam[1:2]

## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns:
##       seqnames strand :     ranges --       ranges
##          <Rle>  <Rle> :  <IRanges> --    <IRanges>
##   [1]     chr4      + : [169, 205] -- [ 326,  362]
##   [2]     chr4      + : [943, 979] -- [1086, 1122]
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Outline

What is NGS? What’s in BioC for NGS?
Working with sequences
Working with aligned reads
Working with ranges
Annotations
I/E and visualization

Working with genomic ranges

Packages IRanges & GenomicRanges (Lawrence et al. 2013)
See also the Introduction by Martin Morgan, 2014 and 2015

Importance of range concept
- Genomic data and annotations can be represented by ranges
- Many biological questions correspond to queries on ranges:
  - How many reads overlap each gene / exon / splice junction?
  - Where does my protein bind in ChiP-seq experiment?
  - How many of my peaks overlap with promoters?
  - …
Where do range-like objects come from?
- DATA (e.g. BAM, BED, WIG files,…)
- ANNOTATIONS (BED, GFF/GTF, TxDb*, AnnotationHub,…)

IRanges Definition

What is an IRanges object?
- 2 vectors out of ‘start’, ‘end’ and ‘width’ (SEW; \(end = start + width - 1\))
- closed intervals (i.e. include end points)
- zero-width convention: \(width \geq 0\) ; \(end=start-1 \Leftrightarrow width=0\)
- can be named

A simple IRanges:

(eg = IRanges(start = c(1, 10, 20),
              end = c(4, 10, 19),
              names = c("A", "B", "C")))

## IRanges object with 3 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   A         1         4         4
##   B        10        10         1
##   C        20        19         0

A bigger IRanges:

set.seed(123) #For reproducibility
start = floor(runif(10000, 1, 1000))
width = floor(runif(10000, 0, 100))
ir = IRanges(start, width=width)
ir

## IRanges object with 10000 ranges and 0 metadata columns:
##               start       end     width
##           <integer> <integer> <integer>
##       [1]       288       318        31
##       [2]       788       819        32
##       [3]       409       495        87
##       [4]       883       914        32
##       [5]       940       951        12
##       ...       ...       ...       ...
##    [9996]       466       492        27
##    [9997]       899       983        85
##    [9998]       114       124        11
##    [9999]       571       595        25
##   [10000]       900       965        66

Vector-like behavior: length, [
Accessors: start, end, width, names

length(ir)

## [1] 10000

width(ir[1:4])

## [1] 31 32 87 32

names(eg)

## [1] "A" "B" "C"

mid(ir[1:4])

## [1] 303 803 452 898

IRangesList

irl  <- split(ir,width(ir))
irl[[1]][1:3]

## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]       321       320         0
##   [2]       600       599         0
##   [3]       184       183         0

length(irl)

## [1] 100

head(elementNROWS(irl))

##   0   1   2   3   4   5 
##  96  83 108  95  84 110

List objects

List = list with all elements of the same type

start(irl)[1:2]

## IntegerList of length 2
## [["0"]] 321 600 184 297 276 816 87 729 ... 858 52 85 188 308 289 936 669
## [["1"]] 915 576 706 235 678 647 451 138 ... 638 66 979 740 300 869 433 645

log(start(irl)[1:2])

## NumericList of length 2
## [["0"]] 5.77144112313002 6.39692965521615 ... 6.50578406012823
## [["1"]] 6.81892406527552 6.35610766069589 ... 6.46925031679577

Genomic ranges

GRanges builds on IRanges
- ‘seqnames’ (typically chromosomes) and ‘strand’ (‘+’, ‘-’ or ’*’)
- (optional) ‘seqinfo’ for genome information
- (optional) ‘mcols’ for ‘metadata’ data frame on each range
Genome coordinates
- 1-based
- ‘left-most’ (‘start’ of ranges on the ‘minus’ strand are the left-most coordinate, rather than the 5’ coordinate)

Package GenomicRanges:

library(GenomicRanges)

GRanges

A typical use case:

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
(gr <- exons(txdb))

## GRanges object with 76920 ranges and 1 metadata column:
##            seqnames           ranges strand |   exon_id
##               <Rle>        <IRanges>  <Rle> | <integer>
##       [1]     chr2L     [7529, 8116]      + |         1
##       [2]     chr2L     [8193, 8589]      + |         2
##       [3]     chr2L     [8193, 9484]      + |         3
##       ...       ...              ...    ... .       ...
##   [76918] chrUextra [523024, 523048]      - |     76918
##   [76919] chrUextra [523024, 523086]      - |     76919
##   [76920] chrUextra [523060, 523086]      - |     76920
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

GRanges

Operations on GRanges are generally seqnames- and strand-aware (see argument ignore.strand)

GRangesList

(grl <- exonsBy(txdb,by="gene"))

## GRangesList object of length 15682:
## $FBgn0000003 
## GRanges object with 1 range and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr3R [2648220, 2648518]      + |     45123        <NA>
## 
## $FBgn0000008 
## GRanges object with 13 ranges and 2 metadata columns:
##        seqnames               ranges strand | exon_id exon_name
##    [1]    chr2R [18024494, 18024531]      + |   20314      <NA>
##    [2]    chr2R [18024496, 18024713]      + |   20315      <NA>
##    [3]    chr2R [18024938, 18025756]      + |   20316      <NA>
##    ...      ...                  ...    ... .     ...       ...
##   [11]    chr2R [18059821, 18059938]      + |   20328      <NA>
##   [12]    chr2R [18060002, 18060339]      + |   20329      <NA>
##   [13]    chr2R [18060002, 18060346]      + |   20330      <NA>
## 
## ...
## <15680 more elements>
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome

Intra-ranges operations

See ?‘intra-range-methods’

Also reflect, narrow and threebands, restrict and trim

Inter-ranges operations

See ?‘inter-range-methods’

Set operations - Nearest methods

See ?‘setops-methods’

See also ?‘nearest-methods’ including nearest, precede, follow and distance,

Between-range operations / Overlaps

Q: Number of TSS located at >500bp from another gene?

Get all genes and transcrits:

Dmg <- genes(txdb) 
Dmt <- transcriptsBy(txdb,by="gene")

Get all TSS:

Dm_tss <- unlist(reduce(promoters(Dmt,up=0,down=1),min.gap=0L))

Proportion of TSS overlapping with more than 1 gene +/- 500bp:

mean(countOverlaps(Dm_tss,Dmg+500) > 1) #!strand-aware

## [1] 0.2509943

mean(countOverlaps(Dm_tss,Dmg+500,ignore.strand=T) > 1)

## [1] 0.5167952

Overlaps

Obtaining the overlaps:

fov <- findOverlaps(Dm_tss,Dmg+500,ignore.strand=T) ; fov[1:3]

## Hits object with 3 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1           1
##   [2]         1        1383
##   [3]         2           2
##   -------
##   queryLength: 20869 / subjectLength: 15682

Two genes on opposite strands that are overlapping:

Dmg[subjectHits(fov)[queryHits(fov)==1]]

## GRanges object with 2 ranges and 1 metadata column:
##               seqnames             ranges strand |     gene_id
##                  <Rle>          <IRanges>  <Rle> | <character>
##   FBgn0000003    chr3R [2648220, 2648518]      + | FBgn0000003
##   FBgn0011904    chr3R [2648685, 2648757]      - | FBgn0011904
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

Q: How many reads in srbam overlap with gene FBgn0002521?

length(subsetByOverlaps(srbam,Dmg["FBgn0002521"]))

## [1] 358

Q: How many reads in srbam overlap with exons of FBgn0002521?

length(srbam[overlapsAny(srbam,grl[["FBgn0002521"]])])

## [1] 346

Counting reads mapping on features

Reads mapping on exons:

ctex <- summarizeOverlaps(features = grl[seqnames(Dmg)=="chr4"],
                             reads = srbam,
                              mode = Union)

head(assays(ctex)$counts)

##             reads
## FBgn0002521   346
## FBgn0004607     0
## FBgn0004859     4
## FBgn0005558     0
## FBgn0005561     0
## FBgn0005666     0

SummarizedExperiment

(Huber et al. 2015)

Rle

srbam <- readGAlignments(sr)
(covr <- coverage(srbam))

## RleList of length 8
## $chr2L
## integer-Rle of length 23011544 with 1 run
##   Lengths: 23011544
##   Values :        0
## 
## $chr2R
## integer-Rle of length 21146708 with 1 run
##   Lengths: 21146708
##   Values :        0
## 
## $chr3L
## integer-Rle of length 24543557 with 1 run
##   Lengths: 24543557
##   Values :        0
## 
## $chr3R
## integer-Rle of length 27905053 with 1 run
##   Lengths: 27905053
##   Values :        0
## 
## $chr4
## integer-Rle of length 1351857 with 122061 runs
##   Lengths:  891   27    5   12   13   45 ...    3  106   75 1600   75 1659
##   Values :    0    1    2    3    4    5 ...    6    0    1    0    1    0
## 
## ...
## <3 more elements>

Average profile on gene bodies

Genes on chromosome 4:

gn4 <- Dmg[seqnames(Dmg)=="chr4"]

Extract gene level profiles:

profgn4 <- covr[gn4]
profgn4[strand(gn4)=="-"] <- lapply(profgn4[strand(gn4)=="-"],rev)
names(profgn4) <- names(gn4) ; profgn4[1:2]

## RleList of length 2
## $FBgn0002521
## integer-Rle of length 9178 with 825 runs
##   Lengths: 86  5 26  1 10  2 13 13  5  5 ... 29  1  1  1  1  1 11  3  4 13
##   Values :  0  1  2  4  5  6  7  8  9  8 ...  8  7  6  7  8  7  6  5  6  5
## 
## $FBgn0004607
## integer-Rle of length 37983 with 41 runs
##   Lengths: 11762    75  1336    15    60 ...    94    75  1986    75   829
##   Values :     0     1     0     1     2 ...     0     1     0     1     0

Extract the first 1Kb as a matrix:

profgn4 <- profgn4[elementNROWS(profgn4)>=1000]
profgn4 <- as(lapply(profgn4,window,1,1000),"RleList")
mat1kb <- matrix(as.numeric(unlist(profgn4, use.names=F)),
                 nrow=length(profgn4), byrow=T,
                 dimnames=list(names(profgn4),NULL))
mat1kb <- mat1kb[rowSums(mat1kb)>0,]

Plot the average profile:

df1Kb <- data.frame(Coordinate=1:1000,
                    Coverage=apply(mat1kb,2,mean,na.rm=T,trim=0.03))
ggplot(df1Kb,aes(x=Coordinate,y=Coverage))+
  geom_line()

Extract sequences in a BSgenome using a GRanges

getSeq(Dmelanogaster,gn4[1:2])

##   A DNAStringSet instance of length 2
##     width seq                                          names               
## [1]  9178 ATCGAATACCCATGCCAAACA...ATAAAAGTACGTTAACAGCA FBgn0002521
## [2] 37983 CAGCTCAGTCGAAAAAAAACG...AACGTACATTTATACGTCCT FBgn0004607

Views(Dmelanogaster,gn4[1:2])

## BSgenomeViews object with 2 views and 1 metadata column:
##               seqnames             ranges strand                       dna
##                  <Rle>          <IRanges>  <Rle>            <DNAStringSet>
##   FBgn0002521     chr4 [1193094, 1202271]      - [ATCGAATACC...GTTAACAGCA]
##   FBgn0004607     chr4 [ 522436,  560418]      + [CAGCTCAGTC...TATACGTCCT]
##               |     gene_id
##               | <character>
##   FBgn0002521 | FBgn0002521
##   FBgn0004607 | FBgn0004607
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

Outline

What is NGS? What’s in BioC for NGS?
Working with sequences
Working with aligned reads
Working with ranges
Annotations
I/E and visualization

Annotations

AnnotationDBi
- select
- keys / keytype
- columns

select(org.Dm.eg.db,
       keys=c('FBgn0015664','FBgn0015602'),keytype="FLYBASE",
       columns=c('SYMBOL','UNIGENE','ENTREZID','FLYBASECG'))

##       FLYBASE  SYMBOL UNIGENE ENTREZID FLYBASECG
## 1 FBgn0015664    Dref Dm.7169    34328    CG5838
## 2 FBgn0015602 BEAF-32 Dm.9114    36645   CG10159

GenomicFeatures: build and manipulate TxDb
OrganismDbi : Human, Mouse, Rat (BiocView)

AnnotationHub

AnnotationHub

library(AnnotationHub)
hub <- AnnotationHub()
length(hub) # >43500 datasets
unique(hub$dataprovider)
head(unique(hub$species))
head(unique(ah$rdataclass))

Data from ENCODE, EpigenomeRoadMap, 1000 Genomes project, Ensembl, NCBI, Inparanoid8, dbSNP,…
Screen datasets from large projects
liftOver files
Assemblies and annotation files for non-model organisms
…

Outline

What is NGS? What’s in BioC for NGS?
Working with sequences
Working with aligned reads
Working with ranges
Annotations
I/E and visualization

Import/Export

rtracklayer (Lawrence, Gentleman, and Carey 2009)
- import/export of track data (GFF, BED, WIG, bedGraph, BigWig)
- interface to UCSC genome browser and table browser
SRAdb (Zhu et al. 2013)
- browse and interact with NCBI’s Short Read Archive
- interface with IGV
biomaRt (Durinck et al. 2009; Durinck et al. 2005)
- interface to Biomart databases
GEOquery (Davis and Meltzer 2007)
- interface to NCBI GEO

Visualization

Genome browser-type: Gviz, ggbio, Sushi
More specialized packages: gtrellis, GenVisR, OmicCircos, circlize,…

Conclusions

A solid infrastructure for analyzing all kinds of NGS data
Efficient manipulation of large strings/ranges/vectors and sets of
Active development and most recent advances in analysis methods
Large and active community provides support and updates
Great functions for data summaries and graphical displays - ideal for exploratory analyses
Large projects also have their solutions (BiocParrallel, GenomicFiles, h5vc,…; see (Lawrence and Morgan 2014; Pyl et al. 2014))

To go beyond

Bioconductor courses & conferences
Bioconductor workflows
BiocViews & Package vignettes
Bioconductor articles & books
Stackoverflow, Biostars, SEQanswers,…

Major Bioconductor contributors

Michael Lawrence, Genentech Research and Early Development, USA.
Martin Morgan, Roswell Park Cancer Institute, USA.
Valerie Obenchain, Roswell Park Cancer Institute, USA.
Hervé Pagès, Fred Hutchinson Cancer Research Center, USA.
Patrick Aboyoun, Oracle, USA.

Bioconductor Core Team

Thank you for your attention!

References

Davis, S., and P. S. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–7.

Dobin, A., C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson, and T. R. Gingeras. 2013. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21.

Durinck, S., Y. Moreau, A. Kasprzyk, S. Davis, B. De Moor, A. Brazma, and W. Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21 (16): 3439–40.

Durinck, S., P. T. Spellman, E. Birney, and W. Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor Package BiomaRt.” Nat Protoc 4 (8): 1184–91.

Gaidatzis, D., A. Lerch, F. Hahne, and M. B. Stadler. 2015. “QuasR: Quantification and Annotation of Short Reads in R.” Bioinformatics 31 (7): 1130–2.

Gentleman, Robert C, Vincent J. Carey, Douglas M. Bates, and others. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biology 5: R80. http://genomebiology.com/2004/5/10/R80.

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.

Kim, D., B. Langmead, and S. L. Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4): 357–60.

Kim, D., G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg. 2013. “TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions.” Genome Biol. 14 (4): R36.

Langmead, B., and S. L. Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nat. Methods 9 (4): 357–59.

Langmead, B., C. Trapnell, M. Pop, and S. L. Salzberg. 2009. “Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome.” Genome Biol. 10 (3): R25.

Lawrence, M., and M. Morgan. 2014. “Scalable Genomics with R and Bioconductor.” Statistical Science 29 (2): 214–26.

Lawrence, M., R. Gentleman, and V. Carey. 2009. “Rtracklayer: An R Package for Interfacing with Genome Browsers.” Bioinformatics 25 (14): 1841–2.

Lawrence, M., W. Huber, H. Pages, P. Aboyoun, M. Carlson, R. Gentleman, M. T. Morgan, and V. J. Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8): e1003118.

Li, H., and R. Durbin. 2009. “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform.” Bioinformatics 25 (14): 1754–60.

———. 2010. “Fast and Accurate Long-Read Alignment with Burrows-Wheeler Transform.” Bioinformatics 26 (5): 589–95.

Li, H., J. Ruan, and R. Durbin. 2008. “Mapping Short DNA Sequencing Reads and Calling Variants Using Mapping Quality Scores.” Genome Res. 18 (11): 1851–8.

Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread Aligner: Fast, Accurate and Scalable Read Mapping by Seed-and-Vote.” Nucleic Acids Res. 41 (10): e108.

———. 2014. “FeatureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.

Marco-Sola, S., M. Sammeth, R. Guigo, and P. Ribeca. 2012. “The GEM Mapper: Fast, Accurate and Versatile Alignment by Filtration.” Nat. Methods 9 (12): 1185–8.

Morgan, M., S. Anders, M. Lawrence, P. Aboyoun, H. Pages, and R. Gentleman. 2009. “ShortRead: A Bioconductor Package for Input, Quality Assessment and Exploration of High-Throughput Sequence Data.” Bioinformatics 25 (19): 2607–8.

Pyl, P. T., J. Gehring, B. Fischer, and W. Huber. 2014. “H5vc: Scalable Nucleotide Tallies with HDF5.” Bioinformatics 30 (10): 1464–6.

Reuter, J. A., D. V. Spacek, and M. P. Snyder. 2015. “High-Throughput Sequencing Technologies.” Mol. Cell 58 (4): 586–97.

Rivera, C. M., and B. Ren. 2013. “Mapping Human Epigenomes.” Cell 155 (1): 39–55.

Shendure, J., and E. Lieberman Aiden. 2012. “The Expanding Scope of DNA Sequencing.” Nat. Biotechnol. 30 (11): 1084–94.

Tippmann, S. 2015. “Programming Tools: Adventures with R.” Nature 517 (7532): 109–10.

Trapnell, C., L. Pachter, and S. L. Salzberg. 2009. “TopHat: Discovering Splice Junctions with RNA-Seq.” Bioinformatics 25 (9): 1105–11.

Wang, K., D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, et al. 2010. “MapSplice: Accurate Mapping of RNA-Seq Reads for Splice Junction Discovery.” Nucleic Acids Res. 38 (18): e178.

Zhu, Y., R. M. Stephens, P. S. Meltzer, and S. R. Davis. 2013. “SRAdb: Query and Use Public Next-Generation Sequencing Data from Within R.” BMC Bioinformatics 14: 19.

Core Bioconductor packages for NGS data analysis

Outline

What is NGS?

What is NGS?

What is NGS?

NGS applications

NGS applications

The Bioconductor project

Bioconductor and the NGS workflow

Bioconductor and the NGS workflow

A flood of specialized packages

Outline

The fasta format

Biostrings containers and accessors

Importing sequences from a fasta file:

Whole genome sequences in BSgenome packages:

Pattern matching

Position weight matrix (PWM)

More sequence tools:

Fastq format

Working with fastq files

Large fastq files

QC on fastq files

Filtering a fastq file

Outline

Alignment of NGS reads

SAM/BAM format

BAM file import

Outline

Working with genomic ranges

IRanges Definition

IRangesList

List objects

Genomic ranges

GRanges

GRanges

GRangesList

Intra-ranges operations

Inter-ranges operations

Set operations - Nearest methods

Between-range operations / Overlaps

Overlaps

Counting reads mapping on features

SummarizedExperiment

Rle

Average profile on gene bodies

Extract sequences in a BSgenome using a GRanges

Outline

Annotations

AnnotationHub

Outline

Import/Export

Visualization

Conclusions

To go beyond

Major Bioconductor contributors

References