Nicer scatter plots in GGgally ggpairs-ggduo
GGally is an extension of ggplot2. I use it mainly as a replacement/refinement of pairs in base R.
Here, I want to replace the default scatter plot used to compare 2 continuous variables with something nicer, especially adapted to plotting large number of points.
Load the library and some RNA-seq data:
library(GGally)
library(airway)
data(airway)
The airway dataset contains data for 4 cell lines treated or not with dexamethasone:
colData(airway)[,1:3]
SampleName | cell | dex | |
---|---|---|---|
SRR1039508 | GSM1275862 | N61311 | untrt |
SRR1039509 | GSM1275863 | N61311 | trt |
SRR1039512 | GSM1275866 | N052611 | untrt |
SRR1039513 | GSM1275867 | N052611 | trt |
SRR1039516 | GSM1275870 | N080611 | untrt |
SRR1039517 | GSM1275871 | N080611 | trt |
SRR1039520 | GSM1275874 | N061011 | untrt |
SRR1039521 | GSM1275875 | N061011 | trt |
Let’s keep a fraction of the data for simplicity
df <- as.data.frame(assays(airway)$counts[,1:4]) #first 4 columns
df <- df[rowSums(df)>4,] #keep genes with some counts
set.seed(123)
df <- df[sample.int(nrow(df),1e3),] #random 1K genes
default behavior of GGally::ggpair
:
GGally::ggpairs(log2(df+1))
Here is a better function to make the scatterplots in the lower triangle. It is largely based on this blog post and implements:
- viridis coloring of points based on expression level (practically based on effect size captured by the first principal component). I added some code to get consistent coloring even if orientation of the first PC differs between pairs of columns.
alpha
based on inverse of 2D density, i.e. low alpha when there are lots of points and higher when points are more scattered
Also, the use of GGally:eval_data_col
is described
here:
GGscatterPlot <- function(data, mapping, ...,
method = "spearman") {
#Get correlation coefficient
x <- GGally::eval_data_col(data, mapping$x)
y <- GGally::eval_data_col(data, mapping$y)
cor <- cor(x, y, method = method)
#Assemble data frame
df <- data.frame(x = x, y = y)
# PCA
nonNull <- x!=0 & y!=0
dfpc <- prcomp(~x+y, df[nonNull,])
df$cols <- predict(dfpc, df)[,1]
# Define the direction of color range based on PC1 orientation:
dfsum <- x+y
colDirection <- ifelse(dfsum[which.max(df$cols)] <
dfsum[which.min(df$cols)],
1,
-1)
#Get 2D density for alpha
dens2D <- MASS::kde2d(df$x, df$y)
df$density <- fields::interp.surface(dens2D ,
df[,c("x", "y")])
if (any(df$density==0)) {
mini2D = min(df$density[df$density!=0]) #smallest non zero value
df$density[df$density==0] <- mini2D
}
#Prepare plot
pp <- ggplot(df, aes(x=x, y=y, color = cols, alpha = 1/density)) +
ggplot2::geom_point(shape=16, show.legend = FALSE) +
ggplot2::scale_color_viridis_c(direction = colDirection) +
# scale_color_gradient(low = "#0091ff", high = "#f0650e") +
ggplot2::scale_alpha(range = c(.05, .6)) +
ggplot2::geom_abline(intercept = 0, slope = 1, col="darkred") +
ggplot2::geom_label(
data = data.frame(
xlabel = min(x, na.rm = TRUE),
ylabel = max(y, na.rm = TRUE),
lab = round(cor, digits = 3)),
mapping = ggplot2::aes(x = xlabel,
y = ylabel,
label = lab),
hjust = 0, vjust = 1,
size = 3, fontface = "bold",
inherit.aes = FALSE # do not inherit anything from the ...
) +
theme_minimal()
return(pp)
}
Now use in ggpairs (adjust correlation method to be consistent with defaults in GGscatterPlot):
GGally::ggpairs(log2(df+1),
1:4,
lower = list(continuous = GGscatterPlot),
upper = list(continuous = wrap("cor", method= "spearman")))
Now with Pearson correlation coefficient (and removing the correlation coefficient in the upper triangle since we already have them in the lower triangle):
GGally::ggpairs(log2(df+1),
1:4,
lower = list(continuous = wrap(GGscatterPlot, method="pearson")),
upper = "blank") +
theme_minimal() +
theme(panel.grid.minor = element_blank())
We can use the upper triangle to plot other info since the scatter plots include the correlation coefficient.
I illustrate it here with the correlation coefficient for genes with >15 exons vs those with <15 exons:
exonNumber <- elementNROWS(rowRanges(airway[rownames(df),]))
df$MoreThan15Exons <- ifelse(exonNumber>15,
">15ex", "<15ex")
df[,1:4] <- log2(df[,1:4]+1)
GGally::ggpairs(df,
1:4,
lower = list(continuous = wrap(GGscatterPlot, method="pearson")),
upper = list(continuous = wrap(ggally_cor, alignPercent = 0.8),
mapping = ggplot2::aes(color = MoreThan15Exons))) +
theme_minimal() +
theme(panel.grid.minor = element_blank())
The GGscatterPlot
function can also be used with GGally::ggduo
to study the link between 2 sets of columns/samples. Here an example comparing control vs DEX samples:
GGally::ggduo(df,
c(1,3),
c(2,4),
types = list(continuous = GGscatterPlot)) +
theme_minimal() +
theme(panel.grid.minor = element_blank())