For Novices: the BigPint package for RNA-seq differential expression analysis

A paper published in 2019 introduced a set of new RNA-seq visualization tools, neatly wrapped up as the "BigPint" package in R. With it, one can detect normalization issues and differential expression designation problems, and identify genes of interest. The authors also introduced new interactive plotting methods for RNA-seq data, which I find especially appealing.

##### A brief introduction to the package

The BigPint package contains multivariate graphical tools, namely parallel coordinate plots, scatterplot matrices, and litre plots.

a) Parallel coordinates plots

This one is pretty straightforward: the gene expression levels are shown on the y-axis, and the x-axis consists of the samples from each group. The BigPint function overlays the parallel coordinate plot on side-by-side boxplots. The combination shows the expression patterns of the genes and, at the same time, lets us check that those patterns are being presented on normalised data.

The example graph shows two clusters of genes, and each cluster has the opposite expression pattern between the two groups of samples.
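
To make the idea concrete, here is a minimal base R sketch of gene lines drawn over side-by-side boxplots. It is not BigPint's own plotting code: the expression values are simulated, and the sample names (`A.1` to `B.3`) and object names are made up for illustration.

```r
# Simulated, already standardised expression values: 200 genes x 6 samples.
set.seed(1)
samples <- c("A.1", "A.2", "A.3", "B.1", "B.2", "B.3")
expr <- matrix(rnorm(200 * 6), nrow = 200,
               dimnames = list(paste0("gene", 1:200), samples))

# Make the first five genes higher in group B so their lines stand out.
highlight <- 1:5
expr[highlight, 4:6] <- expr[highlight, 4:6] + 3

# Side-by-side boxplots show the distribution of every sample ...
boxplot(expr, xlab = "Sample", ylab = "Standardised expression")

# ... and each highlighted gene is drawn as one line across the samples.
matlines(x = seq_along(samples), y = t(expr[highlight, ]),
         lty = 1, col = "orange")
```

If the boxes sit at roughly the same height, the samples are comparable, and any line that jumps between the two groups is a candidate gene to look at more closely.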


b) Scatterplot matrices

Each dot represents one gene, that is, one row of the data. A good scatterplot matrix should show larger variability (dots spreading away from the x = y line) between treatment groups than between replicates. We can also spot outliers or unexpected patterns, which are good candidates for genes of interest (DEGs).
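
One rough way to put a number on "spread around the x = y line" is the mean absolute difference between two samples. The sketch below is only an illustration of what to look for in the scatterplot matrix, not part of BigPint: the log counts are simulated, and the sample names and the `spread` helper are hypothetical.

```r
set.seed(42)
n_genes <- 1000
base <- rnorm(n_genes, mean = 5, sd = 2)   # shared log-expression level

# Treatment effect: 10% of the genes shift by +/- 2 in group B.
effect <- numeric(n_genes)
de <- sample(n_genes, 100)
effect[de] <- sample(c(-2, 2), 100, replace = TRUE)

log_counts <- data.frame(
  A.1 = base + rnorm(n_genes, sd = 0.3),
  A.2 = base + rnorm(n_genes, sd = 0.3),
  B.1 = base + effect + rnorm(n_genes, sd = 0.3),
  B.2 = base + effect + rnorm(n_genes, sd = 0.3)
)

# Mean absolute difference between two samples, i.e. spread around x = y.
spread <- function(x, y) mean(abs(x - y))

within_rep  <- spread(log_counts$A.1, log_counts$A.2)  # replicate vs replicate
between_trt <- spread(log_counts$A.1, log_counts$B.1)  # treatment vs treatment

# Base R scatterplot matrix of all four samples.
pairs(log_counts, pch = ".")
```

In a healthy dataset `between_trt` should come out larger than `within_rep`. If it is the other way round, a replicate may be mislabelled or handled differently.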

One example shown by the authors is a published RNA-seq dataset on iron metabolism in soybean. By observing an unusual pattern in one of the three treatment samples (P.3), they discovered that it may have been due to timing differences in replicate handling.


c) Litre plot

It produces a graph of hexagon bins that looks like a beehive. In brief, it is a reduced-dimension version of the scatterplot, which relieves the rendering burden in an interactive graph or when using bigger data. We can highlight the ten most significant genes (in the tutorial, they show the gene IDs with the lowest false discovery rate (FDR)).
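
Picking the genes to highlight comes down to sorting the metrics table. Here is a minimal sketch, assuming a metrics data frame with `ID`, `PValue` and `FDR` columns. The values are simulated and `metrics` is a made-up name, so treat it as an illustration rather than the tutorial's code.

```r
set.seed(7)
metrics <- data.frame(
  ID = paste0("gene", 1:500),
  logFC = rnorm(500),
  PValue = runif(500),
  stringsAsFactors = FALSE
)

# The Benjamini-Hochberg adjustment of the p-values gives the FDR.
metrics$FDR <- p.adjust(metrics$PValue, method = "BH")

# Sort by FDR (ties broken by p-value) and keep the ten most significant IDs.
top10 <- head(metrics[order(metrics$FDR, metrics$PValue), "ID"], 10)
top10
```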


###### A few highlights from the R tutorial
The R tutorial provided by the authors can be accessed from the link. I recommend running the tutorial datasets before adapting the code to your own data. That is what I did: I inspected how the data changed at every step, so that I could predict what was going on and make the necessary modifications for my own dataset. I am sure the tutorial is good enough as an exercise for most RNA-seq scientists.

However, if you are a newbie R user like me, I would like to add some key points that helped me understand the analysis better.

1. There are two important prerequisites before the data visualisation step: the data (a file with gene IDs and read counts) and the data metrics (the BigPint tutorial has guidance on how to create them). Optionally, you can combine the data and data metrics into a single matrix-like SummarizedExperiment object. You may get more insight into SummarizedExperiment from this link.

2. The analysis is performed on pairs of groups, each with at least two replicates. If you have more than two groups to test, I recommend making subsets of the pairs before you start reshaping the data (into data metrics and the subsequent standardisation). It makes life a lot easier when adapting their code. That said, the tutorial provided by Lindsay Rutter does include sample scripts for three treatment groups.

3. The tutorial shows an example of creating the data metrics with the edgeR package. Other models/packages you may like to use are DESeq2 and limma. This step produces the differential expression values: log fold change, p-value and FDR. We need a clear understanding of the model that is applied to our dataset.

4. Before plotting any graph, it is better to normalise the data and then standardise it, so that each gene has a mean of 0 and a standard deviation of 1. You can check on the boxplots whether the median values of all the samples line up. We can apply some preprocessing techniques to the read counts to obtain a normalized and standardized version of the dataset.

The median lines are at different heights (right, before normalization), whereas they line up in a fairly straight line (left, after normalization; ignore the orange gene-level lines).

5. One last important note: before plotting the clustering graphs, it is a good idea to work out a sensible number of clusters yourself. This can be done with a well-known clustering method, such as hierarchical clustering or k-means, coupled with the elbow or silhouette method. The default number of clusters (nC) for plotClusters is 4.

6. Interactive plots allow us to find the dots, in each individual sample, that deviate the most from the expected pattern. They can be produced with just two lines of code.

7. Parallel coordinate plots should be ordered by some metric so that similar variables sit next to each other, especially when scaling to larger datasets.
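
Points 2, 4 and 5 can be sketched in base R roughly as follows. This is a minimal illustration on simulated counts, not the tutorial's code: the sample names (`A.1`, `B.1`, `C.1`, ...), the `log(count + 1)` transform, the Ward clustering and the elbow heuristic are all assumptions chosen for the example.

```r
set.seed(123)

# Simulated read counts: 300 genes, three groups (A, B, C), three replicates.
groups  <- rep(c("A", "B", "C"), each = 3)
samples <- paste(groups, rep(1:3, times = 3), sep = ".")
counts  <- matrix(rnbinom(300 * 9, mu = 100, size = 5), nrow = 300,
                  dimnames = list(NULL, samples))
counts[1:50, 1:3]   <- counts[1:50, 1:3] * 8     # genes higher in A
counts[51:100, 4:6] <- counts[51:100, 4:6] * 8   # genes higher in B
data <- data.frame(ID = paste0("gene", 1:300), counts,
                   stringsAsFactors = FALSE)

# Point 2: keep the ID column plus one pair of groups (here A vs B).
keep    <- colnames(data) == "ID" | grepl("^(A|B)\\.", colnames(data))
data_ab <- data[, keep]

# Point 4: log-transform, then standardise each gene (row) to mean 0, sd 1.
logged  <- log(as.matrix(data_ab[, -1]) + 1)
data_st <- t(apply(logged, 1, scale))
colnames(data_st) <- colnames(logged)
# Genes that are constant across samples give NaN after scaling; drop them.
data_st <- data_st[complete.cases(data_st), , drop = FALSE]

# Boxplots of the log counts: the boxes should sit at similar heights.
boxplot(logged, xlab = "Sample", ylab = "log(count + 1)")

# Point 5: hierarchical clustering, then the total within-cluster sum of
# squares for k = 1..8 clusters (the "elbow" curve).
tree <- hclust(dist(data_st), method = "ward.D")
wss <- sapply(1:8, function(k) {
  cl <- cutree(tree, k = k)
  sum(sapply(split(as.data.frame(data_st), cl), function(g) {
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }))
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
```

The number of clusters where the curve stops dropping steeply is a reasonable first guess, and that value is what you would then supply as nC instead of relying on the default of 4. The silhouette method is an alternative, but it needs an extra package, so it is not shown here.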

###### Significance of the BigPint package

These new visualisation tools (scatterplot matrices, parallel coordinate plots and litre plots) help users check for normalization problems, catch common errors in analysis pipelines, and confirm that the variation between replicates and treatments is as expected. Besides that, we can quickly explore DEG lists (from whichever model we chose: edgeR, DESeq2, ...) and discover genes of interest through visual geometric patterns that would otherwise remain undiscovered by the models alone.

Reference
Rutter, L., Lauter, A. N. M., Graham, M. A., & Cook, D. (2019). Visualization methods for differential expression analysis. BMC Bioinformatics, 20(1), 458.
