Clinical genomic studies often require profiling the human transcriptome in a large number of patients. The RNA samples typically come from small amounts of tissue or blood and, when drawn from clinical archives of formalin-fixed, paraffin-embedded (FFPE) specimens, are often partially degraded, which presents challenges for transcriptome analysis. For example, while RNA-seq can simultaneously discover new transcriptome elements and quantify their expression levels, it requires a large amount of total RNA (typically >1 µg) for ribosomal RNA (rRNA) removal and, even after rRNA removal using commercially available kits, only ~10-20% of the sequencing reads from FFPE samples can be mapped to exonic regions. This makes RNA-seq currently prohibitive for analyzing >1 billion archived tissue samples in large-scale disease studies. In addition, novel mRNA transcripts continue to be discovered when RNA-seq is carried out to sufficient depth, and other transcriptome elements, e.g. alternative splicing events and non-coding transcripts, have been shown to be important causes and markers of disease. Here we propose to develop tools and associated protocols for high-throughput, cost-effective and reproducible profiling of these elements from patient samples in large-scale clinical studies.
A) Design, test and utilize a 4 µm array technology from Affymetrix that synthesizes 11 million different features on a single array (compared with <7 million features on current arrays).
The new array design will cover all candidate transcripts, exons, and junctions in the current mRNA transcript databases and from RNA-seq analysis of tissue panels (such as BodyMap2), as well as coding SNPs and non-coding transcripts. As an example, the summary of a test design is shown in Table 1. In addition, we plan to include probes against viral species that are known to be relevant to human health yet are not routinely monitored in patient studies. The array will utilize an improved synthesis that increases the percentage of full-length 25mer probes and enables us to test the performance of longer 35mer oligo probes.
Table 1. Summary of candidate targets for the array, obtained by analyzing a panel of 16 tissues sequenced at ~160M reads per sample (BodyMap2, ArrayExpress: E-MTAB-513). 95% of the sequencing reads fall within the targets of the GG-H array. At this read depth, 15-25% of genes and 30-55% of exons in each tissue have fewer than 20 reads per gene or exon. The BodyMap2 data were mapped to the current mRNA transcript databases (RefSeq, UCSC, Ensembl, Vega) and, in parallel, used for de novo identification of new transcripts, exons and junctions. The RNA-seq data contributed 5-7% new candidate transcriptome elements.
B) Optimize the array performance by probe selection and calibration.
First, we will use human genomic DNA to eliminate non-performing probes (no signal or saturated signal). Second, using the human transcripts in the Mammalian Gene Collection, we will hybridize the array with different ‘cocktails’ of these transcripts at different concentrations (e.g. Latin Square design, titration mixtures). We will then select the probes whose signals correlate well with the known, true levels of the transcripts across the multiple cocktails, and calibrate the calculated values of gene/exon expression accordingly.
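The two selection steps above can be sketched as follows. This is a minimal illustration only, not the proposal's actual pipeline: the function names, the intensity thresholds (`floor`, `ceiling`), the correlation cutoff (`min_r`) and the use of Pearson correlation on log2 values are all assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def drop_nonperforming(gdna_signal, floor=50.0, ceiling=60000.0):
    """Step 1: keep probes whose genomic DNA signal is neither absent nor saturated.

    gdna_signal maps probe ID -> intensity; floor and ceiling are assumed cutoffs.
    """
    return {p for p, s in gdna_signal.items() if floor <= s < ceiling}


def select_by_correlation(cocktail_signal, known_conc, candidates, min_r=0.9):
    """Step 2: keep candidate probes that track the known transcript levels.

    cocktail_signal maps probe ID -> list of intensities, one per cocktail;
    known_conc lists the true concentration of the target in each cocktail.
    Returns probe ID -> correlation for the probes that pass min_r (assumed cutoff).
    """
    log_conc = [math.log2(c) for c in known_conc]
    kept = {}
    for probe in candidates:
        log_sig = [math.log2(v + 1.0) for v in cocktail_signal[probe]]
        r = pearson(log_sig, log_conc)
        if r >= min_r:
            kept[probe] = r
    return kept


# Toy data (hypothetical): four probes, four cocktails in a two-fold titration.
gdna = {"p1": 800.0, "p2": 12.0, "p3": 65000.0, "p4": 900.0}
cocktails = {
    "p1": [100.0, 200.0, 400.0, 800.0],   # tracks the titration
    "p2": [10.0, 11.0, 9.0, 10.0],        # no signal on genomic DNA
    "p3": [64000.0, 64500.0, 65000.0, 65000.0],  # saturated on genomic DNA
    "p4": [500.0, 480.0, 510.0, 490.0],   # flat response
}
conc = [1.0, 2.0, 4.0, 8.0]

passing = drop_nonperforming(gdna)
selected = select_by_correlation(cocktails, conc, passing)
print(sorted(selected))
```

In this toy example only the probe that responds to the titration survives both steps; the per-probe correlations returned by the second step could then feed the calibration of gene/exon expression values.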
We will benchmark the calibrated array using the BodyMap2 tissue panel and compare its performance with the available RNA-seq data (ArrayExpress: E-MTAB-513). As spike-in controls, we will use External RNA Controls (ERCC, Ambion Inc) and a new set of synthetic gene transcripts from Affymetrix, which includes two isoforms per ‘gene’ and has no sequence homology to the human genome, and we will evaluate specificity and sensitivity at both the gene and exon level.
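As an illustration of how sensitivity and specificity could be scored from spike-in controls, the sketch below compares detection calls against the known presence or absence of each spiked target. The function name and the dictionary inputs are hypothetical, and how a detection call is made (for example, a signal threshold) is left as an assumption outside the example.

```python
def sensitivity_specificity(calls, truth):
    """Score detection calls against known spike-in status.

    truth maps target ID -> True if the target was spiked in, False otherwise.
    calls maps target ID -> True if the array called it detected; targets
    missing from calls are treated as not detected.
    Returns (sensitivity, specificity); a value is None if undefined.
    """
    tp = fn = tn = fp = 0
    for target, present in truth.items():
        detected = calls.get(target, False)
        if present and detected:
            tp += 1
        elif present:
            fn += 1
        elif detected:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Toy example (hypothetical IDs): two spiked-in targets, two absent targets.
truth = {"spike_A": True, "spike_B": True, "absent_C": False, "absent_D": False}
calls = {"spike_A": True, "spike_B": False, "absent_C": False, "absent_D": True}
print(sensitivity_specificity(calls, truth))
```

The same function can be applied separately to gene-level and exon-level targets by passing the corresponding sets of IDs.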
C) Establish reliable protocols for processing clinical samples, including FFPE specimens.
In the current grant period, we developed a robust protocol for processing <50 ng of total RNA from clinical samples of fresh tissue and blood. We will first test the performance of this protocol and others on the new array and, if successful, further test them on FFPE samples (vs fresh frozen). Here we will utilize three potential advantages of the array: 1) results from current whole transcriptome (WT) protocols have shown that the array is intrinsically ‘resistant’ to rRNA because of probe specificity; 2) the high-density tiling design (>10 probes per exon and 8 probes per junction) allows selection of probes that have higher reproducibility and signal/noise, as well as a larger response range; and 3) the improved synthesis of full-length probes and the longer 35mer probes are likely to improve hybridization kinetics and sensitivity. We will also explore the possibility of 1) selecting a subset of the best-performing probes that can be manufactured in a ‘96-well’ plate format for automated gene-level analysis, and 2) reducing the hybridization time by taking advantage of the longer oligo probes, thereby shortening the assay time, which is often desirable for clinical testing.