Development of Next-Gen Human Transcriptome Array

Introduction

The Glue Grant Human Transcriptome Array (GG-H) is the result of a collaboration among the Stanford Genome Technology Center, Wing Wong’s lab at Stanford, Affymetrix Inc. and the Inflammation and Host Response to Injury program (“Glue Grant”). The array has been comprehensively designed to interrogate various aspects of the transcriptome, including gene expression, alternative splicing, detection of coding SNPs and non-coding transcription. With a protocol tailored to work efficiently with small amounts of total RNA, the array provides a high-throughput, low-cost platform for clinical genomic studies.

Affymetrix is expected to make the GG-H array available commercially in January 2013. The commercial version of the GG-H array is named the Human Transcriptome Array (HTA).

Array Components and Probe Design

Various components of the array and their probe design strategies are summarized in the following table and illustrated in the figure.

| Array Component | Number of Targets | Number of Probes | Design |
|---|---|---|---|
| Gene exons | 315,123 | 3,292,929 | On average ten probes per exon (~119 probes per gene), selected for high thermodynamic scores, uniqueness and even spread across the target |
| Exon-exon junctions | 260,488 | 1,060,703 | Four probes per junction at (-3, -1, +1, +3) relative to the splice site |
| Coding SNPs and DMET variations | 89,782 | 982,941 | Six probes per allele at the -4, 0 and +4 positions on each of the two strands relative to the SNP |
| Non-coding functional RNA (f-ncRNA) | 730 | 5,869 | Ten probes per ncRNA, selected for high thermodynamic scores, uniqueness and even spread across the target |
| Non-coding antisense expression (as-ncRNA) | 50,783 | 563,097 | Probes selected at a density of one probe per 50 bp of UTR, with a minimum of six probes per region |
| Un-annotated transcribed units (UTU) | 49,957 | 488,581 | Ten probes per UTU, selected for high thermodynamic scores, uniqueness and even spread across the target |
| Other probes including controls | | 498,840 | Designed for quality control of the assay, background modeling, estimation of cross-hybridization, and monitoring of ribosomal RNA |
| Total | | 6,892,960 | |
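As a quick consistency check on the table above, the per-component probe counts can be summed and compared with the stated total. The sketch below only restates the table's numbers in Python; the dictionary keys and the helper name `probes_per_target` are illustrative and are not part of any GG-H library file.

```python
# (targets, probes) copied from the array component table above.
# Keys and function names are illustrative only.
components = {
    "Gene exons": (315_123, 3_292_929),
    "Exon-exon junctions": (260_488, 1_060_703),
    "Coding SNPs and DMET variations": (89_782, 982_941),
    "f-ncRNA": (730, 5_869),
    "as-ncRNA": (50_783, 563_097),
    "UTU": (49_957, 488_581),
}
# "Other probes including controls": the table gives no target count.
control_probes = 498_840

total_probes = sum(probes for _, probes in components.values()) + control_probes


def probes_per_target(name):
    """Average number of probes per target for one array component."""
    targets, probes = components[name]
    return probes / targets


print(f"Total probes: {total_probes:,}")
for name in components:
    print(f"{name}: {probes_per_target(name):.1f} probes per target")
```

The computed total should agree with the "Total" row of the table, and the per-target averages give a rough check against the design column (for example, about four probes per junction).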

 

[Figure: array scheme]

 

Library Files, Annotation and Database

To support different kinds of analyses using the GG-H array, we have developed a set of library and annotation files. The most important ones are summarized in the following table. In addition, a comprehensive database (http://gluegrant1.stanford.edu/~DIC/db) is available for querying array design and annotation information. Users can use the database to generate customized library and annotation files.

 

| File Name | File Type | Description | Download |
|---|---|---|---|
| hGlue2_0.r1.clf | CEL Layout File (CLF) | Together with the PGF, the CLF makes up the core chip layout information for the array. The CLF contains the mapping of probe IDs to x/y positions in the CEL file. | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.pgf | Probe Grouping File (PGF) | Together with the CLF, the PGF makes up the core chip layout information for the array. The PGF groups specific probes (by probe ID) into probesets. | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.antigenomic.bgp | BackGround Probes (BGP) | The BGP file lists which probes (by probe ID) are to be used in various background correction methods (e.g. the GCBG method). | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.qcc | Quality Control Content (QCC) | The QCC file lists probes serving various quality control purposes. | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.pgf.tbl | Tab-delimited | Used by the GlueQC package for the quality control summary. | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.PSR.ps | Probeset List (PS) | The PS file lists probeset IDs for Probe Selection Regions (PSRs). | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.TC.mps | Meta Probeset List (MPS) | The MPS file is used to group individual PSR (exon) level probesets into Transcript Cluster (gene) level meta probesets. | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.TC_Annot.csv | Gene Annotation File | The annotation file links each transcript cluster (gene) to chromosomal position, gene information, functional annotation (gene ontology and pathway) and other information in public databases. | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.ASS | Alternative Splicing Structure (ASS) | The ASS file provides the alternative splicing structure based on design-time knowledge. It describes how exons and junctions are connected in a transcript cluster. | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.Probe.BED | BED File | Genome coordinate file for probes on hg18 | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.PSR.BED | BED File | Genome coordinate file for Probe Selection Regions (PSRs) on hg18 | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.TC.BED | BED File | Genome coordinate file for Transcript Clusters (TCs) on hg18 | hGlue2_0.r1.core.tar.gz |
| hGlue2_0.r1.gene info Gene Ontology.xls | dChip Library File | Gene ontology frequency summary for GG-H genes | hGlue2_0.r1.dChip.tar.gz |
| hGlue2_0.r1.gene info.xls | dChip Library File | Gene annotation information for GG-H genes | hGlue2_0.r1.dChip.tar.gz |
| hGlue2_0.r1.genome info.xls | dChip Library File | Genome coordinate information for GG-H genes | hGlue2_0.r1.dChip.tar.gz |
| component.ontology; function.ontology; process.ontology | dChip Library File | Cellular component, molecular function and biological process ontology mappings for GG-H genes | hGlue2_0.r1.dChip.tar.gz |
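To illustrate how the BED coordinate files listed above might be inspected, here is a minimal Python sketch that counts features per chromosome. It assumes the files follow the standard BED layout (chromosome, 0-based start, exclusive end, then optional columns); the exact columns in the GG-H BED files are not described here, and the function name and example records are hypothetical.

```python
from collections import Counter


def count_features_by_chrom(lines):
    """Count BED features and total covered bases per chromosome.

    Assumes standard BED: chrom, 0-based start, exclusive end, optional columns.
    """
    counts = Counter()
    total_bp = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        counts[chrom] += 1
        total_bp[chrom] += end - start
    return counts, total_bp


# Made-up records for illustration; these are not real GG-H probes.
example = [
    "track name=example",
    "chr1\t1000\t1025\tprobe_a",
    "chr1\t2000\t2025\tprobe_b",
    "chr2\t500\t525\tprobe_c",
]
counts, total_bp = count_features_by_chrom(example)
for chrom in sorted(counts):
    print(chrom, counts[chrom], total_bp[chrom])
```

With a real file, the same function can be given an open file handle, for example `with open("hGlue2_0.r1.Probe.BED") as fh: count_features_by_chrom(fh)`.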

 

Analysis Pipeline and Software

To support routine analyses of the GG-H array, we have established a basic pipeline for quality control, calculation of expression indices and detection of alternative splicing. For other components of the array, the analysis methods are still exploratory and highly customized.

| Analysis | Software | Description | Download |
|---|---|---|---|
| Quality control | GlueQC (requires APT and R Bioconductor) | Assess array quality through exploratory plots and summary statistics | GlueQC website |
| Expression indices calculation | Affymetrix Power Tools (APT); JETTA | Background correction, normalization and calculation of exon or gene expression matrices | APT website |
| Detection of alternative splicing | Junction and Exon array Toolkits for Transcriptome Analysis (JETTA) | Detection of alternatively spliced exons with or without supporting junctions | JETTA website |
| High-level exploratory analysis | dChip | Clustering of gene expression and enrichment analysis of ontologies, pathways and genome locations | dChip website |
| Visualization | UCSC genome browser | Visualize probes, exons and genes on the genome browser | UCSC genome browser |

 

1. Quality control

Ensuring high data quality is crucial to genomic studies. GlueQC starts with CEL files and checks several quality scores to filter out outliers. Quality statistics include probe-level foreground and background signal, area under the curve (AUC) using Norm Exons and Norm Introns as positive and negative controls respectively, probeset presence calls, and between-array correlation at both the exon and gene level.
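The AUC statistic mentioned above can be read as the probability that a randomly chosen positive-control signal exceeds a randomly chosen negative-control signal. The Python sketch below shows that standard pairwise (Mann-Whitney) definition; it is not GlueQC's actual implementation, which is not described here, and the function name and signal values are made up for illustration.

```python
def control_auc(positive, negative):
    """AUC for separating positive-control from negative-control signals.

    Computed as the fraction of (positive, negative) pairs in which the
    positive signal is higher, counting ties as half a win.
    """
    if not positive or not negative:
        raise ValueError("both control sets must be non-empty")
    wins = 0.0
    for p in positive:
        for n in negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive) * len(negative))


# Hypothetical log-scale signals, not taken from any real array.
exon_signals = [8.1, 7.4, 9.0, 6.8]    # stand-in for Norm Exon probesets
intron_signals = [4.2, 5.1, 6.9, 3.8]  # stand-in for Norm Intron probesets
print(control_auc(exon_signals, intron_signals))
```

A value near 1 means the positive controls are well separated from the negative controls, while a value near 0.5 means they are indistinguishable. The pairwise loop is quadratic, so a rank-based calculation would be more suitable for full-size control sets.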

 

To run the script:

Rscript GlueQC.R celpath=CEL_PATH outpath=OUTPUT_PATH libpath=LIB_PATH

 

2. Expression indices calculation

Low-level analysis of microarray data includes background correction, normalization and calculation of exon/gene expression indices. Here we show examples of low-level analyses using APT.

To calculate gene-level expression using APT rma-sketch:

apt-probeset-summarize -a rma-sketch -c hGlue2_0.r1.clf -p hGlue2_0.r1.pgf -b hGlue2_0.r1.antigenomic.bgp -m hGlue2_0.r1.TC.mps -o gene_expr *.CEL

 

To calculate exon-level expression using APT rma-sketch:

apt-probeset-summarize -a rma-sketch -c hGlue2_0.r1.clf -p hGlue2_0.r1.pgf -b hGlue2_0.r1.antigenomic.bgp -s hGlue2_0.r1.PSR.ps -o exon_expr *.CEL
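The two commands above differ only in whether a meta probeset file (-m, gene level) or a probeset list (-s, exon level) is supplied. For scripted runs, this can be captured in a small Python helper that assembles the argument list. This is a sketch, not an official wrapper: the function name `build_summarize_command` and its parameters are hypothetical, and it uses only the flags shown in the commands above.

```python
def build_summarize_command(level, cel_files, out_dir, prefix="hGlue2_0.r1"):
    """Assemble an apt-probeset-summarize argument list for rma-sketch.

    level: "gene" uses the TC meta probeset file (-m);
           "exon" uses the PSR probeset list (-s).
    """
    cmd = [
        "apt-probeset-summarize",
        "-a", "rma-sketch",
        "-c", f"{prefix}.clf",
        "-p", f"{prefix}.pgf",
        "-b", f"{prefix}.antigenomic.bgp",
    ]
    if level == "gene":
        cmd += ["-m", f"{prefix}.TC.mps"]
    elif level == "exon":
        cmd += ["-s", f"{prefix}.PSR.ps"]
    else:
        raise ValueError("level must be 'gene' or 'exon'")
    return cmd + ["-o", out_dir] + list(cel_files)


# Placeholder CEL file names for illustration.
gene_cmd = build_summarize_command("gene", ["a.CEL", "b.CEL"], "gene_expr")
exon_cmd = build_summarize_command("exon", ["a.CEL", "b.CEL"], "exon_expr")
print(" ".join(gene_cmd))
print(" ".join(exon_cmd))
```

The resulting list can be passed to `subprocess.run(gene_cmd, check=True)` on a machine where APT is installed and the library files are in the working directory.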

 

JETTA is also capable of performing low-level analyses. Please refer to its dedicated website for instructions (JETTA website).

 

3. Alternative splicing using JETTA

With the addition of junction probes, GG-H can improve the accuracy of alternative splicing detection. To incorporate junctions into alternative splicing analysis, we have developed Junction and Exon array Toolkits for Transcriptome Analysis (JETTA), an integrated software tool for calculating expression indices and analyzing alternative splicing. Please refer to its dedicated website for instructions (JETTA website).

 

4. High-level exploratory analysis using dChip

Biologists are often interested in clustering and functional enrichment analysis at the gene level. For this purpose, we provide a set of library files to support these kinds of analyses in dChip. Please refer to the dChip website for instructions on how to run dChip (dChip website).

 

Protocol

The GG-H protocol is based on Ambion Inc./Applied Biosystems (cat# 4411974) and has been specially modified to work efficiently with small amounts of starting material. It uses two rounds of single-strand cDNA synthesis to amplify mRNA and Affymetrix GeneChip WT terminal labeling technology to label fragmented cDNA for hybridization. The detailed protocol can be found here (GG-H protocol).

 

Availability

The array platform has been deposited in NCBI GEO under GPL11319. An example data set is accessible at GSE26072 (and GSE26109 for the RNA-Seq data used in the paper).

 The GG-H array can be ordered from Affymetrix as a custom array. For more information, please contact ron email or xiao email.

 

Reference 

Xu W, Seok J, Mindrinos MN, Schweitzer AC, Jiang H, Wilhelmy J, Clark TA, Kapur K, Xing Y, Faham M, Storey JD, Moldawer LL, Maier RV, Tompkins RG, Wong WH, Davis RW, Xiao W; Inflammation and Host Response to Injury Large-Scale Collaborative Research Program. Human transcriptome array for high-throughput clinical studies. Proc Natl Acad Sci U S A. 2011 Mar 1;108(9):3707-12. doi: 10.1073/pnas.1019753108. Epub 2011 Feb 11.

 

Questions and Comments

For questions and comments, please join our discussion group at http://groups.google.com/group/GGHarray.

 

Last modified 12/22/2012. Webmaster: weihongxATstanfordDOTedu

 
