The human and mouse genome sequencing projects produced a "parts list" of mammalian genes and proteins. Only a relatively small number of these proteins have been studied in great detail, however, and for most of them we have little information about the biological role they play. Over the past decade, new experimental techniques and resources have become more widely available and affordable, enabling genome-wide measurements (e.g., gene expression microarrays, deep sequencing, and tandem mass spectrometry) that should shed light on cellular mechanisms, gene regulation, protein functions and ultimately human disease, even for poorly studied proteins. However, these raw data are currently translated into concrete knowledge much more slowly than they are generated. To help bridge this gap, we focus on developing novel algorithms and approaches for the analysis, exploration and visualization of these data.
Computationally Uncovering Protein Roles
Over 50% of genes present in humans and mice have no experimentally verified role in any pathway or process. However, high-throughput whole-genome measurement technologies (such as microarrays, RNA-seq, ChIP-seq, and mass spectrometry) have produced thousands of measurements of the activity levels of these genes under a variety of conditions. Despite this wealth of data, we still lack specific knowledge of the functions and processes performed by these genes. My lab is developing and applying a variety of computational data mining and machine learning approaches to form high-quality predictions of specific protein functions that we can then verify with follow-up laboratory work.
One of our target areas for protein function prediction is bone biology and osteoporosis. Using the mammalian phenotype information maintained by the Mouse Genome Informatics group (MGI) as a starting point, we have trained a suite of support vector machines (SVMs) to determine which genes are most likely to contribute to over 1000 diverse phenotypes. We selected our most confident predictions of proteins involved in bone density for experimental validation, and we have verified that mice carrying knockouts of these newly predicted genes have significant defects in bone density and morphology.
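The one-classifier-per-phenotype idea can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the gene names, the two-dimensional feature vectors (imagine standardized evidence scores derived from expression data), the placeholder second phenotype, and the choice of a linear SVM trained by Pegasos-style subgradient descent. The text above does not state which features, kernel, or training procedure the lab actually uses.

```python
def train_linear_svm(features, labels, lam=0.01, epochs=50):
    """Fit a linear SVM by Pegasos-style subgradient descent on the hinge loss.

    labels are +1 (gene annotated to the phenotype) or -1 (not annotated).
    A constant 1.0 is appended to every feature vector to act as a bias term.
    """
    rows = [list(x) + [1.0] for x in features]
    w = [0.0] * len(rows[0])
    t = 0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [wi * (1.0 - 1.0 / t) for wi in w]  # shrink step; eta * lam == 1 / t
            if margin < 1.0:  # hinge loss is active for this gene
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def decision_score(w, x):
    """Signed score; larger means more likely to contribute to the phenotype."""
    return sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))


# Hypothetical feature vectors for genes with known phenotype annotations.
gene_features = {
    "gene_p1": [2.0, 2.0], "gene_p2": [3.0, 2.0], "gene_p3": [2.0, 3.0],
    "gene_n1": [-2.0, -2.0], "gene_n2": [-3.0, -2.0], "gene_n3": [-2.0, -3.0],
}
# Hypothetical phenotype -> annotated genes (the real suite covers over 1000 phenotypes).
phenotype_annotations = {
    "abnormal bone density": {"gene_p1", "gene_p2", "gene_p3"},
    "placeholder phenotype": {"gene_n1", "gene_n2", "gene_n3"},
}
# Hypothetical unannotated genes to be ranked.
candidates = {"candidate_a": [2.5, 1.5], "candidate_b": [-1.5, -2.5]}

genes = sorted(gene_features)
X = [gene_features[g] for g in genes]
scores = {}
for phenotype, annotated in phenotype_annotations.items():
    y = [1 if g in annotated else -1 for g in genes]
    w = train_linear_svm(X, y)  # one classifier per phenotype
    scores[phenotype] = {c: decision_score(w, x) for c, x in candidates.items()}

# Rank candidates for each phenotype, most confident prediction first.
ranked = {p: sorted(s, key=s.get, reverse=True) for p, s in scores.items()}
```

In this toy setting, the top-ranked candidate for each phenotype would be the one carried forward to experimental validation.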
Additionally, we are exploring the incorporation of developmental-, tissue-, and cell type-specific information into our methods in order to identify not only the functions performed by genes, but also the times and locations at which these roles are performed. By combining the spatio-temporal expression information contained in The Jackson Laboratory's Gene eXpression Database (GXD) with the biological process annotations of MGI, we have created a new set of gold standard interactions that we are using to predict more specific functional relationships between genes.
These efforts are key to advancing the utility of computational methods beyond single-celled organisms so that they can be applied successfully to mammalian systems.
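One way to picture such a context-specific gold standard is sketched below. The rule used (two genes form a positive pair when they share a process annotation and are both expressed in the same tissue at the same stage) is an assumption chosen for illustration, as are all gene names, processes, tissues, and stage labels; the actual construction from GXD and MGI data is not described here.

```python
from itertools import combinations

# Hypothetical biological process annotations (gene -> set of processes).
process_annotations = {
    "gene_a": {"ossification"},
    "gene_b": {"ossification", "cell cycle"},
    "gene_c": {"ossification"},
    "gene_d": {"cell cycle"},
}
# Hypothetical expression contexts (gene -> set of (tissue, stage) pairs).
expression_contexts = {
    "gene_a": {("femur", "E15"), ("skull", "E15")},
    "gene_b": {("femur", "E15")},
    "gene_c": {("liver", "adult")},
    "gene_d": {("femur", "E15")},
}


def context_specific_pairs(annotations, contexts):
    """Map (process, tissue, stage) -> set of gene pairs sharing all three."""
    gold = {}
    for g1, g2 in combinations(sorted(annotations), 2):
        shared_processes = annotations[g1] & annotations[g2]
        shared_contexts = contexts.get(g1, set()) & contexts.get(g2, set())
        for process in shared_processes:
            for tissue, stage in shared_contexts:
                gold.setdefault((process, tissue, stage), set()).add((g1, g2))
    return gold


gold_standard = context_specific_pairs(process_annotations, expression_contexts)
```

Here gene_a and gene_c share the "ossification" annotation but never share an expression context, so they are not paired. A gold standard built from process annotations alone would have paired them.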
Data-driven Search Algorithms and Data Organization
High-throughput data sources provide a new level of information that is currently underutilized by many in the research community because it is difficult to quickly locate and easily interact with the data relevant to a given area of interest. A key impediment to understanding these data is that the scale of available data collections prevents any individual from examining them in their entirety. As such, we must provide the capability for individual researchers to quickly identify and interact with the existing data that are most relevant to their area of interest. While current repositories are cataloging and housing high-throughput data, the search interfaces they provide are generally limited to simple text-based searches through manually curated annotations.
Our work in this area is focused on a complementary data-driven search paradigm that dynamically identifies important patterns in the data itself to provide answers to more diverse questions. For example, if a researcher is interested in the targets of a particular transcription factor, our approach uses measured expression levels to identify which existing datasets elicited co-expression of these targets rather than relying solely on text-based annotations and curation of data. We originally developed this approach, called SPELL, for yeast gene expression data, but we are expanding our algorithmic approach and software to be applicable in any organism and to search across more diverse types of data.
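The transcription factor example can be sketched as a simplified query-driven search. This is a toy illustration in the spirit of the approach, not the published SPELL implementation. It weights each dataset by how strongly the query genes are co-expressed within it, then ranks the remaining genes by their weighted correlation to the query. The dataset names, gene names, and expression values are hypothetical, and the real method includes preprocessing and normalization omitted here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def dataset_weight(dataset, query):
    """Mean pairwise correlation among query genes in one dataset, floored at 0."""
    present = [g for g in query if g in dataset]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)


def search(datasets, query):
    """Return (datasets ranked by relevance, non-query genes ranked by score)."""
    weights = {name: dataset_weight(d, query) for name, d in datasets.items()}
    total = sum(weights.values())
    scores = {}
    if total > 0:
        candidates = {g for d in datasets.values() for g in d} - set(query)
        for gene in candidates:
            score = 0.0
            for name, d in datasets.items():
                if gene not in d or weights[name] == 0.0:
                    continue  # uninformative datasets contribute nothing
                qs = [q for q in query if q in d]
                score += weights[name] * sum(pearson(d[gene], d[q]) for q in qs) / len(qs)
            scores[gene] = score / total
    ranked_datasets = sorted(weights, key=weights.get, reverse=True)
    ranked_genes = sorted(scores, key=scores.get, reverse=True)
    return ranked_datasets, ranked_genes


# Hypothetical compendium: dataset name -> gene -> expression profile.
datasets = {
    "heat_shock": {
        "target_1": [1.0, 2.0, 3.0, 4.0], "target_2": [2.0, 4.0, 6.0, 8.5],
        "gene_x": [1.0, 2.1, 2.9, 4.2], "gene_y": [4.0, 3.0, 2.0, 1.0],
    },
    "nutrient_shift": {
        "target_1": [1.0, 3.0, 2.0, 4.0], "target_2": [4.0, 2.0, 3.0, 1.0],
        "gene_x": [2.0, 2.5, 1.0, 3.0], "gene_y": [1.0, 3.0, 2.0, 4.0],
    },
}
ranked_datasets, ranked_genes = search(datasets, ["target_1", "target_2"])
```

In this toy compendium the two query targets are co-expressed only in the "heat_shock" dataset, so that dataset alone drives the gene ranking, and no text annotation is consulted at any point.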
Large-Scale Data Visualization
One of the best ways for researchers to understand their data is to visually look for patterns within that data. However, the scale of genome-wide datasets prevents traditional methods and devices from fully displaying a single dataset, much less large collections of related datasets. We are developing techniques that utilize large-scale display devices as well as traditional displays in order to show researchers the information that they need to extract from their data. Further, we have developed approaches that incorporate statistical measures directly into visualization schemes that improve their effectiveness and accuracy. For example, traditional displays of heat maps or line graphs implicitly show a Euclidean distance relationship between profiles. However, when using a different distance metric, such as Pearson's correlation, it is often better to explicitly encode these distances into visualizations.
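A small numerical sketch shows why the choice of distance metric matters for what a display should emphasize. The three expression profiles are hypothetical; the two functions are the standard Euclidean distance and one minus Pearson's correlation.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Hypothetical expression profiles across five conditions.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [2.0, 4.0, 6.0, 4.0, 2.0]  # same shape as gene_a, twice the amplitude
gene_c = [1.5, 1.5, 2.5, 2.5, 1.0]  # similar magnitude to gene_a, different shape

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name,
          "euclidean=%.3f" % euclidean_distance(gene_a, other),
          "pearson=%.3f" % pearson_distance(gene_a, other))
```

By Euclidean distance gene_c is the closer neighbor of gene_a, while by Pearson distance gene_b is. A line graph or heat map drawn on raw values would visually group gene_a with gene_c even if the analysis clustered it with gene_b.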
In this area, we have created the bioHIDRA gene expression browser, which allows users to view microarray and RNA-seq expression datasets side-by-side in order to quickly find and evaluate common and divergent patterns of transcription. In the future we will expand these techniques to incorporate additional forms of high-throughput data. Our broad visualization goal is to provide users intuitive access to the wealth of available biological data so that they can easily make inferences and observations that are not possible with current approaches.
Research Scientist: Cheryl Ackert-Bicknell
Research Assistant: Kathy Shultz
Colony Manager: Dana Godfrey
Software Engineer: Al Simons
Post Doc: KB Choi
Post Doc: Tongjun Gu
Graduate Student: Karen Dowell
Co-Op Associate: Braden Kell
Intern: Adam Perruzzi
Intern: Catherine Sharp
Research Administrative Assistant: Annie McDonnell
Gu T, Buaas FW, Simons AK, Ackert-Bicknell CL, Braun RE, Hibbs MA. 2012. Canonical A-to-I and C-to-U RNA Editing Is Enriched at 3'UTRs and microRNA Target Sites in Multiple Mouse Tissues. PLoS One 7(3):e33720. PMCID: PMC3308996
Li Y, Hibbs MA, Gard AL, Shylo NA, Yun K. 2012. Genome-Wide Analysis of N1ICD/RBPJ Targets in vivo Reveals Direct Transcriptional Regulation of Wnt, SHH, and Hippo Pathway Effectors by Notch 1. Stem Cells. Jan 9 epub ahead of print.
Guan Y, Ackert-Bicknell CL, Kell B, Troyanskaya OG, Hibbs MA. 2010. Functional Genomics Complements Quantitative Genetics in Identifying Disease-Gene Associations. PLoS Comput Biol 6(11):e1000991 PMCID: PMC2978695
Baryshnikova A, Costanzo M, Kim Y, Ding H, Koh J, Toufighi K, Youn J, Ou J, San Luis B, Bandyopadhyay S, Hibbs MA, Hess D, Gingras A, Bader GD, Troyanskaya OG, Brown GW, Andrews B, Boone C, Myers CL. 2010. Quantitative analysis of fitness and genetic interactions in yeast on a genome scale. Nature Methods 7(12):1017-1024. PMCID: PMC3117325
Gehlenborg N, O'Donoghue SI, Baliga NS, Goesmann A, Hibbs MA, Kitano H, Kohlbacher O, Neuweger H, Schneider R, Tenenbaum D, Gavin AC. 2010. Visualization of omics data for systems biology. Nat Methods 7(3 Suppl):S56-68.
Hess DC, Myers CL, Huttenhower C, Hibbs MA, Hayes AP, Paw J, Clore JJ, Mendoza RM, Luis BS, Nislow C, Giaever G, Costanzo M, Troyanskaya OG, Caudy AA. 2009. Computationally driven, quantitative experiments discover genes required for mitochondrial biogenesis. PLoS Genet 5(3):e1000407. PMC2648979
Hibbs MA. 2009. The Effects of Pre-processing and Parameter Choices on Searches Through Large Gene Expression Data Collections. IEEE Int Conf on Genomic Signal Processing and Statistics (GENSiPs).
Hibbs MA, Myers CL, Huttenhower C, Hess DC, Li K, Caudy AA, Troyanskaya OG. 2009. Directing experimental biology: a case study in mitochondrial biogenesis. PLoS Comput Biol 5(3):e1000322. PMC2654405
Huttenhower C, Haley EM, Hibbs MA, Dumeaux V, Barrett DR, Coller HA, Troyanskaya OG. 2009. Exploring the human genome with functional maps. Genome Res 19(6):1093-1106. PMC2694471
Huttenhower C, Hibbs MA, Myers CL, Caudy AA, Hess DC, Troyanskaya OG. 2009. The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction. Bioinformatics Epub ahead of print.
Haarer B, Viggiano S, Hibbs MA, Troyanskaya OG, Amberg DC. 2007. Modeling complex genetic interactions in a simple eukaryotic genome: actin displays a rich spectrum of complex haploinsufficiencies. Genes Dev 21(2):148-159. PMC1770898
Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. 2007. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics 23(20):2692-2699.
Hibbs MA, Wallace G, Dunham M, Li K, Troyanskaya OG. 2007. Viewing the Larger Context of Genomic Data through Horizontal Integration. Proceedings of IEEE-CS 11th Int. Conf. on Information Visualization (IV'07) 326-334.
Huttenhower C, Flamholz AI, Landis JN, Sahi S, Myers CL, Olszewski KL, Hibbs MA, Siemers NO, Troyanskaya OG, Coller HA. 2007. Nearest Neighbor Networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics 8:250. PMC1941745
Wallace G, Hibbs MA, Dunham M, Sealfon RSG, Troyanskaya OG, Li K. 2007. Scalable, Dynamic Analysis and Visualization for Genomic Datasets. Proceedings of IPDPS 2007 Workshop on Next Generation Software.
Huttenhower C, Hibbs MA, Myers CL, Troyanskaya OG. 2006. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22(23):2890-2897.
Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG. 2006. Finding function: evaluation methods for functional genomic data. BMC Genomics 7:187. PMC1560386
Sealfon RS, Hibbs MA, Huttenhower C, Myers CL, Troyanskaya OG. 2006. GOLEM: an interactive graph-based gene-ontology navigation and analysis tool. BMC Bioinformatics 7:443. PMC1618863
Hibbs MA, Dirksen NC, Li K, Troyanskaya OG. 2005. Visualization methods for statistical analysis of microarray clusters. BMC Bioinformatics 6:115. PMC1156867
Li K, Hibbs MA, Wallace G, Troyanskaya OG. 2005. Dynamic Scalable Visualization for Collaborative Scientific Applications. Proceedings of IPDPS 2005 Workshop on Next Generation Software.
Myers CL, Robson D, Wible A, Hibbs MA, Chiriac C, Theesfeld CL, Dolinski K, Troyanskaya OG. 2005. Discovery of biological networks from diverse functional genomic data. Genome Biol 6(13):R114. PMC1414113
Wallace G, Anshus OJ, Bi P, Chen H, Chen Y, Clark D, Cook P, Finkelstein A, Funkhouser T, Gupta A, Hibbs M, Li K, Liu Z, Samanta R, Sukthankar R, Troyanskaya O. 2005. Tools and applications for large-scale display walls. IEEE Comput Graph Appl 25(4):24-33.