Cambridge Healthtech Institute’s Third Annual

Genomics & Sequencing Data Integration, Analysis and Visualization

Deriving Insights and Relationships from Big Data Sets to Advance Research and Patient Health

March 10-11, 2016 | Hilton San Francisco Union Square | San Francisco, CA
Part of the 23rd International Molecular Medicine Tri-Conference


The eruption of high-throughput genomic and proteomic technologies over the last two decades has motivated the development of tools and methodologies to transfer and integrate data into large-scale bioinformatics database platforms and repositories. The surge of biological data being collected has increased the need for standardized workflows, integrated solutions, economies of scale in the cloud, security and compliance in the cloud (especially as genomics becomes more integrated with precision medicine initiatives), and tools to visualize and analyze the data. The third annual Genomics & Sequencing Data Integration, Analysis and Visualization Symposium will present concrete use cases in life sciences where analysis and visualization of big data have made a difference in scientific decisions. Thought leaders will discuss the trends in genomic data, big data analytics, and translational informatics, and how dealing with data complexities has advanced research and patient health.

Final Agenda

Thursday, March 10

7:30 am Registration and Morning Coffee


9:00 Chairperson’s Opening Remarks

Cindy Crowninshield, RDN, LDN, Senior Conference Director/Team Lead, Molecular Medicine Tri-Conference

9:10 Power to the People: Annotation, Analysis and Visualization for Systems Biology and Precision Medicine Using Crosstalker™

Mark Chance, Ph.D., Vice Dean for Research; Director, Center for Proteomics and Bioinformatics; Charles W. and Iona A. Mathias Professor of Cancer Research, School of Medicine, Case Western Reserve University and Neo Proteomics, Inc.

Integration and visualization of diverse sets of molecular data is one of the most challenging yet important approaches for identifying dysregulated molecular targets in complex disease. Conceptions of disease states as high-level descriptors are rapidly evolving into molecular descriptions of disease sub-types characterized at a systems level (e.g., networks and pathways), where the inclusion of specific patients into sub-types will eventually drive their individualized treatment protocols. While genomic and pathway characterization of cancer sub-types driving individual therapeutic decisions is now field standard, and provides an important proof of principle for precision medicine, the ability to integrate a wide range of gene, protein, or metabolite level data to permit the development of precision medicine across a wide range of diseases is in its infancy. Systems biology software solutions are essential to progress in this area. However, in current commercial software solutions, both the analytical algorithms and the databases lack transparency (black box). This limits the ability of the user to understand how and why their results came about, which is essential to mechanistic understanding. In addition, the reliance on proprietary databases ignores the increasing value and reliability of public (open-source) data. This lack of transparency will be increasingly untenable in light of the international movements towards reproducibility and accuracy of results demanded by sponsors and the public. On the other hand, freeware, which offers the maximum in flexibility and agility, is typically limited to the “power-user” or professional bioinformatics community and historically lacks the levels of ongoing support and sustainability that are standard for commercial software products intended for broad user adoption.
To overcome these challenges, we have developed an integrated set of commercial tools that includes Crosstalker™, a user-friendly and transparent (i.e., white-box) analytical engine for molecular data analysis and integration, including but not limited to individual or simultaneous integration of mutations, SNPs, CNVs, array, RNA-seq, and proteomics data, based on PageRank-type approaches. This analytic framework is coupled to a molecular network generation engine called Disease Path Finder™, where the user can document the details of the networks and pathways being annotated and scored, and compare and contrast Crosstalker™ integrations across different network and pathway frameworks to investigate and understand the molecular mechanisms underlying the disease and developmental phenotypes under study. Together, these tools provide a systems biology workflow that includes the high quality of visualization and annotation expected in a commercial product while retaining the ability to integrate user molecular data with a wide range of public and private pathway and network representations, such that reproducibility and transparency of results are assured, paving the way for the adoption of precision medicine.
Authors: Mark R. Chance, Gurkan Bebek, Mehmet Koyuturk, and Sean Maxwell

9:40 Benchtop Sequence Analysis: Empowering Bench Scientists to Analyze Big Data through Web Interfaces

Dave Barkan, Ph.D., Investigator, Infectious Diseases, Novartis Institutes for BioMedical Research

As Next-Generation Sequencing library construction and data analysis have become refined and standardized, scientific focus has shifted from method development to results visualization and interpretation. For some basic sequence analysis tasks, a bioinformaticist's role may be simply to launch an established software pipeline on the command line with default parameters and send the generated results back to the bench scientists. In NIBR, the Bioinformatics and IT groups are working together to eliminate this intermediate step by building web front-ends that launch in-house bioinformatics pipelines and return the results and visualizations directly to the end-users in their browser.


10:10 Metabolomics, the Microbiome and Understanding Complex Diseases

Andreas Kogelnik, M.D., Ph.D., Founder and Director, Open Medicine Institute

-Omic biotechnologies such as gene expression profiling, metabolomics and gut microbiome analysis are advancing rapidly in terms of their utilization on the research front and are showing promise in clinical applications. Up until now, little attention has been paid to how these types of data relate to one another or to their real-world impact on disease modulation. This talk will discuss two projects focused on integrating blood metabolomic and gut microbiomic data with direct clinical application. We will discuss how these technologies are being used to improve diagnostic rigor and to point the way to therapeutic targets, in particular for complex diseases and chronic disease management. Current integrative -omics appears on course to reshape precision diagnostics and therapies.

10:40 Coffee Break with Exhibit and Poster Viewing

11:15 Replacement Talk: In vivo Genome Engineering Using S. aureus Cas9: Development and Applications

Winston Yan, Graduate Student, M.D.-Ph.D. Program, Laboratory of Dr. Feng Zhang, Broad Institute of MIT and Harvard

The small Cas9 ortholog from Staphylococcus aureus (SaCas9) has proven to be a versatile and efficient RNA-guided endonuclease ideally suited for in vivo applications due to its ability to be packaged into the highly versatile adeno-associated virus (AAV) delivery vehicle. Here, we describe the characterization and structure of SaCas9, and its application in knocking down the cholesterol regulatory gene Pcsk9 in the adult liver as a prototype for efficient in vivo genome editing using CRISPR-Cas9.

11:45 Identification and Relevance of Fusion Transcripts in a Novel in vitro Progression Model of High-Grade Serous Ovarian Cancer

Sharmila Bapat, Ph.D., FNASc, FASc, Independent Project Investigator and Group Head, National Centre for Cell Science (NCCS), Pune, India

High-grade serous ovarian adenocarcinoma (HGSC) is recognized to rapidly progress from asymptomatic, silent onset to aggressive metastatic disease that leads to the most dismal prognosis. Lack of early diagnosis has led to an opinion that better disease management through detailed molecular and biological understanding of tumors could pave the way for development of targeted ‘personalized’ therapeutic strategies and improve patient prognosis.

12:30 Enjoy Lunch on Your Own

1:15 Session Break


1:50 Chairperson’s Remarks

Cindy Crowninshield, RDN, LDN, Senior Conference Director/Team Lead, Molecular Medicine Tri-Conference

2:00 Software and Computational Platforms to Integrate Diverse Genomics and Epigenomic Datasets

Duygu Ucar, Ph.D., Assistant Professor, Genomic Medicine, Jackson Laboratory

We are building software and computational algorithms to integrate epigenetic datasets with other data sources, including chromatin interaction maps, public data repositories (SNPs, gene sets, immune modules), and transcriptome data. Attendees will learn about the methods and software we are developing in my lab, as well as the research directions of the Jackson Laboratory for Genomic Medicine, a brand-new genomics institute dedicated to understanding human diseases, including cancer, immune diseases, and diabetes.

3:00 Refreshment Break with Exhibit and Poster Viewing

3:30 New Gene-Level Approaches to Identify Disease-Causing Mutations in Next-Generation Sequencing Data of Patients

Yuval Itan, Ph.D., MRes, Research Associate, Human Genetics of Infectious Diseases, The Rockefeller University

Ascertaining whether a gene that harbors a variation may be relevant to the disease being studied is key to testing as few potential candidate mutant alleles as possible while not excluding the disease-causing allele(s). We developed two novel gene-level approaches to estimate the relevance of a specific gene to a disease. We first describe the gene damage index (GDI), a genome-wide, gene-level estimate of accumulated mutational damage for human protein-coding genes: genes that are frequently mutated in healthy individuals are unlikely to cause rare diseases, and yet they contribute to a large proportion of the next-generation sequencing data generated for any patient. We then present the mutation significance cutoff (MSC): a gene-specific threshold (rather than the fixed threshold that current methods apply to all human genes) to differentiate between benign and damaging variants. We demonstrate that the combination of the GDI and MSC approaches significantly increases the discovery rate of new disease-causing mutations in next-generation sequencing data of patients.
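The contrast between a single fixed cutoff and a gene-specific cutoff can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the gene names, the scores, and the cutoff values are invented, and the sketch does not reproduce how the speaker derives MSC or GDI values.

```python
from dataclasses import dataclass

# Hypothetical fixed cutoff applied to every gene (illustrative value only).
FIXED_CUTOFF = 15.0

# Hypothetical gene-specific cutoffs; these are NOT real MSC values.
GENE_CUTOFFS = {"GENE_A": 8.0, "GENE_B": 22.0}


@dataclass(frozen=True)
class Variant:
    gene: str
    score: float  # predicted-impact score; higher means more likely damaging


def is_damaging_fixed(v: Variant) -> bool:
    """Classify with one threshold shared by all genes."""
    return v.score >= FIXED_CUTOFF


def is_damaging_gene_specific(v: Variant) -> bool:
    """Classify with the gene's own threshold, falling back to the fixed one."""
    cutoff = GENE_CUTOFFS.get(v.gene, FIXED_CUTOFF)
    return v.score >= cutoff


variants = [Variant("GENE_A", 10.0), Variant("GENE_B", 18.0), Variant("GENE_C", 16.0)]
fixed_calls = [is_damaging_fixed(v) for v in variants]
specific_calls = [is_damaging_gene_specific(v) for v in variants]
```

In this toy data the two schemes disagree on the GENE_A and GENE_B variants, which is the kind of reclassification a gene-specific threshold is meant to allow.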

4:00 Clinical Transcriptomic Profiles – Providing Clues for Novel Therapeutic Development Strategies: A Case Study on Psoriasis

Deepak K. Rajpal, D.V.M., Ph.D., Director, Computational Biology, Target Sciences, GlaxoSmithKline

Psoriasis is a chronic inflammatory skin disease with complex pathological features. By mining publicly available clinical transcriptomic profile data, we present a framework for developing new therapeutic intervention strategies. We propose a psoriasis disease signature; the reversal of this signature on therapeutic intervention presents approaches to drug repurposing and novel target selection strategies. These approaches could support biomarker and drug discovery strategies for psoriasis.

4:30 Ten Things You Probably Don’t Know About GenBank

Ben Busby, Ph.D., Genomics Outreach Coordinator, NCBI, NIH

5:00 Reception with Exhibit and Poster Viewing

6:00 Close of Day

Friday, March 11

8:00 am Morning Coffee


8:25 Chairperson’s Remarks

Martin Gollery, CEO, Tahoe Informatics

8:30 Role of Hadoop and Data Analysis to Move Genomics from Research to Personalized Medicine

Martin Gollery, CEO, Tahoe Informatics

9:00 Evolution of a Genomics Data Ecosystem: Efficient NGS Data Tracking, Processing, Integration and Results Sharing

Lihua Yu, Ph.D., Vice President, Data Science and Information Technology, H3 Biomedicine, Inc.

H3 Biomedicine is an oncology drug discovery company that leverages cancer genomics data generated externally and internally throughout our target validation and drug discovery efforts. Our goals are to use genomics data to inform our drug discovery efforts and, most importantly, to allow data exploration by all scientists. Toward these goals, we have built a genomic data ecosystem whose components range from data storage, NGS data analysis with pipelines and workflow management tools, genomic data management/warehousing using AWS Redshift, and a genomic data integration system that allows data exploration by both computational biologists and other scientists, to results and knowledge sharing in a company-wide collaboration platform. Very importantly, we developed a genomics experiment and sample tracking system; the common IDs created in this system serve as unique identifiers that tie all components of the ecosystem together to allow efficient data flow and data/information retrieval. This presentation discusses the importance of having such an ecosystem, which also provides tractability, visibility and reusability of both the data and the scientific insights from genomics studies.

9:30 Speeding Up Drug Research with MongoDB: Introducing MongoDB into an RDBMS Environment

Doug Garrett, Research Leader, NGS Pipeline Development Group, Roche Sequencing

Genetic testing of animal models has been critical to Genentech Research in understanding the underlying cause of many diseases and in developing drugs to address those diseases. This importance has driven an increase in both the number and complexity of genetic testing requirements for the transgenic Genetic Analysis Lab. At the same time, improvements in genetic testing technologies have driven down the cost and increased the throughput of commercially available instruments. However, integrating these new instruments into our existing system had proven time-consuming and resource-intensive, with some new instruments requiring six months or more to integrate. To increase the flexibility of the system, we embarked on a major redesign, which included the use of MongoDB, a NoSQL document database with a flexible schema. The new system has reduced the time needed to integrate new equipment from months to only weeks.

10:00 Selected Poster Presentation: Limitations and Problems of Genomic Data Sharing: Enabling Data Discovery and Accessibility

Amanda A McMurray, Ph.D., MBA, MIoD, Chief Financial Officer, Repositive Limited

The success of next generation sequencing technologies potentially opens up new horizons in clinical research and practice. But before one can really benefit from genomic clinics of the future, multiple issues must be addressed. Human genomics research relies on the availability of genomic datasets that are needed to test a hypothesis. Although a large amount of data is generated around the world, individual researchers still often lack access to it. Exemplary collaborative practices demonstrated during the realisation of the Human Genome Project do not reflect the state of data sharing in the community today: data sharing is not the default, but the exception. Data sharing has continually been recognised as important, not only for the advancement of scientific knowledge, but also for the preservation of information: verification of conclusions and safeguarding against misconduct. But data sharing in human genomics is a multifaceted challenge. Ethical considerations combined with the uniqueness of the genome of an individual require special precautions to enable sharing whilst protecting data privacy. Here, we investigate the current extent of human genomic data sharing by examining the data handling processes and needs of human genomics researchers in different settings. We explore how researchers are including data access and data sharing in their current workflows and whether any bottlenecks need to be addressed to enable more efficient data collaborations.

10:30 Coffee Break with Exhibit and Poster Viewing


11:00 Securing Sensitive Workloads in the Cloud: Best Practices and Procedures for Securing Your Data on Amazon Web Services

Brad Dispensa, Senior Solutions Architect, Amazon Web Services

Data security, access controls and monitoring are common areas of confusion for researchers interested in moving to the cloud. In this presentation, I will cover how to configure your research workloads to run securely using Amazon Web Services. We will review Amazon’s shared security model, encryption techniques, automation of security controls and resource provisioning, and HIPAA workload design patterns on AWS.

11:30 Storms and Silver Linings: Developing Cloud-based Genomic Tools for a University Community 

Stephan Sanders, Ph.D., Assistant Professor, Department of Psychiatry, University of California, San Francisco

12:00 pm Lessons Learned Scaling Up Analysis for Thousands of Samples Using Amazon Web Services

Ravi Madduri, Fellow, Computation Institute, University of Chicago; Project Manager, Math and Computer Science Division, Argonne National Lab

Globus Genomics is a cloud-based, large-scale genomics analysis service that is used by research consortia and healthcare providers to analyze thousands of raw genomics datasets. To deliver analysis results on tight deadlines, we created cost-aware resource scheduling on AWS that leverages the computational profiles we built for various tools to schedule cost/performance-optimized execution. In this talk, we will present some of the use cases and success stories from our work.
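As a rough illustration of what cost-aware scheduling from tool profiles can look like, the following Python sketch picks the cheapest instance type whose estimated runtime still meets a deadline. It is not the Globus Genomics scheduler: the instance names, prices, runtimes, and the `cheapest_instance` helper are all hypothetical.

```python
from typing import Optional

# Hypothetical hourly prices (USD) per instance type; not real AWS pricing.
PRICE_PER_HOUR = {"small": 0.10, "medium": 0.40, "large": 1.60}

# Hypothetical computational profiles: estimated runtime in hours
# for each tool on each instance type.
TOOL_PROFILE = {
    "aligner": {"small": 20.0, "medium": 6.0, "large": 2.0},
    "variant_caller": {"small": 8.0, "medium": 3.0, "large": 1.5},
}


def cheapest_instance(tool: str, deadline_hours: float) -> Optional[str]:
    """Return the lowest-cost instance type that finishes before the deadline.

    Cost is estimated as runtime * hourly price. Returns None when no
    instance type in the profile can meet the deadline.
    """
    candidates = [
        (hours * PRICE_PER_HOUR[itype], itype)
        for itype, hours in TOOL_PROFILE[tool].items()
        if hours <= deadline_hours
    ]
    return min(candidates)[1] if candidates else None
```

With these invented numbers, a loose deadline favors the slow but cheap instance, a tighter deadline forces a more expensive one, and an infeasible deadline yields no candidate at all.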

12:30 Close of Symposium