Cambridge Healthtech Institute’s Fifth Annual

Bioinformatics for Big Data

Converting Data into Actionable Knowledge

March 7 – 9, 2016 | Moscone North Convention Center | San Francisco, CA
Part of the 23rd International Molecular Medicine Tri-Conference


Bioinformatics continues to face the challenges of integrating molecular biological information with quality patient care and of accelerating the discovery of useful new therapies. These challenges stem from the growing volume of biomedical research that must be performed on large-scale data. Scientists can now study entire systems of data in parallel using a variety of tools and methods, but tremendous computational resources are required to store, compute, analyze, and share those data. Through lectures and panel discussions, the Fifth Annual Bioinformatics for Big Data program will assemble thought leaders who will present solutions to some of these challenges. You’ll hear the latest developments in bioinformatics database platforms and tools that provide scalable archive strategies, optimize computing capacity, simplify data and NGS workflows, enable flexible and open analysis and interpretation, support better collaboration models, and shorten time to science. You’ll also hear best-practice case studies of taking data from multiple -omics sources and aligning them with clinical action. Turning big data into smart data can lead to real-time assistance in disease prevention, prognosis, diagnostics, and therapeutics.

Final Agenda

Day 1 | Day 2 | Day 3 | Download Brochure

Monday, March 7

10:30 am Conference Program Registration


11:50 Chairperson’s Opening Remarks

Andreas Kogelnik, M.D., Ph.D., Founder and Director, Open Medicine Institute

12:00 pm Genomics Based Medicines for Masses - Problems and Promises

Harpreet Singh, Ph.D., Scientist – D, Indian Council of Medical Research

Smart data is big data made actionable in real time. It’s about the actions you take in response to data, not merely collecting the data. Until now, the trend has been to integrate data from multiple sources (instruments, clinical, biochemical, epidemiological, molecular, etc.) and then use data mining tools to analyze trends. Turning this data from big to smart can lead to real-time assistance in disease prevention, prognosis, diagnostics and therapeutics.

12:30 Big Data in Cancer Research and Precision Oncology

Anthony R. Kerlavage, Ph.D., Chief, Cancer Informatics Branch, Center for Biomedical Informatics & Information Technology, National Cancer Institute

The advancement of cancer research and the promise of precision medicine in oncology can be accelerated by broad access to the growing body of cancer data, sustainable tools, and high-performance computing resources. This presentation will focus on the NCI Genomic Data Commons and Cancer Genomics Cloud Pilots as models for democratizing access to data from The Cancer Genome Atlas (TCGA), other cancer research data, and precision medicine clinical trials. The rationale for the pilots will be presented along with an overview of the three different approaches being taken and the context for a future cancer knowledge commons.

1:00 Session Break

1:15 Luncheon Presentation I to be Announced

1:45 Luncheon Presentation II to be Announced

Hugo Lam, Senior Director, Bioinformatics, Research & Development, Bina Technologies

Advancements in NGS technologies have produced a massive number of short-read sequences, making secondary analysis a challenging big data problem. In this seminar, we will talk about the current approaches at Bina to assessing and improving the accuracy of NGS algorithms, with research ranging from genomics to cancer genomics and transcriptomics.

2:15 Session Break

2:30 Chairperson’s Remarks

Andreas Kogelnik, M.D., Ph.D., Founder and Director, Open Medicine Institute

2:40 Issues Surrounding Genomically-Guided Individualized Cancer Clinical Trials

Nicholas J. Schork, Ph.D., Professor and Director, Human Biology, J. Craig Venter Institute

3:10 Big -Omics Data Coupled with Health Coaching to Optimize Wellness and Minimize Disease

Nathan D. Price, Ph.D., Professor & Associate Director, Institute for Systems Biology

Future medicine will be more proactive and data-rich than anything previously possible, and will focus on maintaining and enhancing wellness rather than just reacting to disease. We have launched a large-scale, 100K-person wellness project that integrates genomics, proteomics, transcriptomics, microbiomes, clinical chemistries and wearable devices of the quantified self to monitor wellness and disease. I present results from our proof-of-concept pilot study in a set of 107 individuals (the Pioneer 100 study) over the past year, showing how the interpretation of these data led to actionable findings for individuals to improve health and reduce risk drivers of disease.

3:40 Managing and Analyzing Big Biomedical Data with Globus

Kyle Chard, Ph.D., Senior Researcher and Fellow, Computation Institute, University of Chicago and Argonne National Laboratory

Globus provides software-as-a-service (SaaS) for research data management, including data transfer, synchronization, sharing and publication. Unlike other SaaS providers, Globus provides these capabilities directly from users’ computers, without the need to replicate data in the cloud. Here I describe Globus and discuss how it can be used to manage and analyze big biomedical data.

4:10 Solving the File Exchange Problem for Bioinformatics

Michelle Munson, CEO & Co-Founder, Aspera, an IBM Company

As new research techniques create gigabytes of data, the need to ingest and exchange digital files quickly, easily, securely, and with the cloud’s scale-up capacity is critical. A new SaaS platform allows any organization to establish a branded web-based presence for fast, easy and secure exchange and delivery of any size data between separate organizations.

4:40 Refreshment Break and Transition to Plenary Session

5:00 Plenary Session

6:00 Grand Opening Reception in the Exhibit Hall with Poster Viewing

7:30 Close of Day

Tuesday, March 8

7:00 am Registration and Morning Coffee

8:00 Plenary Session

9:00 Refreshment Break in the Exhibit Hall with Poster Viewing


10:05 Chairperson’s Remarks

Martin Gollery, CEO, Tahoe Informatics

10:15 Optimizing AWS Hadoop for Bioinformatics: A Case Study

Zhong Wang, Ph.D., Computational Biologist & Genome Analysis Group Lead, Lawrence Berkeley National Lab & DOE Joint Genome Institute; Adjunct Associate Professor, University of California at Merced

Apache Hadoop-based bioinformatics solutions have recently been developed to tackle the challenge of analyzing rapidly growing next-generation sequencing (NGS) data. Among them, BioPIG is a toolkit based on Hadoop and PIG that enables easy parallel programming and scaling to datasets of terabyte sizes. However, BioPIG has not been optimized for performance. When running on Amazon Web Services (AWS), the baseline performance may lead to high computational costs. In this study we aim to optimize Hadoop parameters to improve the performance of BioPIG on AWS. We chose k-mer analysis as an example because it is an essential part of a large number of NGS data analysis tools. We tuned five Hadoop parameters on a customized Hadoop cluster. We found that each parameter tuning experiment led to a different degree of performance improvement, and the overall job execution time was reduced by 50% with an optimized parameter setting. We believe this tuning practice provides a valuable reference for other similar applications that generate large volumes of intermediate data.


10:45 Prediction of Protein Structure, Dynamics and Function on the Genomic Scale

Andrzej Kloczkowski, Ph.D., Professor, Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s Hospital and Department of Pediatrics, The Ohio State University College of Medicine

Recent progress in mass-scale sequencing projects has produced enormous numbers of protein sequences for which crystallographic structures have not yet been determined. Additionally, despite the huge investments in high-throughput protein crystallography and the important efforts of the Protein Structure Initiative (PSI) Centers, the gap between the number of experimentally solved protein structures and the number of known sequences continues to widen. Knowledge of protein structure is critical for comprehending protein function, for understanding the molecular mechanisms of disease, and for developing new generations of medicines based on computer-aided drug design. Because of this, there is an urgent need to improve the existing computational methods of structure prediction so that they ultimately reach an accuracy comparable to the resolution of crystallographic or NMR structure determination. Another extremely important aspect of improved structure prediction is the computational design of completely new proteins with desired properties that have not yet been created naturally by evolution. Computational protein structure prediction and design usually leads not to a single model but to many alternative models corresponding to local, nonnative energy minima, so it becomes critical to develop potentials, scoring functions, and model quality assessment and refinement programs that can identify the structural model closest to the native state and successfully refine it. Knowledge of protein structure is the first step in determining biological function. In the last 15 years it has been shown that protein structure determines protein dynamics, and that knowledge of protein flexibility and its fluctuational dynamics is critical for determining protein function. The theoretical methods of normal mode analysis and elastic network models of biomolecules will be presented.
We discuss all these important problems and propose new methods for genome-wide protein structure and function prediction that have recently been blind-tested with considerable success in Critical Assessment of protein Structure Prediction (CASP) experiments.

11:15 L1000CDS2: LINCS L1000 Characteristic Direction Signature Search Engine Predicts Kenpaullone as a Potential Therapeutic for Ebola

Avi Ma’ayan, Ph.D., Professor, Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai

The Library of Integrated Network-based Cellular Signatures (LINCS) program aims to systematically profile the molecular and phenotypical outcomes of agent-perturbed human cells. The variety of agents includes chemical compounds, different micro-environments, endogenous ligands, single gene knockdowns and overexpressions. The LINCS L1000 dataset comprises over a million gene expression profiles of chemically or genetically perturbed human cell lines. However, maximally extracting knowledge from such a large dataset for further analysis is challenging. We show that processing the L1000 data with the Characteristic Direction method significantly improves signature mappings through several benchmarking pipelines. This processed dataset is served through a state-of-the-art signature search engine called L1000CDS². To demonstrate the utility of L1000CDS², we collected expression signatures from human cells infected with Ebola virus at 30, 60 and 120 minutes after infection. Querying these signatures with L1000CDS², we identified kenpaullone, a GSK3B/CDK2 inhibitor that we show, in subsequent experiments, has a dose-dependent efficacy in inhibiting Ebola infection in vitro without causing toxicity. L1000CDS² was also applied to prioritize small molecules that are predicted to reverse expression in 670 disease signatures extracted from the Gene Expression Omnibus (GEO).

11:45 Integrated Information Management Systems: Promise and Potential

Timothy Hoctor, Vice President Professional Services, Life Sciences, R&D Solutions, Elsevier

Additional Speaker to be Announced

The focus on analysis of large databases, ‘big data’, continues to increase as the collections of scientific observations accumulate. Elsevier has collected tens of millions of facts from scientific literature in the form of semantic triples. In biology the facts are triples like “A upregulates/downregulates/causes B”, where A and B can be compounds, diseases, drugs or other entity types. The relationship is also qualified by species, tissues, and other variables. In chemistry, relations such as “compound C inhibits target A” are also qualified by variables such as potency, assay type, species, and variant. The number of possible combinations of these facts increases exponentially with the number of facts combined. By combining these observations in biology and chemistry we can explore questions such as “based on the known targets drug A inhibits, what other diseases might it treat, based on disease pathways reported for all other diseases” and “given the pathways reported to cause a disease, and compounds known to inhibit those pathways, what known compounds/structure scaffolds could be tested to treat the disease”. We will present examples of using data frameworks that combine Elsevier and open source pathway and biological activity databases to explore these questions with the broadest possible databases.

12:15 pm Session Break

12:25 Luncheon Presentation I to be Announced


12:55 Luncheon Presentation II (Sponsorship Opportunity Available)

1:25 Refreshment Break in the Exhibit Hall with Poster Viewing


2:00 Chairperson’s Remarks

Martin Gollery, CEO, Tahoe Informatics

2:10 NASFinder: Defining a Network Activity Score

Corrado Priami, Ph.D., Computer Science, The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI)

Network analysis is a well-recognized tool of modern biology and has proven to be a powerful aid in representing and mastering complexity. Sub-networks that are enriched with experimentally produced omics data can help explain properties of the underlying biological processes. We propose a novel method (NASFinder) to identify tissue-specific sub-networks connecting an omics-determined module and the main regulator of this module, selected among the molecules with a specific role (e.g., receptors, transcription factors, etc.). Quantification of information flow on the network topology is used to associate nodes with an activity level in information transmission and ultimately to determine the sub-network activity score. Finally, the sub-network is functionally annotated to discover its main biological function. The new NASFinder method has been implemented in a web-based, freely available resource associated with novel, easy-to-read visualization of omics data sets and network modules. We illustrate an application of the method to transcriptional data sets comprising six time points (0, 6, 48, 96, 192, 384 hours) during the differentiation of SGBS pre-adipocyte cells in vitro. We present two different analysis strategies: time-point analysis, comparing each time point against the control (0h), and time-lapse analysis, comparing each time point with the previous one. NASFinder identified the coordinate production of seemingly unrelated processes between each comparison, providing the first systems view of adipogenesis in culture.

2:40 Genomics for Every Biologist NOW: Introducing the Pantheon of Next-Generation NCBI BLAST Resources

Ben Busby, Ph.D., Genomics Outreach Coordinator, NCBI, NIH

Users of NCBI resources come from varied backgrounds, both biologically and computationally. Therefore, NCBI has developed a range of BLAST tools to address these expanding use cases. SmartBLAST allows users to taxonomically define similar proteins with the click of a button. For those interested in genomics, SRA BLAST allows access to SRA with no knowledge of genomic mapping or command line interfaces, although there is a command line interface for larger jobs, and we are developing a simple user interface for our BLAST-based RNA-seq mapper. Finally, for those interested in metagenomics, we have developed moleBLAST, a pushbutton tool that defines operational taxonomic units (OTUs).

3:10 Custom Visualizations to Support Scientific Decision Making

Christian Blumenroehr, Ph.D., Senior Scientist, Roche Innovation Center Basel, F. Hoffmann-La Roche

Data analysis is often a visual process. Especially in times of Big Data, how you visualize your data largely determines whether you can draw the right conclusions. Learn new ideas on how to leverage modern HTML5-, JS-, and CSS-based visualizations in combination with a data analysis tool.

3:40 Sponsored Presentation (Opportunity Available)

4:10 St. Patrick’s Day Celebration in the Exhibit Hall with Poster Viewing

5:00 Breakout Discussions in the Exhibit Hall 

6:00 Close of Day

Wednesday, March 9

7:00 am Breakfast Presentation (Sponsorship Opportunity Available) or Morning Coffee

8:00 Plenary Session Panel

10:00 Refreshment Break and Poster Competition Winner Announced in the Exhibit Hall


10:50 Chairperson’s Remarks

Bonnie Feldman, D.D.S., MBA, Digital Health Analyst and Chief Growth Officer, DrBonnie360

11:00 PANEL DISCUSSION: The Collaboratory at Work in Multiple Sclerosis and Beyond

Marcia Kean, Chairman, Strategic Initiatives, Feinstein Kean Healthcare

Kenneth Buetow, Ph.D., Director, Computational Sciences and Informatics, Complex Adaptive Systems Initiative (CASI), Arizona State University

Robert McBurney, Ph.D., CEO, Accelerated Cure Project for MS

PCORnet, the national research network, is catalyzing collaborations across academe, government, industry and advocacy organizations to change the research enterprise. iConquerMS™, a Patient-Powered Research Network that was recently awarded Phase II funding, has collected patient-generated health data and a portfolio of emerging collaborations, allowing Big Data analysis by ASU’s Next Generation Cyber Capability of high performance hardware, software, and people. Resulting insights will change clinical practice and accelerate research. The audience will learn from the iConquerMS™ team’s experience, including technical and cultural challenges, patient data collection methods, research collaboration strategy, tools for Big Data integration/analysis and transformation into knowledge, and potential use of this initiative as a model.

12:00 pm Innovation from the Clinical Laboratory – The New Role of -Omics-Based Testing & Decision Support

Andreas Matern, Vice President, Commercial Partnerships and Innovation, BioReference Laboratories, Inc.

In this talk we’ll discuss how clinical laboratories are moving beyond the simple “send sample, give us results” paradigm to an information-driven ecosystem, working in close partnership with providers, pharmaceutical companies, and hospitals to leverage the knowledge they have accumulated to drive new medical discoveries and improve patient care and outcomes. The emphasis will be on bioinformatics, the combination of large data sets, and building systems that work across multiple end users and groups.

12:30 Session Break

12:40 Luncheon Presentation (Sponsorship Opportunity Available) or Enjoy Lunch on Your Own

1:10 Refreshment Break in the Exhibit Hall and Last Chance for Poster Viewing

1:50 Chairperson’s Remarks

Bonnie Feldman, D.D.S., MBA, Digital Health Analyst and Chief Growth Officer, DrBonnie360

2:00 PANEL DISCUSSION: Big Data and Unmet Clinical Needs: Two Problems Separated by a Common Language

Michael Liebman, Ph.D., Managing Director, IPQ Analytics, LLC

Additional Panelists to be Announced

This panel session explores themes of bio/informatics, business, operational, clinical and real world perspectives and how each area works collaboratively to meet unstated medical needs (not just unmet needs). We will explore different models (or business models that have been inverted) to not only show how technology can work or data collection can work but how to work with data to improve a diagnosis and stratify a disease.

3:00 Integrated Analytics of GBM Tumors from FMI and TCGA Patient Data

Eric Neumann, Ph.D., Vice President, Knowledge Informatics, Foundation Medicine, Inc.

The development of diagnostic and predictive analytics is key for effectively leveraging the potential of complete genomic profiles (CGP) to transform the healthcare model. We show that the classifications of genomic alterations can be applied to multiple tumor types as well as different data sets, such as ours and TCGA. This system can then be used to discover clinically relevant relations across sample sets and even predict outcomes.

3:30 Sponsored Presentation (Opportunity Available)

4:00 Session Break

4:10 Chairperson’s Remarks

Bonnie Feldman, D.D.S., MBA, Digital Health Analyst and Chief Growth Officer, DrBonnie360

4:15 Leveraging Mobile Devices to Integrate Patient Generated Health Data in the Electronic Health Record

Rajiv B. Kumar, M.D., Medical Director, Clinical Informatics, Stanford Children’s Health; Clinical Assistant Professor of Pediatric Endocrinology & Diabetes, Stanford University; Attending Physician at Stanford Children’s Health, California Pacific Medical Center and John Muir Medical Center

The electronic health record (EHR) is the home of patient variables, healthcare provider workflow/analytics, and the method to integrate these data and outcomes across institutional boundaries on our path to effective individualized care plans. Here we will discuss methods of populating these data in the EHR without requiring additional time or effort from patients or providers.

4:45 Digital Tools for the Microbiome – An Emerging Field in Genomics and Medicine

Bonnie Feldman, D.D.S., MBA, Digital Health Analyst and Chief Growth Officer, DrBonnie360

New research shows an association between changes in the microbiome and lupus and rheumatoid arthritis. With the convergence of large population data sets and personal data, we are beginning to make progress in research, development and clinical trials in autoimmune disease. This talk will highlight new companies using data and digital tools to improve our understanding and treatment of autoimmunity.

5:15 A Full Stack Solution to Pharmacogenomics

Greyson Twist, Software Engineer, Center for Pediatric Genomic Medicine, Children’s Mercy Genome Center

Realizing personalized medicine requires integrating data from many disparate sources. To accomplish this, we are developing a three-tiered software solution: Astraea handles locus-specific knowledge management through expert curation, Constellation handles locus allele identification from next-gen data, and Astronomer handles drug phenotype prediction. Each of these tools poses its own problems, including data source integration, standardization, and biologically driven heuristic choice.

5:45 Close of Conference Program

Premier Sponsors:

Bina Technologies

Jackson Laboratory

Precision for Medicine

Silicon Biosystems

Thomson Reuters