SeqHBase: a big data toolset for family-based sequencing data analysis

High-throughput sequencing technologies are increasingly used to find disease genes, but it is difficult to extract biological insights from massive amounts of data in a short period of time. We developed a software framework called SeqHBase to help identify disease genes quickly. SeqHBase is built on the Apache Hadoop and HBase infrastructure and works in a distributed, parallel manner across multiple data nodes. Its input includes coverage information for 3 billion sites, over 3 million variants, and their associated functional annotations for each genome. With 20 data nodes, SeqHBase took about 5 seconds to analyze whole-exome sequencing data for a family quartet and approximately 1 minute to analyze whole-genome sequencing data for a 10-member family. We demonstrated SeqHBase’s high efficiency and scalability with several real sequencing data sets. (By Min He, Ph.D., http://jmg.bmj.com/content/early/2015/01/13/jmedgenet-2014-102907 )
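
The abstract does not describe SeqHBase's table schema or its inheritance filters, so the following is only a rough sketch of the general idea of scanning family variants in parallel across shards. Everything in it is an assumption made for illustration: the `row_key` layout, the hash-based `shard` function standing in for data nodes, and the simple "carried by every affected member, absent from every unaffected member" filter. It uses plain Python threads rather than Hadoop or HBase.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def row_key(chrom, pos, ref, alt):
    """Hypothetical row key; the position is zero-padded so keys sort by position."""
    return f"{chrom}:{pos:09d}:{ref}>{alt}"


def shard(rows, n_shards):
    """Split rows into n_shards dicts by a stable hash of the key (stand-in for data nodes)."""
    shards = [dict() for _ in range(n_shards)]
    for key, genotypes in rows.items():
        shards[zlib.crc32(key.encode()) % n_shards][key] = genotypes
    return shards


def scan_shard(shard_rows, affected, unaffected):
    """Return keys of variants carried by all affected and no unaffected members.

    Genotypes are alternate-allele counts (0, 1, or 2). A missing sample is
    treated as 0 here; a real analysis would consult coverage data to tell
    "not covered" apart from "reference".
    """
    hits = []
    for key, gt in shard_rows.items():
        in_all_affected = all(gt.get(s, 0) > 0 for s in affected)
        in_no_unaffected = all(gt.get(s, 0) == 0 for s in unaffected)
        if in_all_affected and in_no_unaffected:
            hits.append(key)
    return hits


def find_candidates(rows, affected, unaffected, n_shards=4):
    """Scan every shard concurrently and merge the per-shard hits into one sorted list."""
    shards = shard(rows, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        results = pool.map(lambda s: scan_shard(s, affected, unaffected), shards)
        return sorted(key for part in results for key in part)
```

Calling `find_candidates(rows, affected=["child1", "child2"], unaffected=["mother", "father"])` on a dictionary that maps row keys to per-sample genotype counts returns the keys that pass the filter. Because each shard is scanned independently, the work divides naturally across workers, which is the property the abstract attributes to running over multiple data nodes.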