Today we will explore the application and inference driven from the path breaking data science approach from big data analytics in the field of Genomics. Research and data sciences driven approach in this field lays emphasis on sequencing technology which has resulted in the dramatic increase of sequencing data, which, in turn, requires efficient management of computational resources, such as computing time, memory requirements as well as prototyping of computational pipelines. I will be exploring these new techniques and solution oriented approaches in this blog entry.
Genomics broadly relates, among other things, to study of Genetic Material. It involves processes like sequencing, mapping, and analysis of RNA and DNA codes of a large variety of living beings. The recent years of rapid advancements in Life Science & Healthcare has seen a lot of focus on determining the entire DNA sequence of humans to understand the genetic features of the human body. The primary objective of this research is to find out the relation between genes and heredity which is going to be instrumental in a much effective and conclusive disease prevention and cure.
Just like any basic problem statement, let’s see how we can bring a definite structure to this problem utilizing a solution driven approach. The vast amount of data generated from genome sequencing, genome mapping, and analysis has made it critical for the scientific field to adopt Big Data. The data that is created due to Genomics is huge, with each human genome having 20,000-25,000 genes, each comprising of 3 million base pairs. Thus an average human genome data could amount to around 100 Gigabytes. Thus Genomics Study, in general would be dealing with petabytes of data with data addition increasing in an exponential fashion as we dwell more into it to find the answers of human health.
Big Data Analytics becomes critical in Genomics because of its ability to store, transform and analyze large amounts of genomic information which can unearth highly valuable medical insights for disease prevention and cure.
The Genome Wide Association Studies (GWAS) who is at the forefront to Genomics are using multiple Big Data & Analytics models to conduct research based on exploring the connections between genes and diseases. There have been more than 1600 genome studies which have established distinct connection between 2000 genes and more than 400 common human disease symptoms. Some of the applications that GWAS is engaging are:
Genomics covers number of biomedical processes, each requiring vast amount of data processing, analysis and Big Data storage and manipulation. These can be broadly categorized to the following, among other points:
5. Synteny: A process involving assessing two or more genomic regions to deduce if it comes from a single ancestral genomic region. This has similar system requirements to Genomic Comparisons basing on complex statistical correlation algorithms.
Just a few years ago, the cost of human genome sequencing was around $100 million. Today it may be less than $50 million. This downward trend is only going to continue in the coming years with stronger and wider adoption of Big Data. Both the research community and the Healthcare Industry have been working towards making genome sequencing affordable and accessible to the general public. Today, an individual human genome sequencing costs only around $5000.
In a traditional setup, with vast amount of data stored in databases, tests that use extract, transform, load (ETL) would take an incredibly long time with that much data. With Big Data solution like Hadoop, there is no ETL. Hence analysis of data is relatively very quick, which would save on a lot of time. Onset of newer and more widely ‘being adopted’ tools like Spark and Python based systems enable a steady handshake and easier time to market solutions to work on this Big Data with far greater results and possibilities.
Big Data Systems like Hadoop allows us options to perform very custom analysis that would not be possible in a traditional business intelligence tool, neither would it work for an SQL relational type setup.
The Big Data generation and acquisition create vast challenges for storage, transfer and security of information. It may now be less costly to create data than it is to store it. As an example, the NCBI which has been a frontier of Big Data efforts in biomedical science going all the way since 1988 hasn’t been able to come up with a comprehensive, safe & inexpensive solution to the data storage problem.
There is always a gap between implementing or envisioning Big Data overall when it comes to Data Sciences. The challenge is to demystify the complex Big Data problem into smaller more achievable data problems and solutions. The most critical adoption hurdle is the cost to benefit analysis, which works out well in your favor, as a researcher only when you can get to the basics of a business problem and stich it together with the intent to bring out a quantifiable RoI churn.
Big Data management involving large investment makes it beyond the reach of small laboratories or institutions. This poses a big challenge in conducting extensive and parallel biomedical research.
Another challenge is to transfer data from one place to another. At the current setup, it is mainly performed by external hard disks through postal service. An alternative to it could be the use of Biotorrents for data transfer, which will allow sharing of scientific data using a peer-to-peer file sharing technology. Torrents which were primarily developed to facilitate transfer of large amounts of data over the internet could be applied to biomedical research.
Security and privacy of data from individuals is also a concern. Possible solution of implementing advanced encryption algorithms like those implemented in banking in the financial sector to secure client data could be a valid workaround.
3AI is India’s largest platform for AI & Analytics leaders, professionals & aspirants and a confluence of leading and marquee AI & Analytics leaders, experts, influencers & practitioners on one platform.
3AI platform enables leaders to engage with students and working professionals with 1:1 mentorship for competency augmentation and career enhancement opportunities through guided learning, contextualized interventions, focused knowledge sessions & conclaves, internship & placement assistance in AI & Analytics sphere.
3AI works closely with several academic institutions, enterprises, learning academies, startups, industry consortia to accelerate the growth of AI & Analytics industry and provide comprehensive suite of engage, learn & scale engagements and interventions to our members. 3AI platform have 16000+ active members from students & working professionals community, 500+ AI & Analytics thought leaders & mentors and an active outreach & engagement with 430+ enterprises & 125+ academic institutions.