Bioinformatics and Computational Biology are two closely related fields, and as such, many people in the scientific and academic worlds refer to them interchangeably. But is this truly accurate? Are these fields so close together that we can lump them into one big category, or are there significant differences that need to be discussed and understood? In this article, we'll go over the commonalities and differences between these two fields, and which path would be most suitable for you, depending on your research interests.
Difference Between Computational Biology and Bioinformatics: What's the Major Difference?
As we mentioned above, to some people, Computational Biology and Bioinformatics are no different. To others, such as Dr. Russ Altman of Stanford University, there is a very concrete difference between the two. He believes that Bioinformatics is where you create the tools, software, and algorithms that can be used to handle and work with large biological data systems. Likewise, in his mind, Computational Biology is all about learning and studying biology by using the computational tools and software made by Bioinformaticians.
So, according to Dr. Altman's definition, if you're somebody who primarily enjoys being on the creation side of things and wants to add to the tools and resources available for people to analyze their biological data, Bioinformatics would be the path for you. But if you'd rather use existing computational tools to study and understand biology better, then you'd probably want to go towards Computational Biology.
Computational Biology vs Bioinformatics Academic Programs
Luckily for many of you who go the Computational Biology and Bioinformatics academic route, most graduate programs combine these two fields into a single degree program, which gives you more flexibility in figuring out exactly which niche fits you best. But what happens if you're interested in programs that are either Bioinformatics or Computational Biology instead of both? Which one should you choose?
If this is your situation, you shouldn't make a decision based upon the name. Instead, look at the faculty members of each program, and see what kind of research they are doing. You might find that a program that is Bioinformatics-based has a lot of Computational Biology research, and vice versa. By researching faculty, and the type of research that is being conducted at prospective universities, you can get an idea of what kind of research you might be involved in if you choose to go to that specific school.
Which One Should You Pursue?
Choosing whether you want to become a Bioinformatician or a Computational Biologist comes down to figuring out whether you want to be at the forefront of creating computational software for biology or if you'd rather be using these tools to conduct your research. But just remember, things aren't as cut and dried as we've made them out to be, and you still may see job positions or academic programs use these two terms interchangeably. Instead of relying on the title to help decide where you want to go, figure out exactly what your research interests are and which program or position does the best job of utilizing them.
About the Author
Basil Khuder is the director and founder of YDSOA. He started YDSOA in 2015, hoping to create an online community for those new to the fields of Data Science and Informatics. When he's not running the organization, he's busy with his research and studies as a Doctoral Bioinformatics student at Iowa State University. You can follow Basil through any of his social media accounts.
Introduction to Transcriptomics
Transcriptomics is the study of all of the transcripts produced by a single cell, individual or population. It has gained much traction since the creation of RNA-Seq, a Next-Generation Sequencing method that allows for high-throughput analysis of transcripts. But the question remains: what can we learn as researchers from studying RNA and transcripts that we couldn't learn by looking at the DNA level?
There was a time when scientists believed that anything that didn't code for a protein was junk. This misconception meant that we only cared about transcripts that were being translated into proteins. Over time, researchers began to realize that non-coding regions of the genome were not junk, and had significant, biologically functional roles. For example, we now know introns play vital roles in gene regulation, so if we disregard all of the non-protein-coding regions, we are missing out on a lot of relevant information.
Because of this newfound understanding, science has seen a substantial increase in researchers harnessing the power of Next-Generation Sequencing, especially RNA-Seq.
Transcriptomic Software and Tools
We already mentioned that RNA-Seq is one of the primary methods for finding out all of the RNA that a particular cell or tissue produces. But you'll need a downstream pipeline or software tool ready to process all the information it generates. We've compiled a list of software that can be used when studying transcriptomics.
RNA-Seq by Expectation Maximization: RSEM is a software package that allows users to find expression level information about the transcripts present within their data. If you're using RNA-Seq data, there's a pipeline available that handles both the alignment of your data and the expression estimates. Once the pipeline is run, RSEM will output how much expression each transcript has, and it gives you valuable visualization tools based on your data as well.
Trinity RNA-Seq: Trinity is a transcriptome assembly and annotation software package. It allows for de novo transcriptome assembly based on RNA-Seq data. Some of the downstream analyses that it provides include:
Quantifying the abundance of genes and transcripts
Checking the quality of samples and replicates
Conducting differential gene expression analysis
VennBLAST: VennBLAST is a transcriptome tool that allows for visual comparison of transcriptomes across samples. The researchers who created VennBLAST refer to it as a downstream transcriptome tool. Specifically, they state the following:
When dealing with Next-Generation Sequencing data for the first time, you might be a little confused when seeing all the different types of sequencing files that are out there. Although it may seem intimidating at first, a little bit of time around these files and you'll become a sequencing pro in no time!
FASTQ files are sometimes referred to as the raw sequencing reads. They are usually the file format that you receive from whatever company you have chosen to conduct the Next-Generation Sequencing of your data (or from the machine itself, if you performed the sequencing). The reason we refer to them as raw reads is that the file has all of the reads from your data, without any additional processing conducted on them. The other file formats that we talk about later will have had something done to them that changes the way we can process the data.
The image below shows an extremely simplified view of how the FASTQ file comes to be. For example, let's say you are interested in getting heart tissue sequenced for your research. You isolate the heart tissue sample and send it off to a company to get it sequenced. Due to how sequencing is currently conducted by the most popular companies, the file that you will end up getting will contain chunks of your original DNA sequence, each a fixed number of base pairs long (anywhere between 75 and 200), with a string of quality scores below the nucleotides, one character per base. Each quality character corresponds to a particular number. In our example, we have included the @ quality score, which has a value of 31 under the commonly used Phred+33 encoding (the ASCII code of @ is 64, and 64 - 33 = 31).
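To make the quality scores a little more concrete, here is a minimal Python sketch that reads a single FASTQ record and converts each quality character into its numeric score. It assumes the Phred+33 encoding, and the read name, bases, and quality string are made up for illustration. The function names are our own and are not part of any library.

```python
# A hypothetical four-line FASTQ record (name, bases, and qualities are invented).
record = """@read_1
GATTACA
+
@@@I#5@
"""


def parse_fastq_record(text):
    """Split one four-line FASTQ record into name, bases, and quality string."""
    header, bases, separator, quality = text.strip().split("\n")
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("bases and quality string differ in length")
    return header[1:], bases, quality


def phred33_scores(quality):
    """Convert quality characters to scores, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality]


name, bases, quality = parse_fastq_record(record)
scores = phred33_scores(quality)

# Print each base next to its quality character and numeric score.
for base, char, score in zip(bases, quality, scores):
    print(base, char, score)
```

Real FASTQ files hold millions of these records back to back, so in practice you would loop over the file four lines at a time or use an established parser rather than a hand-written one.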
Aligned Format Files: BAM and SAM
Raw sequencing files can give you an idea of the quality of the sequencing that was conducted and other general information about your data. But what if you wanted to find out how your heart tissue data differs from the tissue of other individuals? You would not be able to find this out by just analyzing your raw FASTQ file. This is where genomic alignment comes into play. Genomic alignment is the process of taking your raw sequencing data and aligning it to a reference genome. (If you don't know what a reference genome is, it's an assembled genome sequence that is representative of a particular species.)
The SAM file, which stands for Sequence Alignment/Map, will have all the reads of your data, just like the FASTQ file had, along with where each read aligns on the reference genome. (A BAM file holds the same information in a compressed binary form.) Once the reads are lined up against the reference, we can compare each nucleotide in our data with the reference nucleotide at that position. So, going back to our example data, if we had aligned it to a reference genome, we may see something like this:
As you can see, all of our data matches the reference, except for the bolded G. So what does this mean? It could be that at that position, our data has a single nucleotide polymorphism, or it could be a sequencing error.
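The comparison described here can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the read is already aligned to the reference with no insertions or deletions, the sequences and the starting position are invented, and `find_mismatches` is a hypothetical helper rather than a function from any alignment tool.

```python
def find_mismatches(read, reference, start=1):
    """Return (position, reference_base, read_base) for every mismatch.

    Assumes the read is aligned to the reference without gaps, and that
    `start` is the 1-based reference position of the read's first base.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference segment must be the same length")
    return [
        (start + offset, ref_base, read_base)
        for offset, (read_base, ref_base) in enumerate(zip(read, reference))
        if read_base != ref_base
    ]


# Invented example: the read carries a G where the reference has an A.
read = "ACGTGCA"
reference = "ACGTACA"

mismatches = find_mismatches(read, reference, start=100)
print(mismatches)
```

A single mismatch like this one cannot tell you whether you are looking at a real polymorphism or a sequencing error. That is why real pipelines look at many overlapping reads, along with their quality scores, before calling a variant.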
Variant Call Format Files: VCF
We just mentioned that comparing our data to a reference genome is useful for finding how our data differs from the consensus genomic sequence. At this stage, you could use a genome viewer, such as the Integrative Genomics Viewer (IGV), and manually analyze these differences. Or you could run something called variant calling, and produce a list of all of the variants that are present, in a file format called a Variant Call Format file, or VCF. A VCF file will tell you the exact position of each variant, what the allele is in the consensus genome (the reference allele), and what the allele is for your individual (the alternative allele). An example VCF file is shown below:
The first column of a VCF file is the chromosome. Depending on what reference genome was used for alignment, you may get the chromosome listed as in the image (with the chromosome abbreviation, chr, followed by the chromosome number), or you may only get the chromosome number. The second column has the actual position within the specified chromosome. The third column in our example has a period, but VCF files will often have a variant identification number here, such as a SNP ID, which means that this variant has already been identified and is listed within various databases. The fourth column is the reference allele that we referred to above, while the fifth column is the alternative allele. The last two columns both contain tidbits of information that we will discuss in a later article. For now, just know that the sixth column holds a variant quality score, while the seventh column indicates whether that variant passed or failed the filters used to remove false positives.
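If you'd like to see how these columns map onto something you can work with, here is a small Python sketch that splits one VCF data line into the seven columns described above. The example line is invented, and the `Variant` class and `parse_vcf_line` function are hypothetical names of our own. Keep in mind that real VCF files also start with header lines beginning with # and carry additional columns after the seventh, which this sketch ignores.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """The first seven tab-separated columns of a VCF data line."""
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: str
    qual: str           # kept as text, since it can be "." when missing
    filter_status: str


def parse_vcf_line(line):
    """Parse one VCF data line (not a header line starting with '#')."""
    if line.startswith("#"):
        raise ValueError("header lines are not variant records")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        raise ValueError("expected at least seven tab-separated columns")
    chrom, pos, variant_id, ref, alt, qual, filter_status = fields[:7]
    return Variant(chrom, int(pos), variant_id, ref, alt, qual, filter_status)


# Invented example line: an A-to-G variant with no known identifier.
example_line = "chr1\t12345\t.\tA\tG\t50\tPASS"

variant = parse_vcf_line(example_line)
print(variant.chrom, variant.pos, variant.ref, variant.alt, variant.filter_status)
```

For anything beyond a quick look, a dedicated VCF library is a safer choice than splitting lines by hand, since it will also handle the header and the extra columns for you.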