
Blogs

Featured Entries

  • Basil

    Different Types of Sequencing Files

    By Basil

    When you work with Next-Generation Sequencing data for the first time, the number of different sequencing file types can be confusing. It may seem intimidating at first, but after a little time around these files you'll be comfortable with them.

    FASTQ Format

    FASTQ files are often referred to as the raw sequencing reads. They are usually the files you receive from the company you chose to conduct the Next-Generation Sequencing of your samples (or from the machine itself, if you performed the sequencing). We call them raw reads because the file contains all of the reads from your sample, without any additional processing. The other file formats discussed later have had something done to them that changes how we can work with the data.

    The image below shows an extremely simplified view of how a FASTQ file comes to be. For example, let's say you are interested in getting heart tissue sequenced for your research. You isolate the heart tissue sample and send it off to a company to get it sequenced. Because of how the most popular sequencing platforms currently work, the file you get back contains short chunks of your original DNA sequence (reads), each a set number of base pairs long (anywhere between 75 and 200), with a line of quality scores below the nucleotides (separated from them by a line that starts with +). Each base has one quality character, and each character corresponds to a particular number. In our example, we have included the @ quality character, which has a value of 31 under the common Phred+33 encoding.

    Aligned Format Files: BAM and SAM

    Raw sequencing files can give you an idea of the quality of the sequencing and other general information about your data. But what if you wanted to find out how your heart tissue data differs from the tissue of other individuals? You cannot find this out by analyzing your raw FASTQ file alone. This is where genomic alignment comes into play.
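As a short aside on the FASTQ quality line described above, here is a minimal Python sketch of how a quality character maps to a number. It assumes the Phred+33 encoding (older files may use a different offset), and the four-line record, including the read name read_1, is invented for illustration. The error_probability helper uses the standard Phred relationship Q = -10 * log10(p), which is an addition to the article rather than something it states.

```python
from typing import List


def phred33_scores(quality_line: str) -> List[int]:
    """Convert a FASTQ quality string to numeric scores, assuming Phred+33."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(q: int) -> float:
    """Probability that a base call is wrong, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


# A made-up four-line FASTQ record: header, bases, separator, qualities.
record = ["@read_1", "GATTACA", "+", "@@@@@I#"]
header, sequence, separator, quality = record

scores = phred33_scores(quality)
for base, score in zip(sequence, scores):
    print(base, score, round(error_probability(score), 5))
```

Each base gets exactly one quality character, so the sequence and quality lines always have the same length.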
    Genomic alignment is the process of taking your raw sequencing data and aligning it to a reference genome. (If you don't know what a reference genome is, it's an assembled genome sequence that is representative of a particular species.) The SAM file, which stands for Sequence Alignment/Map, contains all of your reads, just like the FASTQ file, but it also records where each read aligns on the reference genome, so every base in a read can be compared with the reference base at that position. A BAM file holds the same information in a compressed binary form. So, going back to our example data, if we had aligned it to a reference genome, we may see something like this:

    As you can see, all of our data matches the reference, except for the bolded G. So what does this mean? It could be that our data has a single nucleotide polymorphism at that position, or it could be a sequencing error.

    Variant Call Format Files: VCF
    As we just mentioned, comparing our data to a reference genome is useful for finding how our data differs from the consensus genomic sequence. At this stage, you could use a genome viewer, such as the Integrative Genomics Viewer (IGV), and inspect these differences manually. Or you could run a process called variant calling and produce a list of all of the variants present, in a file format called the Variant Call Format, or VCF. A VCF file tells you the exact position of each variant, the allele found in the reference genome at that position (reference allele), and the allele found in your individual (alternative allele). An example VCF file is shown below:

    The first column of a VCF file is the chromosome. Depending on which reference genome was used for alignment, the chromosome may be listed as in the image (with the abbreviation chr followed by the chromosome number), or as the number alone. The second column has the position within that chromosome. The third column has a period in our example, but VCF files often carry a variant identifier here, such as a SNP ID, which means the variant has already been identified and is listed in variant databases. The fourth column is the reference allele we referred to above, and the fifth column is the alternative allele. The last two columns in our example contain information that we will discuss in a later article. For now, just know that the sixth column is a variant quality score, and the seventh column records whether the variant passed or failed the filters applied to remove false positives.
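The seven columns described above can be read with a few lines of Python. This is only a sketch: the VariantRecord type, the parse_vcf_line helper, and the example line are all hypothetical, and real VCF files also contain header lines and further columns that are ignored here. It assumes columns are tab-separated and that a period marks a missing value.

```python
from typing import NamedTuple, Optional


class VariantRecord(NamedTuple):
    chrom: str                 # column 1: chromosome
    pos: int                   # column 2: position within the chromosome
    variant_id: Optional[str]  # column 3: identifier, None when "."
    ref: str                   # column 4: reference allele
    alt: str                   # column 5: alternative allele
    qual: Optional[float]      # column 6: variant quality score
    filter_status: str         # column 7: filter result, e.g. PASS


def parse_vcf_line(line: str) -> VariantRecord:
    """Parse the first seven tab-separated columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt = fields[:7]
    return VariantRecord(
        chrom=chrom,
        pos=int(pos),
        variant_id=None if vid == "." else vid,
        ref=ref,
        alt=alt,
        qual=None if qual == "." else float(qual),
        filter_status=flt,
    )


# An invented data line: an A-to-G variant on chr1 with no known identifier.
example_line = "chr1\t12345\t.\tA\tG\t50\tPASS"
record = parse_vcf_line(example_line)
print(record)
```

For real analyses, a dedicated VCF library is usually a better choice than hand-rolled parsing, since the format has many details this sketch skips.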
  • Basil

    Introduction to SQL

    By Basil

    When you think about handling and processing huge amounts of data, what comes to mind? For many, it is Python coupled with machine learning algorithms. What may not come to mind right away is SQL. You might be scratching your head at the prospect of using SQL in Data Science or Bioinformatics when there are other alternatives. Or, perhaps, you’re not familiar enough with the language to have an opinion. Either way, SQL is a language you should familiarize yourself with if you’re looking to jump into the world of big data.

    So why would somebody use SQL instead of the many alternatives? Simply put, SQL provides a simplicity and robustness that you can seldom find anywhere else. Data Science careers also sometimes require more than handling big data. One valuable skill is the ability to manage the databases behind web applications, a job well suited to SQL and an RDBMS (we’ll discuss this a little bit later).

    Database Introduction

    For a thorough understanding of SQL and its potential role in Data Science, some basics are needed, including an introduction to databases. First off, what exactly is a database? Put simply, a database is an organized collection of data. Within this collection, the data is further organized into tables. Each table stores a specific kind of information, and within each table, individual columns hold one specific attribute of that information. All of this may seem a little confusing, so we’ll look at a table called “table1” inside a database to clear things up. (Word of caution: the table and column names were chosen with simplicity in mind. You’re unlikely to run across names this generic, such as a table titled “table1,” when dealing with large databases.)
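A table with this shape can be sketched in a few lines of Python using the built-in sqlite3 module. The column names (clientID, address, city, state) come from the article; the three rows and the column types are invented purely for illustration, and an in-memory SQLite database stands in for whatever DBMS you end up using.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create table1 with the four columns named in the article.
conn.execute(
    "CREATE TABLE table1 ("
    "clientID INTEGER PRIMARY KEY, address TEXT, city TEXT, state TEXT)"
)

# Hypothetical clients, one per row.
rows = [
    (1, "12 Oak St", "Ames", "IA"),
    (2, "48 Elm Ave", "Toledo", "OH"),
    (3, "7 Pine Rd", "Davis", "CA"),
]
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", rows)

# Print every row to see the table's contents.
for row in conn.execute("SELECT * FROM table1 ORDER BY clientID"):
    print(row)
```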
    This sample table contains some information about a fabricated client base. In an actual database, there would be other tables containing more relevant information, but let’s pretend this is all we need for the moment. As we mentioned, each column holds a particular characteristic; here the columns are clientID, address, city and state. Each row in the table holds the data for one specific record (in this case, a particular client), whereas each column holds one attribute shared by every record.

    SQL is the language, while a Database Management System (DBMS) is the software that contains and manages the data.

    Many people get confused when they hear about MySQL, SQLite or NoSQL and don’t quite understand how these relate to SQL. In our example, we showcased our database table inside a simple Excel file. In the real world, your data will more than likely be stored in software dedicated to database management. That software is what we refer to as a Database Management System, or DBMS for short. MySQL and SQLite are both examples of a DBMS that uses SQL. NoSQL, by contrast, is not a single product but a general term for database systems that do not follow the relational model. You should not worry about mastering any particular DBMS until you get the hang of SQL itself. Most DBMSs that use SQL follow the same conventions, with some minor differences that you can learn later.

    A particular kind of DBMS is the Relational Database Management System (RDBMS), which uses a specific type of data modeling called the relational model. MySQL is one such RDBMS and is a popular database choice for websites. In fact, it’s what our site uses for database management. If you are lost with some of the technical jargon, just remember: SQL is the language, and a DBMS is the database system we use to manage our data.

    Where Can I Practice My Code?

    Since we aren’t going in-depth with the DBMS, you’re probably wondering how you’ll play around with SQL code. There are programs you can download to do so, or you can use this free online SQL interpreter. This website allows you to test out some basic SQL code without having to download a more complicated DBMS.

    The Almighty Select Statement

    In any SQL or SQL in Data Science course, the first statement you’ll learn is select, and it’s easy to see why. Most of what you do in SQL starts with retrieving the data you care about. To do this, you select certain columns. However, just naming the columns isn’t enough: you also need to tell the system where you are selecting them from. This leads us to the from clause. Whenever you use select, you’ll almost always use from as well.

    Say, for example, you want to retrieve all of the clients’ addresses from the example table above. You would use the following SQL statement (given that the table is named table1):

    Select address From table1;

    The result is the address from every single row. You could use the same structure to pull all of the clientID values or any of the other columns from the table.

    Don’t Forget The Semi-Colon

    The semi-colon in SQL marks the end of a statement, so you’ll need to place it at the end of your SQL statement.

    The Where Clause Makes Things Happen

    While select and from let you pick which columns and which table you want to work with, the where clause lets you filter which rows come back. There are numerous operators that the where clause can use to help you make things happen:

      • Testing for equality, non-equality, and greater than or less than (=, <>, >, <).
      • Combining several conditions with AND and OR.
      • Using mathematical calculations inside a condition.

    Grouping data is handled by a separate clause, GROUP BY, rather than by where.

    The list goes on and on for how you can incorporate where into SQL, but just remember that it lets you narrow the data down to exactly the rows you need.

    Taking the Basics and Applying Them to Data

    Our introduction to SQL was deliberately simplified and aimed at those who have never used SQL. If you’re eager to learn more, we recommend a free MOOC provided by Stanford Lagunita. I’ve personally taken this course, and it’s an excellent introduction to SQL!

    At the beginning of this article, we touched on some of the ways you might use SQL in Data Science or Bioinformatics careers. Ultimately, it depends on the particular job. You may never touch SQL in your future career. Or, on the contrary, you may find that your position uses SQL extensively alongside other programming languages and software. Because you can’t predict whether you’re going to need SQL, it’s worth knowing it at some level. At a minimum, I would recommend having a basic understanding of SQL and of how to do simple database analysis and alterations. Again, you never know when this will come in handy.

    Some employers may also not care about the specific programming language, as long as you can conduct the data analysis that they need. Whether you’re using Python, Perl or SQL doesn’t matter nearly as much as whether or not you can perform the tasks.

    Final Thoughts

    If you have used any other SQL MOOCs or have any useful materials for beginners trying to grasp this language, feel free to post the resources below! Be sure also to join us over at the YDSOA Community Forums, where we discuss a wide variety of related topics!
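To tie the select, from, and where statements together, here is a minimal sketch that runs them through Python's built-in sqlite3 module, so no separate DBMS needs to be installed. The table and column names follow the article; the four client rows are invented for illustration, and other database systems may differ in small details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (clientID INTEGER, address TEXT, city TEXT, state TEXT)"
)
# Hypothetical rows, made up for this example.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (1, "12 Oak St", "Ames", "IA"),
        (2, "48 Elm Ave", "Toledo", "OH"),
        (3, "7 Pine Rd", "Des Moines", "IA"),
        (4, "301 Main St", "Davis", "CA"),
    ],
)

# select + from: one column, every row.
addresses = [
    row[0]
    for row in conn.execute("SELECT address FROM table1 ORDER BY clientID;")
]

# where with equality: only clients in Iowa. The ? placeholder keeps the
# value separate from the SQL text.
iowa_clients = conn.execute(
    "SELECT clientID, city FROM table1 WHERE state = ? ORDER BY clientID;",
    ("IA",),
).fetchall()

# where combining a comparison and a non-equality test with AND.
later_non_iowa = conn.execute(
    "SELECT clientID FROM table1 "
    "WHERE clientID > 1 AND state <> 'IA' ORDER BY clientID;"
).fetchall()

print(addresses)
print(iowa_clients)
print(later_non_iowa)
```

Note that where only filters which rows are returned; the table itself is left unchanged.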
 

Scikit-ribo: Aminoacyl Site Prediction and Translational Efficiency Estimation Software

A new open-source software package has been developed by researchers from Cold Spring Harbor Laboratory and Johns Hopkins University to address the need for more accurate measurement of protein translation. Named Scikit-ribo, the tool enables accurate aminoacyl site prediction and estimates of translational efficiency from Ribo-seq and RNA-seq data. The software can be downloaded from the Scikit-ribo GitHub page, https://github.com/hanfang/scikit-ribo. Additionally, a preprint describing Scikit-ribo was posted on the bioRxiv preprint server at the following link: http://www.biorxiv.org/content/early/2017/06/27/156588

Basil


 

PathwayMapper: A New Web Editor for Cancer Pathways

Although there are many visualization tools that allow researchers to explore and analyze their cancer datasets, not many of them provide simplified diagrams. To answer the need for simplification, researchers from Bilkent University, Cornell University, and Oregon Health and Science University have developed a new online tool. Named PathwayMapper, the web editor allows researchers to create clearer pathways and diagrams, similar to those found in The Cancer Genome Atlas. The online tool can be accessed via the PathwayMapper website (www.pathwaymapper.org) or downloaded from PathwayMapper's GitHub page.

Our Interaction with PathwayMapper

Our brief interaction with the web version of PathwayMapper showed us just how easy it is to develop sophisticated pathways while keeping things easy to follow and understand. Users can quickly select node types from a palette, such as gene, complex, family, etc. Another palette lets you create the various interactions between those nodes. Additionally, the software lets you customize the layout, which gives you control over the entire design of the pathway.

Basil


 

OrthoReD: Bioinformatic Orthology Prediction Tool With Low Computational Demand

Finding orthologous genes is a major step in phylogenetics, but many orthology tools require extensive computational resources, which makes this step challenging for researchers. Recently, scientists from the University of California, Davis, including its Department of Plant Sciences, released an orthology prediction tool that can help alleviate this computational burden. The tool is called OrthoReD and was recently published in BMC Bioinformatics by Kai Battenberg, Ernest Lee, Dr. Joanna Chiu, Dr. Alison Berry, and Dr. Daniel Potter.

When benchmarked against other published orthology prediction tools, OrthoReD produced similar biological results while reducing the amount of computational resources needed. An image that gives an overview of how OrthoReD functions can be seen below:

More information on OrthoReD can be found in the BMC Bioinformatics journal article: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1726-5

YDSOA


 

Difference Between Computational Biology and Bioinformatics

Bioinformatics and Computational Biology are two closely related fields, and as such, many people in the scientific and academic worlds use the two terms interchangeably. But is this truly accurate? Are these fields so close together that we can lump them into one big category, or are there significant differences that need to be discussed and understood? In this article, we'll go over what these two fields have in common, how they differ, and which path would be most suitable for you, depending on your research interests.

Difference Between Computational Biology and Bioinformatics: What's the Major Difference?

As we mentioned above, to some people, there is no difference between Computational Biology and Bioinformatics. To others, such as Dr. Russ Altman of Stanford University, there is a very concrete difference between the two. He believes that Bioinformatics is where you create the tools, software, and algorithms used to handle and work with large biological datasets. Likewise, in his mind, Computational Biology is all about learning and studying biology by using the computational tools and software made by Bioinformaticians.

So, according to Dr. Altman's definition, if you primarily enjoy being on the creation side of things and want to add to the tools and resources available for analyzing biological data, Bioinformatics would be the path for you. But if you'd rather use existing computational tools to study and understand biology better, then you'd probably want to lean towards Computational Biology.

Computational Biology vs Bioinformatics Academic Programs

Luckily for those of you who go the Computational Biology and Bioinformatics academic route, most graduate programs combine these two fields into a single degree program, which gives you more flexibility in figuring out exactly which niche fits you best.

But what happens if you're interested in programs that are either Bioinformatics or Computational Biology instead of both? Which one should you choose? If this is your situation, you shouldn't make a decision based on the name. Instead, look at the faculty members of each program, and see what kind of research they are doing. You might find that a Bioinformatics-based program has a lot of Computational Biology research, and vice versa. By looking into the faculty and the type of research being conducted at prospective universities, you can get an idea of what kind of research you might be involved in if you choose that specific school.

Which One Should You Pursue?

Choosing whether you want to become a Bioinformatician or a Computational Biologist comes down to figuring out whether you want to be at the forefront of creating computational software for biology or would rather use these tools to conduct your research. But just remember, things aren't as cut and dried as we've made them out to be, and you may still see job positions or academic programs use these two terms interchangeably. Instead of relying on the title to decide where you want to go, figure out exactly what your research interests are and which program or position does the best job of serving them.

About the Author

Basil Khuder is the director and founder of YDSOA. He started YDSOA in 2015, hoping to create an online community for those new to the fields of Data Science and Informatics. When he's not running the organization, he's busy with his research and studies as a Doctoral Bioinformatics student at Iowa State University. You can follow Basil through any of his social media accounts.

Basil


 

High-throughput Assay to Determine PCR Errors

Polymerase Chain Reaction (PCR) is an extremely popular molecular biology technique. However, errors can be introduced throughout the PCR process, and it can be difficult to find methods that detect and measure them. Recently, researchers at Pirogov Russian National Research Medical University in Moscow, Russia demonstrated a high-throughput assay that can quantify errors within PCR. Their findings were published in Scientific Reports, in a paper titled A High-Throughput Assay for Quantitative Measurement of PCR Errors. In the paper, Shagin et al. describe their five-step protocol for the high-throughput sequencing assay for quantification of errors in PCR. A schematic image of this protocol can be seen below:

The full protocol, alongside the researchers' article, can be seen at the following link: https://www.nature.com/articles/s41598-017-02727-8

YDSOA


 

Intervene: Software Package for Genomic Visualization 

A brand new bioinformatics software package has been released that aims to help researchers visualize multiple sets of genes or genomic regions. The software is called Intervene, and it was just published in the journal BMC Bioinformatics. The creators, Dr. Aziz Khan and Dr. Anthony Mathelier from the University of Oslo, state that their aim in creating the software was to address a gap in current visualization and intersection software.

Full information on how to install and use the software can be found at Intervene's documentation website.

YDSOA


 

Transcriptomics

Introduction to Transcriptomics

Transcriptomics is the study of all of the transcripts produced by a single cell, individual or population. It has gained much traction since the creation of RNA-Seq, a Next-Generation Sequencing method that allows for high-throughput analysis of transcripts. But the question remains: what can we learn as researchers from studying RNA and transcripts that we couldn't learn by looking at the DNA level?

Why Transcripts?

There was a time when scientists believed that anything that didn't code for a protein was junk. This misbelief meant that we only cared about transcripts that were being translated into proteins. Over time, researchers began to realize that non-coding regions of the genome were not junk and held significant, biologically functional roles. For example, we now know introns play vital roles in gene regulation, so if we disregard all of the non-protein-coding regions, we miss out on a lot of relevant information. Because of this newfound understanding, science has seen a substantial increase in researchers harnessing the power of Next-Generation Sequencing, especially RNA-Seq.

Transcriptomic Software and Tools

We already mentioned that RNA-Seq is one of the primary methods for finding out all of the RNA that a particular cell or tissue produces. But you'll need a downstream pipeline or software tool to process all the information it generates. We've compiled a list of software that can be used when studying transcriptomics.

RNA-Seq by Expectation Maximization: RSEM is a software package that lets users estimate expression levels for the transcripts present in their data. If you're using RNA-Seq data, there's a pipeline available that handles both the alignment of your reads and the expression estimation. Once the pipeline is run, RSEM outputs the estimated expression level of each transcript and gives you valuable visualization tools based on your data as well.

Trinity RNA-Seq: Trinity is a transcriptome assembly and annotation software package. It allows for de novo transcriptome assembly based on RNA-Seq data. Some of the downstream analyses that it provides include:

  • Quantifying the abundance of genes and transcripts
  • Checking the quality of samples and replicates
  • Conducting differential gene expression analysis

VennBLAST: VennBLAST is a transcriptome tool that allows for visual comparison of transcriptomes across samples. The researchers who created VennBLAST refer to it as a downstream transcriptome tool. Specifically, they state the following:

Basil


 

Intricacies in Arrangement of SNP Haplotypes Suggest “Great Admixture” That Created Modern Humans

In the latest online publication of BMC Genomics, researchers at The University of Toledo present their bioinformatic approach to deciphering human relatedness and ancestry. The research article, led by Dr. Alexei Fedorov and his doctoral student, Rajib Dutta, is titled Intricacies in Arrangement of SNP Haplotypes Suggest “Great Admixture” That Created Modern Humans. Using haplotypes built from common SNPs, together with computer simulation, they postulate that a "Great Admixture" event occurred that created modern-day humans. They believe that this mixture occurred somewhere between 100 and 300 thousand years ago between two ancestral populations.

Be sure to read their full article: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3776-5

YDSOA


 


Is Data Science a Good Career?

The buzz around Data Science continues to grow astronomically. Almost every month you’ll see an article on Forbes or Indeed discussing how great a career Data Science is. But just because these websites claim that this is a good job doesn’t mean it’s the best career for you. A lot of factors come into play in deciding whether you should pursue a career in Data Science.

Gone are the days when tech jobs were only available at Google and Facebook. Today, almost all industries need to hire tech employees. Companies are drowning in a tidal wave of data, which is an invaluable asset for drafting business strategies. As a result, companies need to hire Data Scientists to manage it, analyze it, and use it to identify, predict and solve problems.

With the demand for data-savvy professionals increasing at an ever faster rate, McKinsey & Company has projected a global excess demand for 1.5 million new data scientists. It also projects a talent gap of 140,000 to 190,000 qualified data science workers by 2018. According to Glassdoor’s list of best jobs for Work-Life Balance, Data Scientist is the best job in America for 2016. One can expect a median base salary of $116,840, with plenty of job openings available.

But what does a work day of a data scientist look like? Are they just confined to an office crunching numbers for the rest of their working life? Not exactly. Data Scientists are constantly trying to predict the future by using numbers. They work on their clients’ or employers’ problems and replicate models to solve them. What you offer as a data scientist is a comprehensive analysis of the customer’s whole business. This versatility means constant movement and frequent discussions with employees at all levels of your company. So sure, you’ll have a desk with a fancy computer to get the job done, but don’t think of Data Scientists as having your stereotypical 9-5 desk job.
The Day-to-Day Activities

Although we just gave you a pretty decent primer on the buzz around Data Science, we haven’t quite answered the question of whether or not Data Science is a good career. From the perspective of an outsider, Data Science screams loads of mathematics and science. However, if outsiders took a look at the job sites, they might be shocked at first to find qualities such as ‘works well with others’ and ‘knows how to report and communicate’ as part of the job description. Since the role of a Data Scientist means working across the board with employees of all levels, it’s crucial that you be able to communicate properly. You might be the only Data Scientist in a company, and many of the people you work with may not have touched statistics or mathematics in years.

Communication is one of the most underrated skills for a Data Scientist. If you know you're not somebody who enjoys communicating sophisticated and intricate information to the masses, Data Science might not be the best career choice for you.

In other situations, Data Science might permeate into individual units. The chances are that you will be working with the marketing department, the product design department and even the sales department. You can expect to solve real-life problems by providing practical solutions. You should also be forward-thinking, as you will be using a large amount of data to solve real-time problems as they are happening.

So why are we telling you all of this? You must realize that deciding whether or not Data Science is an excellent career choice goes further than just knowing the science behind it. You must understand all the skills necessary and the day-to-day activities that it encompasses. If you know all the programming and statistics, but can’t properly communicate with others, this might not be the field for you.

The Programming

So we’ve gone over the outlook and a brief synopsis of the day-to-day activities for a Data Scientist. In other articles on YDSOA, we’ve touched on some of the programming and sciences that are needed for a successful Data Science career, including our Machine Learning and SQL articles. However, there are some more steps you can take to become familiar with the traditional software you’ll need to use for jobs in this field.

-R: Let’s start with R. R is one of the best places to start for those looking to get into Data Science because it has a very active community, and the software itself is free to use. R is traditionally used for statistical analysis, but can also be used for data mining and visualization. We’ll be rolling out an introductory post on the workings of R programming, but for now, this is a great online course to get your feet wet.

-Python: The second language that’s good to have some mastery of is Python. Python is currently one of the most popular programming languages in the world, and for good reason.
Its simplicity and the overwhelming amount of resources create a user-friendly environment.

-Perl: Perl was originally built in 1987 by a computer programmer named Larry Wall with the purpose of being able to process and handle massive amounts of text. It was at its most popular in the 1990s, and although it doesn’t have the following it once had, it remains a force that has stood the test of time in the world of programming. In addition to its powerful text processing tools such as regular expressions, it has several useful add-ons in its repertoire.

Besides the big three, I have listed some other languages and tools that would be helpful add-ons.

-Scala: The hottest language right now, ideal for working with real-time data. We’ll be touching more on Scala in a later article.

-SQL: SQL remains a powerful and easy-to-use programming language, mostly used in database management. Our full SQL introduction can be viewed here.

-Excel: Seeing Excel on this list may come as a bit of a surprise, but Excel remains one of the most useful pieces of software a Data Scientist can know. Its integration with VBA allows the user to conduct some extremely sophisticated analysis.

Is Data Science a Good Career?

Given that YDSOA focuses primarily on Data Science and Bioinformatics, you could say that we might be a little biased in our overall verdict on whether Data Science is a good career or not. However, we do believe we've presented some strong evidence of how great the opportunities are in the world of Data Science, and what an interesting career it truly can be. With that being said, you must understand all aspects that go into the job. Sure, knowing the programming and science behind this career is crucial, and you won’t get farther than a job interview without it. However, don’t underestimate the personal and communication side of things. Realize that you’ll be working with people from a broad spectrum, and knowing how to communicate with them properly will be crucial.

Rajib Dutta


Bioinformatics vs Data Science

The worlds of Bioinformatics and Data Science share a lot of commonalities. Although one of them (Bioinformatics) focuses more on the biological sciences than the other, they still use a lot of the same programming languages, software, and general principles. In this article, we go over the differences and similarities between Bioinformatics and Data Science and help you decide which path is right for you!

What is Bioinformatics? What is Data Science?

In a broad sense, Bioinformatics is the field involving the use of tools, software, and programming languages to understand and interpret biological data. Data Science is the field involving the use of similar tools and programs, but to understand data in general. In terms of programming languages, some examples of what Data Scientists and Bioinformaticians use include Python, Perl, and Java. For software and tools, some examples are R, SAS, Pandas, Apache Spark, and Tableau.

A generalized image to give an overview of Data Science vs Bioinformatics

Two Fields, One Common Goal

Although Bioinformatics and Data Science have many differences, there’s still much the same underlying goal: using algorithms, tools, and programs to understand and process data. Now if you are a Bioinformatician, that might mean using these instruments to help you understand biological data, whereas a Data Scientist may be using similar tools to understand business or marketing data. Does this mean only a Bioinformatician can analyze biological data? No! Both Data Scientists and Bioinformaticians can handle all types of data, but Bioinformaticians have more of a focus on biology than Data Scientists do.

Which Should You Major or Focus In?

Up until the last couple of years, there was no such thing as a Data Science degree or major. That has changed with the popularity of the field growing at astronomical levels.
The answer to whether you should major in Bioinformatics, Computational Biology or Data Science depends on what type of career you’d like to pursue. If you want to focus more on the biological science side of things, pursue Bioinformatics or Computational Biology, which gives you a firm grasp of the sciences needed to handle large biological data. If you want to focus purely on managing data for all disciplines, and have no interest in broadening your biological skill-set, err on the side of a Data Science degree.

Once again, we're not saying that Bioinformatics or Computational Biology majors/degrees do not give you ample knowledge for handling all types of data. However, a lot of your time in these programs is spent going over biological and chemical systems, so you need to have a passion for these fields, or else you won’t enjoy yourself. Just to elaborate on this point, my first year as a Bioinformatics Masters student included challenging courses on human genetics, molecular and cellular biology, and biological research methods. Someone without any passion for these subjects would have had a torturous time!

At the End of the Day, It's Not Your Degree; It's Your Skills

The biggest takeaway message we have is that it ultimately doesn’t matter what degree you choose, but what skill sets you gain from these majors. Are there jobs that require a particular type of degree? Absolutely. But these jobs are typically the exception. Instead, most jobs want a set of skills, which anybody can develop regardless of whether their major is in Data Science or Bioinformatics. Learn as much as you can and hone your skills, and you’ll find that you can make it in all sorts of data-oriented jobs.

About the Author

Basil Khuder is the director and founder of YDSOA. He started YDSOA in 2015, hoping to create an online community for those new to the fields of Data Science and Informatics. When he's not running the organization, he's busy with his research and studies as a Doctoral Bioinformatics student at Iowa State University. You can follow Basil through any of his social media accounts.

Basil


Different Types of Sequencing Files

When dealing with Next-Generation Sequencing data for the first time, you might be a little confused by all the different types of sequencing files that are out there. Although they may seem intimidating at first, spend a little bit of time around these files and you'll become a sequencing pro in no time!

FASTQ Format

FASTQ files are sometimes referred to as the raw sequencing reads. They are usually the file format that you receive from whatever company you have chosen to conduct the Next-Generation Sequencing of your samples (or from the machine itself, if you performed the sequencing). The reason we refer to them as raw reads is that the file has all of the reads from your data, without any additional processing conducted on them. The other file formats that we talk about later will have had something done to them, so as to change the way we can process the data.

The image below shows an extremely simplified view of how the FASTQ file comes to be. For example, let's say you are interested in getting heart tissue sequenced for your research. You isolate the heart tissue sample and send it off to a company to get it sequenced. Due to how sequencing is currently conducted by the most popular companies, the file that you end up getting will contain chunks of your original DNA sequence of a set length (anywhere between 75 and 200 base pairs), with a quality score right below each nucleotide. The quality score is a character that corresponds to a particular number. In our example, we have included the @ quality score, which has a value of 31.

Aligned Format Files: BAM and SAM

Raw sequencing files can give you an idea of the quality of the sequencing that was conducted and other general information about your data. But what if you wanted to find out how your heart tissue data was different from the tissue of other individuals? You would not be able to find this out by just analyzing your raw FASTQ file. This is where genomic alignment comes into play.
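Before moving on to alignment, here is a small sketch of how the FASTQ quality characters described above turn into numbers. It assumes the common Phred+33 encoding (the character's ASCII code minus 33), which is consistent with @ having a value of 31. The read name, bases, and quality string in the record are made up for illustration.

```python
def phred33_to_scores(quality_string):
    """Convert a FASTQ quality string to integer scores, assuming Phred+33."""
    return [ord(ch) - 33 for ch in quality_string]


# Hypothetical four-line FASTQ record: header, bases, separator, qualities.
record = "\n".join([
    "@read_1",
    "GATTACAGATTACA",
    "+",
    "@@@@@@@@@@@@@@",
])

header, bases, separator, qualities = record.split("\n")
scores = phred33_to_scores(qualities)

# Pair each nucleotide with the quality score that sits "right below" it.
for base, score in zip(bases, scores):
    print(base, score)
```

Each nucleotide in the read lines up with exactly one quality character, so the two strings always have the same length.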
Genomic alignment is the process of taking your raw sequencing data and aligning it to a reference genome. (If you don't know what a reference genome is, it's an assembled genome sequence that is representative of a particular species.) The SAM file, which stands for Sequence Alignment/Map, will have all the reads of your data, just like the FASTQ file had, but it will also record where each read lines up on the reference genome. In our simplified picture, we show the reference nucleotide right below each nucleotide of the read. A BAM file holds the same information as a SAM file, just in a compressed binary form.

So, going back to our example data, if we had aligned it to a reference genome, we may see something like this:

As you can see, all of our data matches the reference, besides the bolded G. So what does this mean? It could be that at that position, our data has a single nucleotide polymorphism, or it could be a sequencing error.

Variant Call Format Files: VCF

We just mentioned that comparing our data to a reference genome is useful in finding how our data differs from the consensus genomic sequence. At this stage, you could use a genome viewer, such as the Integrative Genomics Viewer, and manually analyze these differences. Or, you could run something called variant calling, and produce a list of all of the variants that are present, in a file format called Variant Call Format, or VCF. A VCF file will tell you the exact position of each variant, what the allele is in the consensus genome (the reference allele), and what the allele is for your individual (the alternative allele). An example VCF file is shown below:

The first column of a VCF file is the chromosome. Depending on what reference genome was used for alignment, you may get the chromosome listed as in the image (with the chromosome abbreviation, chr, followed by the number of the chromosome), or you may only get the chromosome number. The second column has the actual position within the specified chromosome. The third column in our example has a period, but VCF files will often have a variant identification number, denoted as a SNP id, in this column, which means that this variant has been identified and is listed within various databases. The fourth column is the reference allele that we referred to above, while the fifth column is the alternative allele. The last two columns both contain tidbits of information that we will discuss in a later article. For now, just know that the sixth column holds a variant quality score, while the seventh column records whether that variant passed or failed the filters used to remove false positives.
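To make the column layout above a bit more concrete, here is a rough sketch that reads the seven columns just described from a couple of VCF-style lines. Every value in the sample lines, including the rs identifier, is invented for illustration, and the parser is deliberately minimal. Real VCF files carry additional columns after these seven (starting with INFO), and in practice you would likely reach for a dedicated VCF library instead.

```python
# Hypothetical VCF content: header lines start with '#', data lines are tab-separated.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]),
    "\t".join(["chr1", "10583", ".", "G", "A", "50", "PASS"]),
    "\t".join(["chr2", "20412", "rs0000001", "T", "C", "12", "LowQual"]),
]


def parse_vcf_lines(lines):
    """Parse the first seven columns of each VCF data line into a dict."""
    variants = []
    for line in lines:
        # Skip blank lines and the '#'-prefixed header lines.
        if not line or line.startswith("#"):
            continue
        chrom, pos, var_id, ref, alt, qual, filt = line.split("\t")[:7]
        variants.append({
            "chrom": chrom,                                  # column 1: chromosome
            "pos": int(pos),                                 # column 2: position
            "id": None if var_id == "." else var_id,         # column 3: SNP id, if any
            "ref": ref,                                      # column 4: reference allele
            "alt": alt,                                      # column 5: alternative allele
            "qual": None if qual == "." else float(qual),    # column 6: quality score
            "filter": filt,                                  # column 7: filter status
        })
    return variants


variants = parse_vcf_lines(vcf_lines)
passed = [v for v in variants if v["filter"] == "PASS"]

for v in passed:
    print(v["chrom"], v["pos"], v["ref"], ">", v["alt"])
```

Filtering on the seventh column, as in the last few lines, is a simple way to keep only the variants that passed.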

Basil
