Navigating the Bioinformatics Workflow for Whole Exome Sequencing: A Step-by-Step Guide
Summary
Next-generation sequencing (NGS), which generates millions to billions of sequence reads in a single run, has greatly accelerated genomics research. Current NGS platforms include Illumina, Ion Torrent/Life Technologies, 454/Roche, Pacific Biosciences, Nanopore, and GenapSys. They produce reads of 100–10,000 bp in length, enabling sufficient coverage of the genome at a relatively low cost. But faced with this enormous amount of sequence data, how do we best handle it? And what are the most appropriate computational methods and analysis tools for the task? In this review, we focus on the bioinformatics pipeline of whole exome sequencing (WES).
- Author Name: Dianna Gellar
Whole exome sequencing is a genomic technique for sequencing the exome, i.e., all protein-coding regions of the genome. It is widely used in basic and applied research, especially in the study of Mendelian diseases. You can read the article on the principle and workflow of whole exome sequencing to learn more about WES. A typical WES analysis workflow includes these steps: raw data quality control, preprocessing, sequence alignment, post-alignment processing, variant calling, variant annotation, and variant filtration and prioritization.
Raw data quality control
Sequence data are generally stored in two standard formats: FASTA and FASTQ. FASTQ files additionally store Phred-scaled base quality scores, giving a per-base measure of sequence quality, and are therefore widely accepted as the standard format for NGS raw data. Multiple tools have been developed to assess the quality of NGS raw data, such as FastQC, FastQ Screen, FASTX-Toolkit, and the NGS QC Toolkit.
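As a concrete illustration of why FASTQ is preferred, Phred-encoded quality strings can be decoded into per-base quality scores. A minimal Python sketch, assuming the Phred+33 offset used by modern Illumina data:

```python
# Decode a FASTQ quality string (Phred+33 encoding) into per-base
# quality scores and the error probabilities they imply.

def phred33_to_scores(qual_string):
    """Return the Phred quality score for each base call."""
    return [ord(ch) - 33 for ch in qual_string]

def error_probability(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

scores = phred33_to_scores("II5!")
print(scores)                        # [40, 40, 20, 0]
print(error_probability(scores[0]))  # 0.0001, i.e. 1 error in 10,000 calls
```

A Q40 base ('I') is very reliable, while a Q0 base ('!') carries no information, which is exactly the distinction FASTA cannot express.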
Key QC parameters to review include:
Base quality score distribution
Sequence quality score distribution
GC content distribution
Sequence duplication level
PCR amplification bias
Over-represented k-mers
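Several of these parameters are simple to compute in principle. A toy sketch of two of them, GC content and sequence duplication level, on made-up reads (not how FastQC actually implements them):

```python
from collections import Counter

def gc_content(read):
    """Fraction of G and C bases in a read."""
    return sum(base in "GC" for base in read) / len(read)

def duplication_level(reads):
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(reads)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(reads)

reads = ["ACGT", "ACGT", "GGCC", "ATAT"]
print([gc_content(r) for r in reads])  # [0.5, 0.5, 1.0, 0.0]
print(duplication_level(reads))        # 0.25, one duplicated read out of four
```

In a real WES run, a GC distribution far from the expected exome profile or an unusually high duplication level flags library-preparation or PCR problems.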
With a comprehensive QC report (which generally involves the above parameters), researchers can determine whether data preprocessing is necessary. Preprocessing steps generally involve 3’ end adapter removal, low-quality or redundant read filtering, and undesired sequence trimming. Several tools can be used for data preprocessing, such as Cutadapt and Trimmomatic. PRINSEQ and QC3 can achieve both quality control and preprocessing.
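Conceptually, the two core preprocessing steps performed by tools like Cutadapt and Trimmomatic can be sketched as follows. This is a deliberately simplified Python illustration, not those tools' actual algorithms; the adapter sequence and quality threshold are hypothetical:

```python
def trim_adapter(read, adapter):
    """Remove a 3' adapter: cut the read at the first adapter occurrence."""
    idx = read.find(adapter)
    return read[:idx] if idx != -1 else read

def quality_trim(read, quals, min_q=20):
    """Trim low-quality bases from the 3' end (fixed-threshold version;
    real trimmers use sliding windows or running-sum algorithms)."""
    end = len(read)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return read[:end], quals[:end]

print(trim_adapter("ACGTACGTAGATCGGA", "AGATCGGA"))        # ACGTACGT
r, q = quality_trim("ACGTTT", [38, 38, 38, 38, 10, 5])
print(r)                                                   # ACGT
```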
Sequence alignment
Several algorithms exist for short-read mapping, including the Burrows-Wheeler Transform (BWT) and Smith-Waterman (SW) algorithms. Bowtie2 and BWA are two popular short-read alignment tools that implement the BWT algorithm. MOSAIK, SHRiMP2, and Novoalign are important short-read alignment tools based on the SW algorithm, offering increased accuracy. Additionally, multithreading and MPI implementations allow a significant reduction in runtime. Of the tools mentioned above, Bowtie2 stands out for its fast running time, high sensitivity, and high accuracy.
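To illustrate the transform that BWT-based aligners build their indexes on, here is a naive Python implementation using sorted cyclic rotations. Real aligners construct the index far more efficiently via suffix arrays; this sketch only shows what the transform produces:

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted cyclic rotations.
    Naive O(n^2 log n) construction, for illustration only."""
    text += "$"  # sentinel that sorts before every other character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("ACGTACGT"))  # TT$AACCGG
```

The output groups identical characters into runs, which is what makes the transform compressible and, combined with an FM-index, searchable in time proportional to the read length rather than the genome length.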
Post-alignment processing
After read mapping, the aligned reads are post-processed to remove undesired reads or alignments, such as reads exceeding a defined size and PCR duplicates. Tools such as Picard MarkDuplicates and SAMtools can distinguish PCR duplicates from unique DNA fragments. The next step is to improve the quality of gapped alignments via indel realignment; some aligners (such as Novoalign) and variant callers (such as GATK HaplotypeCaller) incorporate this indel alignment improvement. After indel realignment, base quality score recalibration (BQSR, via BaseRecalibrator from the GATK suite) is recommended to improve the accuracy of base quality scores prior to variant calling.
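The duplicate-marking step can be sketched as follows. This is a simplified illustration of the idea behind Picard MarkDuplicates, which keys reads on their alignment position and strand; the real tool also considers mate positions and base-quality sums when choosing which copy to keep:

```python
def mark_duplicates(alignments):
    """Flag PCR duplicates: reads sharing (chrom, pos, strand) are
    duplicates; the one with the highest mapping quality is kept."""
    best = {}
    for aln in alignments:
        key = (aln["chrom"], aln["pos"], aln["strand"])
        if key not in best or aln["mapq"] > best[key]["mapq"]:
            best[key] = aln
    for aln in alignments:
        key = (aln["chrom"], aln["pos"], aln["strand"])
        aln["duplicate"] = aln is not best[key]
    return alignments

alns = [
    {"chrom": "chr1", "pos": 100, "strand": "+", "mapq": 60},
    {"chrom": "chr1", "pos": 100, "strand": "+", "mapq": 37},  # PCR duplicate
    {"chrom": "chr1", "pos": 250, "strand": "-", "mapq": 60},
]
print([a["duplicate"] for a in mark_duplicates(alns)])  # [False, True, False]
```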
Variant calling
Variant analysis is important for detecting different types of genomic variants, such as SNPs/SNVs, indels, CNVs, and larger structural variants (SVs), especially in cancer studies. It is vital to distinguish somatic from germline variants: somatic variants are present only in somatic cells and are tissue-specific, while germline variants are inherited mutations present in the germ cells and linked to the patient’s family history. Variant calling identifies SNPs and short indels in exome samples. Common variant calling tools are listed in Table 1. Several studies have evaluated these variant callers: Liu et al. recommended GATK, and Bao et al. recommended the combination of Novoalign and FreeBayes.
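To make the idea of variant calling concrete, here is a toy allele-fraction caller operating on a pileup of observed bases at one position. Production callers such as GATK HaplotypeCaller and FreeBayes use local haplotype assembly and Bayesian genotype likelihoods instead; the thresholds below are purely illustrative:

```python
def call_variant(ref_base, pileup, min_depth=10, min_fraction=0.2):
    """Naive SNV call: report the most frequent non-reference base
    if coverage is sufficient and its allele fraction is high enough."""
    depth = len(pileup)
    if depth < min_depth:
        return None  # too little coverage to call anything
    counts = {}
    for base in pileup:
        counts[base] = counts.get(base, 0) + 1
    alt, alt_count = max(
        ((b, n) for b, n in counts.items() if b != ref_base),
        key=lambda item: item[1],
        default=(None, 0),
    )
    if alt and alt_count / depth >= min_fraction:
        return {"ref": ref_base, "alt": alt, "af": alt_count / depth}
    return None

pileup = "AAAAAAGGGG" * 2  # 12 reference 'A' reads, 8 alternate 'G' reads
print(call_variant("A", pileup))  # {'ref': 'A', 'alt': 'G', 'af': 0.4}
```

An allele fraction near 0.5 suggests a heterozygous germline variant, while much lower fractions are typical of somatic variants in impure tumor samples, which is one reason somatic callers need more sensitive statistical models.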
Variant annotation
After variants are identified, they need to be annotated for a better understanding of disease pathogenesis. Variant annotation generally includes information about genomic coordinates, gene position, and mutation type. Many studies focus on the non-synonymous SNVs and indels in the exome, which account for approximately 85% of known disease-causing mutations in Mendelian disorders and many mutations in complex diseases.
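Whether a coding SNV is synonymous or non-synonymous follows directly from the codon table, as this small sketch shows (the table is truncated to the few codons used in the example; a full table has 64 entries):

```python
# Truncated standard codon table, amino acids in three-letter code.
CODON_TABLE = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GAT": "Asp", "GAA": "Glu",
}

def classify_snv(codon, offset, alt_base):
    """Classify a coding SNV by comparing the amino acids encoded
    before and after the single-base substitution."""
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    return "synonymous" if before == after else "non-synonymous"

print(classify_snv("GCT", 2, "C"))  # synonymous: GCT and GCC both encode Ala
print(classify_snv("GAT", 2, "A"))  # non-synonymous: Asp changes to Glu
```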
Many databases provide information about variants beyond the basic annotation. ANNOVAR is a powerful tool that draws on data from many public resources, such as dbSNP, the 1000 Genomes Project, and NCI-60 human tumor cell line panel exome sequencing data, to annotate variants. It can report minor allele frequencies (MAF), deleteriousness predictions, conservation of the mutated site, experimental evidence for disease variants, and prediction scores from GERP, PolyPhen, and other programs. Other commonly used databases include OncoMD, OMIM, SNPedia, 1000 Genomes, dbSNP, and personal genome variant collections.
Variant filtration and prioritization
WES can generate thousands of candidate variants. This number can be reduced by variant prioritization to produce a short, ranked list of candidate mutations for further experimental validation. Variant prioritization involves three steps: 1) removal of less reliable variant calls; 2) depletion of common variants (under the assumption that rare variants are more likely to cause disease); and 3) prioritization of variants relevant to the disease using discovery-based and hypothesis-based approaches. Available tools for variant filtration and prioritization include VAAST2, VarSifter, KGGseq, PLINK/SEQ, SPRING, the GUI tool gNOME, and Ingenuity Variant Analysis.
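The three steps above can be sketched as a simple filter chain. The quality and MAF thresholds here are illustrative placeholders, not recommended values:

```python
def filter_and_prioritize(variants, min_qual=30.0, max_maf=0.01):
    """Sketch of variant filtration and prioritization:
    1) drop low-confidence calls, 2) drop common variants by
    population MAF, 3) rank the rest so the rarest come first."""
    confident = [v for v in variants if v["qual"] >= min_qual]
    rare = [v for v in confident if v["maf"] <= max_maf]
    return sorted(rare, key=lambda v: v["maf"])

variants = [
    {"id": "rs_common", "qual": 99.0, "maf": 0.30},    # common polymorphism
    {"id": "candidate", "qual": 80.0, "maf": 0.0005},  # rare, well supported
    {"id": "artifact",  "qual": 12.0, "maf": 0.0001},  # likely sequencing error
]
print([v["id"] for v in filter_and_prioritize(variants)])  # ['candidate']
```

Real prioritization tools add the third, disease-specific layer on top of this, weighting variants by gene function, inheritance model, and phenotype match.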
Whole exome sequencing could become a standard method for diagnosing disease and guiding treatment in the next few years, and many medical centers have already adopted genetic testing based on NGS technologies such as WES. The next challenge will be managing data comprising millions of genomic variants, and integrating genomic variants with clinical records and patient information. CD Genomics offers a full range of whole exome sequencing services, including sample preparation, exome capture, library construction, high-throughput sequencing, raw data quality control, and bioinformatics analysis.