
frequently asked questions


Ideas and breakthroughs fueled by computational prediction and next generation sequencing.

Our group has been contributing to different research projects by providing support and expertise in programming and advanced data analysis, focusing primarily on high-throughput genomics technologies. These include microarrays, genotyping, and next-generation sequencing (RNA-seq, ChIP-seq, SNP-seq, etc.). We also provide virtual server environments, secure public web portals, a large suite of open source applications, and hands-on tutorials and workshops on a wide variety of informatics topics, as well as custom data analysis and consultation services. We commit to the establishment of research collaborations with scientists from different departments.


Frequently Asked Questions


How much sequencing depth (number of reads) is needed for a Differential Gene Expression (DEG) analysis?

  1. a. The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample.
  2. b. Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M paired-end reads of length > 30 nt, of which 20-25M are mappable to the genome or known transcriptome).
  3. c. Experiments whose purpose is the discovery of novel transcribed elements and robust quantification of known transcript isoforms require more extensive sequencing. The ability to reliably detect low copy number transcripts / isoforms depends upon the depth of sequencing and on a sufficiently complex library.
  4. d. For experiments from a typical mammalian tissue, or in which sensitivity of detection is important, a minimum depth of 100-200M 2 x 76 bp or longer reads is currently recommended.
  5. e. Specialized studies in which the prevalence of different RNAs has been intentionally altered (e.g. “normalizing” using DSN) as part of sample preparation need more reads than the amount (>30M paired-end reads) used for simple comparison. Reasons for this include: (1) overamplification of inserts as a result of an additional round of PCR after DSN, and (2) much broader coverage given the nature of A(-) and low abundance transcripts.

Paired-end or Single-end reads?

  1. a. Single-end is fine for simple RNA-Seq analysis.
  2. b. Paired-end is required if we are looking for novel transcripts. Paired-end RNA sequencing (RNA-Seq) enables discovery applications such as detecting gene fusions in cancer and characterizing novel splice isoforms. For paired-end RNA-Seq, use the TruSeq RNA Library Prep Kits with an alternate fragmentation method, followed by standard Illumina paired-end cluster generation and sequencing.

In terms of coverage / amount of data generated, how does one lane of NextSeq compare with one lane of HiSeq?

The HiSeq maximum yield is about 200M reads. Roughly, 160M to 180M reads is a safe count, which is enough for 4- to 6-plex at 30M reads per RNA sample. The NextSeq has a maximum of about 400M reads; figure 360M to be safe, which is enough for 10- to 12-plex at 30M reads per RNA sample. So yes, the NextSeq can handle twice the plex level of the HiSeq. It also produces 75 bp reads as opposed to 50 bp reads, so it is cheaper per base.
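
The plex levels quoted above come from dividing a conservative yield by the number of reads wanted per sample. A minimal Python sketch of that arithmetic, using the figures from the answer above (the function name max_plex is only illustrative):

```python
def max_plex(yield_millions, reads_per_sample_millions):
    """Number of samples that fit at the requested depth (whole samples only)."""
    if reads_per_sample_millions <= 0:
        raise ValueError("reads_per_sample_millions must be positive")
    return int(yield_millions // reads_per_sample_millions)


# Conservative yields quoted above, in millions of reads.
print("HiSeq:", max_plex(180, 30), "samples at 30M reads each")
print("NextSeq:", max_plex(360, 30), "samples at 30M reads each")
```

Rounding down is deliberate: a partially sequenced sample would fall below the target depth.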

Length of reads?

50 bp is sufficient for simple RNA-Seq analysis.

How many biological replicates are needed?

At least 3 (more if you can afford them).

Do we need to include spike-in controls for RNA-seq normalization?

There is no standard on whether or not to use spike-in controls; it varies from case to case. During sequencing there are many unknown factors that one cannot control, so including a spike-in is useful: it allows you to evaluate the performance of library preparation and sequencing, and thus to make less biased calls for differential gene expression in later stages. However, some studies report that spike-in-controlled data may be biased to various degrees unless the experiments are carefully designed to preserve and capture the true biological variation in your samples. As spike-in datasets represent only technical replicates with minimal variation, they should be complemented with more practical comparisons in real datasets with true biological replicates. For example, one study reports that spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. One advantage of using spike-in controls for normalization is the possibility of relaxing the common assumption that the majority of genes are not DE between the conditions under study. So if you want to use spike-ins for normalization, two conditions must be satisfied: (i) spike-in read counts are not affected by the biological factors of interest, and (ii) the unwanted variation affects the spike-in and gene read counts similarly.

Do not confuse the two kinds of spike-in: PhiX (used for quality control during sequencing) and the spike-ins used for normalization purposes. To some extent, both are useful for quality control and for library-size normalization. The External RNA Control Consortium (ERCC) has developed a set of 92 synthetic spike-in standards that are commercially available and relatively easy to add to a typical library preparation. You can select various formulation mixes of controls from these kits (or have them custom made) and use them in your experiments. Ambion (Life Technologies) is the only company that manufactures such spike-in controls for normalization purposes. You can find step-by-step details here on how and when to use spike-ins. The main idea is to achieve a standard measure for data comparison across gene expression experiments, to measure the sensitivity (lower limit of detection) and dynamic range of an experiment, and to quantitate differential gene expression.

How important is the use of Spike-ins?

From the point of view of DE analysis, the between-sample normalization methods that we employ during downstream bioinformatics analysis work properly because we assume that the majority of genes are NOT differentially expressed between the conditions under study, a reasonable assumption in most applications. Note that in practice, most of these normalization procedures (bioinformatics-based, not spike-in-based) work well even when a high proportion of genes are DE, provided that they are roughly equally distributed between up- and down-regulation. However, in cases where there is a major global shift in expression, the usual between-sample normalization procedures will fail. In this case, normalization based on control sequences (spike-ins) may be the only option. There is a very nice book chapter that explains in detail the role of spike-ins in the normalization of RNA-seq data.

Is it standard to use spike-ins in every RNA-seq experiment?

It depends on the case. There are very few reports in the literature where spike-ins were actually used in real experiments, perhaps because time and money do not always allow it. This is surprising, because most of the major RNA-seq guidelines and forums strongly recommend using spike-ins when designing and sequencing the experiment. The approach is strongly recommended by the ENCODE consortium. A paper published in 2016 (A survey of best practices for RNA-seq) also recommends including spike-ins for both quality control and library-size normalization. Other good reads are (Synthetic spike-in standards for RNA-seq experiments) and (Revisiting global gene expression analysis).

In any case, it is essential to ensure that spike-in standards behave as expected and to develop a set of controls that are stable enough across the replicate libraries and robust to both the differences in library composition as well as library preparation methods.

How do we check the quality and trim/filter low quality reads?

One can run FastQC and the FASTX-Toolkit.

Reference genome or Transcriptome?

  1. a. Depends on the purpose of the experiment.
  2. b. Build a reference transcriptome if one is not available (Trinity, Trans-ABySS, Velvet / Oases).
  3. c. What if I don’t have a good reference genome?
    To determine whether the reference genome is well assembled and annotated, one approach is to take a random sample of raw reads (100 or so) from both the first and second reads of your paired-end data and BLAST them. If the genome assembly is poor, you will see that one read of a pair maps to a different scaffold than its mate. In paired-end RNA-seq data, the two reads of a pair should usually be less than 1000 bases apart on the same chromosome, and often much closer. If this check fails, you have to do a de novo assembly of your reads and then align your reads against the resulting transcriptome. You can also combine both approaches. It is common to perform de novo assembly when a reference genome is not available or is poorly annotated. Basically, you assemble your reads into longer contigs and then treat those contigs as the expressed transcriptome, to which the raw reads are mapped back for quantification. There are several tools available for this.

Which alignment program can one use to map the reads to a reference genome?

TopHat, Bowtie2, BWA

Unique or multiple mapping?

  1. a. Unique
  2. b. A good mapping rate is between 70% and 90%.

How to get read counts?

HTSeq, using the ‘union’ mode.

Which statistical methods can one use for DEG analysis?

Why do we use more than one method?

All DEG analysis tools are built on different normalization methods and assumptions. No single tool can be declared better than the others.

How to select genes?

FDR (1% to 5%), Fold Change (FC), Pathway

What is RPKM and how is it calculated?

C = Number of reads mapped to a gene
N = Total mapped reads in the experiment
L = Exon length in base-pairs for a gene
RPKM = (10^9 * C) / (N * L)
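
As a worked example of the formula, here is a small Python sketch. The numbers are made up for illustration and the function name rpkm is an assumption, not part of any package:

```python
def rpkm(c, n, l):
    """RPKM = (10^9 * C) / (N * L).

    c: reads mapped to the gene
    n: total mapped reads in the experiment
    l: exon length of the gene in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return (1e9 * c) / (n * l)


# Hypothetical gene: 1,000 reads, 20 million mapped reads in total, 2,000 bp of exons.
print(rpkm(1000, 20_000_000, 2000))  # 25.0
```

The factor 10^9 combines the "per kilobase" (10^3) and "per million reads" (10^6) scalings.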

What is the difference between RPKM and FPKM?

FPKM = RPKM if we have single-end reads.
FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponds to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).

In a typical edgeR pair-wise DEG analysis, what do the five columns in the output file represent?

  1. a. Column-1 (logFC): For a particular gene, a log2 fold change of −1 for condition treated vs untreated means that the treatment induces a change in observed expression level of 2^−1 = 0.5 compared to the untreated condition.
    logFC is the log2 fold change. Fold change generally refers to the ratio of average expression between two groups. In edgeR, exactTest() performs pair-wise tests for differential expression between two groups.
    When you're comparing two things, A and B, the fold change is A/B. A and B could be data sets reflecting gene expression measured under different conditions. If gene1 is 2-fold higher in A, the A/B ratio for gene1 is 2. If gene2 is 2-fold higher in B than in A, the gene2 ratio is 0.5. However, edgeR displays the log2 of the ratio. Therefore genes up or down by a given amount (e.g. two-fold) have the same distance from equality, but the sign (+/-) changes. Thus gene1 would have logFC: 1, and gene2 would have logFC: -1, reflecting 2-fold up or down respectively. On the log2 scale the values are symmetric around zero. For instance: -2 is 4-fold down, -1 is 2-fold down, 0 is equal in A and B (a ratio of 1), 1 is 2-fold up, 2 is 4-fold up, etc. To convert the logFC values from edgeR (which are essentially ratios of the normalized count values per gene) into an up or down ratio, you simply take 2 to the power of the logFC number, e.g. for a gene with logFC of 0.5 or -0.5 the ratio is 2^0.5 = 1.4 or 2^-0.5 = 0.7, respectively.
    Let's say there are 50 read counts in control and 100 read counts in treatment for gene A. This means gene A is expressed twice as much in treatment as in control (100 divided by 50 = 2), i.e. the fold change is 2. This works well for overexpressed genes, as the number directly corresponds to how many times a gene is overexpressed. But when it is the other way round (i.e. treatment 50, control 100), the value of the fold change will be 0.5 (all underexpressed genes will have values between 0 and 1, while overexpressed genes will have values from 1 to infinity). To put both directions on the same scale, we use log2 to express the fold change, i.e. log2 of 2 is 1 and log2 of 0.5 is -1.
    The log fold change is essentially the log2 of the ratio of the average CPM values of mutant and control: log2(avg.cpm.mut / avg.cpm.wt)
  2. b. Column-2 (logCPM):
    logCPM refers to the log2 of counts per million reads.
    To calculate CPM manually in R:
    cpm <- apply(countmatrix, 2, function(x) (x / sum(x)) * 1e6)
    # 1 is added before taking the log to avoid log(0)
    log.cpm <- log2(cpm + 1)
  3. c. Column-3 (LR): Likelihood ratio statistic of the test for differential expression, reported when the likelihood ratio test (glmLRT()) is used. It is not a log ratio of RPKM values such as log2(X1 RPKM / X2 RPKM).
    edgeR does not use RPKM; instead it calculates CPM (counts per million). edgeR does not consider the length of the gene for normalization. Read counts can generally be expected to be proportional to length as well as to expression for any transcript, but edgeR does not generally need to adjust for gene length because gene length has the same relative influence on the read counts for each RNA sample. For this reason, normalization issues arise only to the extent that technical factors have sample-specific effects.
  4. d. Column-4 (PValue): The p-value of the test for differential expression for that gene, before correction for multiple testing.
  5. e. Column-5 (FDR): The p-value adjusted for multiple testing (false discovery rate), calculated using the method of Benjamini and Hochberg.
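
The logFC and FDR columns described above can be illustrated with a small Python sketch. It assumes the simple "ratio of averages" view of fold change used in the explanation and a plain Benjamini-Hochberg adjustment. edgeR's own estimates come from its statistical model, so its numbers will not match this exactly, and all function names here are illustrative.

```python
import math


def log2_fold_change(mean_treated, mean_control):
    """log2 of the ratio treated / control (both means must be positive)."""
    return math.log2(mean_treated / mean_control)


def fold_change_from_log2(log_fc):
    """Convert a logFC value back to a plain ratio."""
    return 2.0 ** log_fc


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(log2_fold_change(100, 50))    # treatment 100 vs control 50
print(log2_fold_change(50, 100))    # treatment 50 vs control 100
print(fold_change_from_log2(-2))    # a logFC of -2 as a ratio
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The first two calls reproduce the 50 vs 100 read-count example from the text (logFC of 1 and -1).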

In a DEG analysis using edgeR, I noticed that a few genes show 'NA' in the output file. What does this represent, and how is it calculated within edgeR?

An edgeR analysis uses a filtering criterion to remove genes with low counts. Please refer to page 11 of the edgeR user guide.
“A requirement for expression in two or more libraries is used as the minimum number of samples in each group is two”. This ensures that a gene is retained only if it is expressed in at least two libraries. The following code filters out genes based on their CPM (counts per million) values.
keep <- rowSums(cpm(y) > 1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
So genes that do not have a CPM greater than 1 in at least 2 libraries are filtered out and therefore show 'NA' in the results file. Generally, it is preferred to use 3 replicates for RNA-Seq analysis, so a threshold of 2 works fine. If you would like to use a threshold of 1 instead, the code can be modified accordingly.
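
To make the filtering rule concrete, here is a rough Python equivalent of the two R lines above. It is only a sketch: it uses raw library sizes (no normalization factors), and the count table is invented for illustration.

```python
def cpm(counts):
    """Counts per million for a genes x libraries table given as a list of rows."""
    lib_sizes = [sum(column) for column in zip(*counts)]
    if any(size == 0 for size in lib_sizes):
        raise ValueError("every library needs at least one read")
    return [[c * 1e6 / size for c, size in zip(row, lib_sizes)] for row in counts]


def keep_expressed(counts, min_cpm=1.0, min_libraries=2):
    """True for each gene with CPM > min_cpm in at least min_libraries libraries."""
    return [
        sum(value > min_cpm for value in row) >= min_libraries
        for row in cpm(counts)
    ]


# Hypothetical counts: 3 genes (rows) x 3 libraries (columns), 1M reads per library.
counts = [
    [10, 12, 0],                # CPM > 1 in two libraries: kept
    [2, 0, 0],                  # CPM > 1 in only one library: filtered out
    [999988, 999988, 1000000],  # highly expressed: kept
]
print(keep_expressed(counts))
```

Lowering min_libraries to 1 corresponds to the "threshold of 1" mentioned above.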

How much total RNA is required for RNA-Seq projects?

The following specifications apply to total RNA samples:
  • Total amount: ≥ 5 μg
  • Concentration: ≥ 80 ng/μl
  • OD260/280 Range: 1.8-2.2
  • Re-suspended in nuclease-free water

Resources for RNA-Seq analysis:

Gene Ontology Enrichment Analysis

Background Frequency and Sample Frequency

Background frequency is the number of genes annotated to a GO term in the entire background set, while sample frequency is the number of genes annotated to that GO term in the input list. For example, if the input list contains 10 genes and the enrichment is done for biological process in S. cerevisiae, whose background set contains 6442 genes, then if 5 out of the 10 input genes are annotated to the GO term DNA repair, the sample frequency for DNA repair will be 5/10. If there are 100 genes annotated to DNA repair in the whole S. cerevisiae genome, then the background frequency will be 100/6442.


P-value is the probability or chance of seeing at least x number of genes out of the total n genes in the list annotated to a particular GO term, given the proportion of genes in the whole genome that are annotated to that GO Term. That is, the GO terms shared by the genes in the user's list are compared to the background distribution of annotation. The closer the p-value is to zero, the more significant the particular GO term associated with the group of genes is (i.e. the less likely the observed annotation of the particular GO term to a group of genes occurs by chance). In other words, when searching the process ontology, if all of the genes in a group were associated with DNA repair, this term would be significant. However, since all genes in the genome (with GO annotations) are indirectly associated with the top level term biological_process, this would not be significant if all the genes in a group were associated with this very high level term.
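
A p-value of this kind is commonly computed from the hypergeometric distribution. The Python sketch below assumes that model (individual GO tools may apply a different test or further corrections) and reuses the DNA repair numbers from the example above. The function name is illustrative.

```python
from math import comb


def go_enrichment_pvalue(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    k: genes in the input list annotated to the term
    n: size of the input list
    K: genes in the background annotated to the term
    N: size of the background set
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# 5 of 10 input genes annotated to DNA repair; 100 of 6442 genes in the background.
p = go_enrichment_pvalue(5, 10, 100, 6442)
print(f"p-value: {p:.3e}")
```

A sample frequency of 5/10 against a background frequency of 100/6442 gives a very small p-value, which is why such a term would be reported as significantly enriched.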


What is methyl-Seq?

Methyl-Seq is a technique that can determine DNA methylation patterns. The major difference from regular sequencing experiments is that in bisulfite sequencing the DNA is treated with bisulfite, which converts cytosine residues to uracil but leaves 5-methylcytosine residues unaffected. By sequencing and aligning the converted DNA fragments, it is possible to call the methylation state of individual cytosines.

What is Epigenetics?

Epigenetics is the study of modifications that lead to changes in gene expression without changing the original DNA sequence of the gene. There are three main mechanisms of epigenetics: DNA methylation, histone modification, and RNA regulation.

What is DNA Methylation?

DNA methylation is the best-studied epigenetic modification. It occurs when the hydrogen atom at the fifth carbon of a cytosine residue is replaced by a methyl group. DNA methylation is involved in a variety of biological processes and diseases:
  • Aging
  • Cancer
  • Cardiovascular disease
  • Neurodegenerative disease
  • Autoimmune disease
DNA methylation therefore presents a good diagnostic and therapeutic target.

What are CpG Islands?

CpG islands (CGIs) are genomic regions with a high proportion of CpG dinucleotides. Because they can be methylated, they play an important role in transcription regulation. CGIs are frequently found near or within promoters (70 percent of promoters have CGIs). Methylation of CGIs is one of the major mechanisms involved in gene expression regulation.

What are the different types of bisulfite sequencing?

  • Targeted Bisulfite Sequencing (TBS): Bisulfite sequencing method to detect base-resolution DNA methylation at a targeted region of interest.
  • Reduced Representation Bisulfite Sequencing (RRBS): Bisulfite sequencing method that enriches for CpG-rich regions of the genome and detects their methylation at base resolution.
  • Whole Genome Bisulfite Sequencing (WGBS): Complete genome coverage of methylation at every CpG site and at less common non-CpG sites such as CNG.

Whole Genome Bisulfite Sequencing (WGBS)?

This method combines the power of high-throughput DNA sequencing with sodium bisulfite treatment of DNA. When DNA is treated with sodium bisulfite, a chemical compound, unmethylated cytosines are converted to uracil.

How much sequencing depth (number of reads) is needed for WGBS?

Paired-end sequencing with either 100 or 150 bp reads is recommended.

What happens after the entire genome is treated with sodium bisulfite?

Sequence reads diverge from the reference genome at each converted (unmethylated) cytosine. Therefore, to map WGBS data to the reference genome, the reference also needs to be bisulfite converted in silico.
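
An in silico conversion of the reference can be sketched in a few lines of Python. This assumes the common three-letter strategy (a C-to-T copy for reads from the top strand and a G-to-A copy for reads from the bottom strand); real aligners handle indexing and strand bookkeeping themselves, and the function name is illustrative.

```python
def bisulfite_convert_reference(seq):
    """Return (C-to-T, G-to-A) fully converted copies of a reference sequence."""
    seq = seq.upper()
    c_to_t = seq.replace("C", "T")  # matches converted reads from the top strand
    g_to_a = seq.replace("G", "A")  # matches converted reads from the bottom strand
    return c_to_t, g_to_a


top, bottom = bisulfite_convert_reference("ACGTCG")
print(top)     # ATGTTG
print(bottom)  # ACATCA
```

Methylation is then called by comparing the original (unconverted) read with the original reference at each aligned cytosine position.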

What are the advantages and limitations of WGBS?

The advantage of using WGBS resides in the fact that it results in higher CpG densities, more than 10 CpG/100bp. In general, this method identifies 2 CpG/100bp, which corresponds to roughly 50% of the genome. It also gives the capability to evaluate the methylation status of nearly every CpG site. On the other hand, this method needs sufficient read depths to reliably determine methylation status and it is cost intensive when dealing with large genomes.

Reduced Representation Bisulfite Sequencing (RRBS)

To enrich for sections of the genome with a high CpG content, RRBS uses restriction enzymes and bisulfite sequencing. Coverage is focused on the sections of the genome that are rich in CpG dinucleotides, instead of the entire genome. This allows greater read depth compared to WGBS.

What are the advantages and limitations of RRBS?

With the RRBS method, both the data volume and the cost of the procedure are reduced. Also, DMRs are typically found in higher CpG density regions of 3 CpG/100 bp, which account for about 20% of the genome. However, since only a small portion of the genome is examined, the RRBS method has limitations when it comes to functional conclusions in species that do not have a good reference genome.

Methylated DNA Immunoprecipitation (MeDIP-Seq)

The MeDIP-Seq method uses immunoprecipitation to enrich for the portion of the genome containing either 5-methylcytosine (5mC) or 5-hydroxymethylcytosine (5hmC), followed by high-throughput sequencing. The first whole-genome methylation profile of a mammalian genome was created using MeDIP-Seq. It uses anti-5-methylcytosine antibodies and magnetic beads to look for methylated genomic regions.

What are the advantages and limitations of MeDIP-Seq?

MeDIP-Seq provides the highest percentage of coverage of the methylation sequencing methods described here. It covers CpG and non-CpG 5mC throughout the genome. It can be done on a genome-wide scale or in any region of interest, and it requires only a low amount of input DNA. However, this method is biased against regions of low CpG density: the antibody can only recognize methylation in regions above a certain threshold of CpG density. It is also not possible to do base-pair resolution analysis with this method.

Targeted Bisulfite Sequencing (TBS)

This method uses a hybridization-based step on platforms with pre-designed oligos that capture CpG islands, gene promoters, and other significantly methylated regions. Alternatively, a PCR-based step could be used to amplify multiple bisulfite-converted DNA regions in a single reaction.

Which alignment programs can one use to map bisulfite reads to a reference genome?

  • BSMAP was the first mapper for bisulfite data alignment. It indexes the genome using an efficient hash table seeding algorithm, bitwise masks each nucleotide in the reads and the reference, and efficiently matches them to each other.
  • GSNAP is a general-purpose mapper that can also handle bisulfite data. It uses a wild-card approach to match read seeds to genome regions and is based on special hash tables.
  • BS-Seeker2 is an extension of BS Seeker that uses a three-letter approach to map bisulfite data. In addition to gapped alignment, it can filter out reads with incomplete bisulfite conversion, increasing specificity.
  • BWA-meth is built on the BWA-mem aligner.
  • Segemehl was originally intended to be a general-purpose mapper, but it has since been extended to handle bisulfite data. For the seed search, it employs a wild-card approach based on suffix arrays, as well as the Myers bit-vector algorithm for computing semi-global alignments.
  • Bismark is a set of tools for the time-efficient analysis of Bisulfite-Seq (BS-Seq) data. Bismark performs alignments of bisulfite-treated reads to a reference genome and cytosine methylation calls at the same time. It is written in Perl and is run from the command line. Bisulfite-treated reads are mapped using the short-read aligner Bowtie 2, or alternatively HISAT2.

How do I choose an aligner for methyl-seq data?

Choosing an aligner depends on the type of data. For RRBS and WGBS data, Bismark is a good choice. When you need to analyze methylated DNA immunoprecipitation (MeDIP or mDIP) data, the MeQA or MEDIPS pipelines are useful.

How do you ensure best output and visualization of methylation data?

Pre-processing of sequencing reads is a must for best possible output and the visualization of the methylation data.

What is good sequencing coverage for Methyl-Seq?

  1. a. Large target regions: Probes should cover several hundred kilobases to several megabases of contiguous sequence, including exons, introns, 5’ regulatory regions, 3’ regulatory regions, and flanking regions.
  2. b. High specificity, achieved with longer probes (200-300 bp)
  3. c. Coverage of the targeted region, including repeats

Non-directional or paired-end RRBS libraries?

Non-directional bisulfite sequencing is less common, but has been performed in a number of studies (Cokus et al. (2008), Popp et al. (2010), Smallwood et al. (2011), Hansen et al. (2011), Kobayashi et al. (2012)). In this type of library, sequence reads may originate from all four possible bisulfite DNA strands (original top (OT), complementary to OT (CTOT), original bottom (OB) or complementary to OB (CTOB)) with roughly the same likelihood.
Paired-end reads do by definition contain one read from one of the original strands as well as one complementary strand. Please note that the CTOT or CTOB strands in a read pair are reverse complements of the OT or OB strands, respectively, and thus they carry methylation information for the exact same strand as their partner read but not for the other original DNA strand. Similar to directional single-end libraries, the first read of directional paired-end libraries always comes from either the OT or OB strand. The first read of non-directional paired-end libraries may originate from any of the four possible bisulfite strands.

What is CpG methylation?

CpG islands are defined as stretches of DNA 500-1500 bp long with a CG: GC ratio of more than 0.6, and they are normally found at promoters and contain the 5' end of the transcript (reviewed in Cross and Bird, 1995). From: Advances in Genetics, 2002.

How do I identify CpG islands?

CpG islands are defined as sequence ranges where the Obs/Exp value is greater than 0.6 and the GC content is greater than 50%. The expected number of CpG dimers in a window is calculated as the number of 'C's in the window multiplied by the number of 'G's in the window, divided by the window length.
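
The two criteria above can be checked for a single window with a short Python sketch. The thresholds are the ones quoted in the answer; the window sequences and function names are made up for illustration, and scanning a whole genome (sliding windows, merging hits) is left out.

```python
def cpg_stats(window):
    """Return (GC content, observed/expected CpG ratio) for one sequence window."""
    window = window.upper()
    length = len(window)
    if length == 0:
        raise ValueError("window must not be empty")
    c = window.count("C")
    g = window.count("G")
    observed = window.count("CG")
    gc_content = (c + g) / length
    # Expected CpG count is (C count * G count) / window length.
    obs_exp = observed * length / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(window, min_obs_exp=0.6, min_gc=0.5):
    """Apply the Obs/Exp > 0.6 and GC content > 50% criteria to one window."""
    gc_content, obs_exp = cpg_stats(window)
    return obs_exp > min_obs_exp and gc_content > min_gc


print(cpg_stats("CGCGCGCGAT"))              # CpG-rich toy window
print(looks_like_cpg_island("CGCGCGCGAT"))
print(looks_like_cpg_island("ATATATGCAT"))  # AT-rich toy window
```

Real windows are much longer than these 10 bp toy sequences; they are kept short here so the counts can be checked by hand.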

How do we account for bias in Methylation data?

There are several hurdles to cross when analysing Methyl-seq data, particularly during the identification of differentially methylated regions (DMRs). Normalization of read counts helps to eliminate biases that result from variability in sequencing depth between samples. While global read count normalization can help address this problem, it does not account for ‘competition’ effects. For example, in RNA-seq, specific highly expressed genes can depress the read counts of other genes, introducing a bias when comparing samples. The same situation can arise with MeDIP-seq, where sample-specific repeat methylation could diminish reads in other genomic regions and introduce bias into the analysis. Methods for calculating methylation have proven useful for identifying large global changes, for example hypomethylated or hypermethylated regions of a given sample. However, these methods have not yet provided a framework for determining the location of DMRs in a statistically rigorous manner. Methods such as DESeq, which estimate variance in a local fashion, make it possible to remove potential selection biases. Additionally, DESeq fits a flexible, mean-dependent local regression rather than attempting to reliably estimate both the variance and mean parameters of the distribution from limited numbers of replicates. Finally, false positives can be introduced into the interpretation of your data by differences in DNA fragment size distributions between samples. Performing fragment length normalization through read sub-sampling to equalize the distributions can eliminate this potential bias. Additional tools such as edgeR or methylKit can also be used.


What are SNPs?

Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation found in humans. Each SNP represents a difference in a single DNA building block, called a nucleotide. For example, a SNP may replace the nucleotide cytosine (C) with the nucleotide thymine (T) in the DNA. Measuring these single nucleotide differences between members of a species is known as SNP genotyping.

SNPs have been used as markers in quantitative trait loci (QTL) analyses and association studies in place of microsatellites because they are believed to be conserved throughout evolution. NGS technologies are now being used to reveal the genetic basis of traits, local adaptation, and evolution in non-model animals and plants, including cattle, pigs, rice, maize, soybean, and cucumber.

What causes single nucleotide polymorphism?

A SNP is defined as a variant in which more than 1% of a population does not carry the same nucleotide at a certain place in the DNA sequence. When an SNP exists within a gene, the gene is referred to as having several alleles. SNPs can cause changes in the amino acid sequence in several circumstances.

What coverage (sequenced reads) do I need for my SNP sequenced samples?

It depends on the goals of your experiment and the organism of interest. For short reads it is recommended:
  • Germline variant analysis: 20-50x
  • Somatic variants: 100-1000x
  • Tumor vs Normal: ≥60x tumor, ≥30x normal
  • Population studies: 20-50x
  • De novo assembly: 100-1000x
For long reads it is recommended:
  • Gap filling and scaffolding: 10x
  • Large structural variant detection: 10x
  • Germline/frequent variant analysis: 20-50x
  • De novo assembly: 100-1000x

How do I calculate coverage?

The Lander/Waterman equation is used to calculate genome coverage. The general equation is: C = LN / G
  • C stands for coverage
  • G is the haploid genome length
  • L is the read length
  • N is the number of reads
Coverage = (read length) x (total number of reads) / (genome size). For example, a human genome reaches 30x coverage with 600 million 150 bp reads (or 300M read pairs): 30x = (150 bp/read) x (600 x 10^6 reads) / (3 x 10^9 bp)
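
The same equation can be turned around to estimate how many reads to order for a target coverage. A minimal Python sketch using the human example above (function names are illustrative):

```python
import math


def coverage(read_length, n_reads, genome_size):
    """C = L * N / G."""
    return read_length * n_reads / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """N = C * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_size / read_length)


human_genome = 3_000_000_000  # haploid genome size in bp, as in the example above
print(coverage(150, 600_000_000, human_genome))  # 30.0
print(reads_needed(30, 150, human_genome))       # 600000000
```

For paired-end runs, N counts individual reads, so the number of read pairs is half of the value returned by reads_needed.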

What alignment tools can I use for SNP analysis?

GATK, DNASTAR, NGSEP, PLINK, and Bowtie2.

What are the filter parameters for determining the genotype for each SNP site?

Two commonly used filter parameters are genotype quality (GQ) >= 20 and read depth (DP) >= 10.
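
Applied to per-sample genotype calls, these two thresholds amount to a simple filter. The Python sketch below uses hypothetical records stored as dictionaries; in practice the GQ and DP values would be read from the sample columns of a VCF file with a dedicated parser.

```python
def passes_filters(gq, dp, min_gq=20, min_dp=10):
    """Keep a genotype call only if GQ >= min_gq and DP >= min_dp."""
    if gq is None or dp is None:
        return False  # a missing value is treated as a failed filter
    return gq >= min_gq and dp >= min_dp


# Hypothetical genotype calls at one SNP site.
calls = [
    {"sample": "S1", "GT": "0/1", "GQ": 35, "DP": 14},
    {"sample": "S2", "GT": "1/1", "GQ": 12, "DP": 30},   # low genotype quality
    {"sample": "S3", "GT": "0/0", "GQ": 50, "DP": 6},    # low read depth
    {"sample": "S4", "GT": "./.", "GQ": None, "DP": 0},  # missing genotype
]
kept = [c["sample"] for c in calls if passes_filters(c["GQ"], c["DP"])]
print(kept)  # ['S1']
```

Calls that fail are usually set to missing rather than dropped, so that each site keeps one entry per sample.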

What is the best way to visualize my SNP data?

  1. A Manhattan plot, which plots statistical significance as -log10(p-value) on the y-axis against chromosomes on the x-axis, is an effective way to display millions of genetic variants in a single figure. It is simple to identify regions of the genome that exceed a certain level of significance. Furthermore, it simplifies the identification of regions that can be replicated.
  2. A Q-Q plot is a good way to see whether two data sets share a common distribution.

What is the difference between SNV and SNP?

Single nucleotide variants (SNVs) are sometimes referred to as single nucleotide polymorphisms (SNPs); however, the terms are not interchangeable. To be considered a SNP, a variant must be found in at least 1% of the population.

What type of variation analysis can I do with my SNP datasets?

Detection and genotyping of Single Nucleotide Variants (SNVs), small and large indels, short tandem repeats (STRs), inversions, and Copy Number Variants (CNVs).

What type of files do I need to start my SNP analysis?

You can start with raw reads and use software built to take the analysis from raw reads to variant detection, such as NGSEP, systemPipeR, and Bowtie. Alternatively, start with a VCF file generated from your raw reads and use tools such as PLINK.

Can a single nucleotide polymorphism (SNP) be considered an epigenetic variation?

A SNP is not an epigenetic modification; it is a genetic modification, also known as a mutation. However, SNPs have the potential to be excellent markers for correlating different levels of allele expression and allele structure. Even if the SNP is not directly responsible for the variation in expression level, it can be linked to other mutations on the same allele sequence, allowing the causal relationship to be established.

What is GWAS analysis?

A genome-wide association study (GWAS) scans the genome for variants that are statistically associated with key phenotypes, which helps to identify candidate causative genes.

How do I control false positives in GWAS analysis?

  1. Using the Q matrix
    • Population structure is the systematic difference in allele frequencies between subgroups of a population caused by non-random mating between individuals. Incorporating the Q matrix, which records the fraction of each individual's ancestry attributed to each subpopulation, can help reduce false positives. Principal components (PCs) of the genotype data have been reported to give results similar to the Q matrix, so either can be used to account for population structure in GWAS.
  2. Using a kinship matrix
    • Hidden (cryptic) relatedness between individuals can also produce false positives. Including a kinship matrix in a mixed linear model (MLM) addresses this problem, so by taking both population structure and family relatedness into account, the MLM can reduce false positives to some extent. However, MLM methods test only one marker at a time, which does not match the true genetic model of complex traits that are controlled by multiple loci at the same time. Bonferroni correction, False Discovery Rate (FDR), permutation tests, and Bayesian approaches can be used to address this issue, and multi-locus mixed linear models can reduce false positives further.
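Two of the multiple-testing corrections mentioned above can be sketched in plain Python. This is a minimal illustration assuming a list of per-marker p-values and a significance level of 0.05; the FDR function uses the Benjamini-Hochberg procedure, which is one common way of controlling the FDR and is chosen here only as an example.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag markers whose p-value passes the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag markers that are significant under Benjamini-Hochberg FDR control."""
    m = len(p_values)
    # Indices of the markers, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            max_rank = rank

    # Every marker ranked at or below max_rank is declared significant.
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            significant[index] = True
    return significant


# Hypothetical p-values for five markers.
p_values = [0.8, 0.001, 0.2, 0.012, 0.04]
print(bonferroni(p_values))          # [False, True, False, False, False]
print(benjamini_hochberg(p_values))  # [False, True, False, True, False]
```

Bonferroni is the stricter of the two, so it keeps fewer markers in this toy example.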

Single Cell RNA-Seq (scRNA)

What is single-cell RNA-Seq?

Single-cell RNA-Seq measures the distribution of expression levels for each gene across a population of cells.

What can scRNA analysis address?

scRNA analysis allows researchers to investigate novel biological topics in which cell-specific transcriptome alterations are significant:
  • Identification of cell type
  • Cellular response heterogeneity
  • Gene expression stochasticity
  • Gene regulatory network inference across cells

What is the difference between RNA-seq and Single Cell RNA?

With bulk RNA-seq, we compare two tissues by looking at the average expression of each gene detected in each tissue. The differential expression is then calculated as the ratio of a given gene's expression in one tissue versus the other. With scRNA, we no longer compare tissues, but rather cells to cells. Each cell is assigned a gene profile, which describes the relative expression of the genes detected within it. Ideally, a gene profile describes a cell type, and many cells share the same gene profile.

What are barcodes?

Barcodes are short sequence tags added to each RNA molecule. Cell barcodes tell us which cell a transcript came from. They are unique to the cell rather than to the molecule, so two RNA molecules from the same cell are tagged with the same cell barcode, while RNA molecules from different cells carry different cell barcodes.

What are UMIs?

Unique Molecular Identifiers (UMIs) are random barcodes (4-10 bp) that are added to transcripts during reverse transcription. They allow sequencing reads to be assigned to individual transcript molecules. By counting the reads from the same gene that carry the same UMI, we can tell how much a transcript molecule was amplified.

How does read counting with UMIs work?

Reads with different UMIs that map to the same transcript are derived from different molecules and are biological duplicates, so each should be counted. Reads with the same UMI came from the same molecule and are technical duplicates, so they should be collapsed and counted as a single read.
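The counting rule can be sketched as follows. The reads are made-up (cell barcode, gene, UMI) tuples, and the sketch assumes UMIs are error-free; production tools also merge UMIs that differ only by sequencing errors.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count unique UMIs per (cell barcode, gene) pair.

    reads: iterable of (cell_barcode, gene, umi) tuples
    """
    umis_seen = defaultdict(set)
    for cell_barcode, gene, umi in reads:
        # A set keeps each UMI once, so PCR duplicates collapse automatically.
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


# Hypothetical reads.
reads = [
    ("CELL_A", "GeneX", "AACG"),
    ("CELL_A", "GeneX", "AACG"),  # same UMI: technical (PCR) duplicate
    ("CELL_A", "GeneX", "TTGC"),  # different UMI: a second molecule
    ("CELL_B", "GeneX", "AACG"),  # same UMI but another cell: counted separately
]

counts = count_molecules(reads)
print(counts)  # {('CELL_A', 'GeneX'): 2, ('CELL_B', 'GeneX'): 1}
```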

How many cells should I have in my samples?

Depending on their sequencing platform, researchers can specify the number of cells to be sequenced, ranging from 500 to 10,000 cells per sample. For most experiments, it is recommended that 5,000 cells per sample be used.

How many reads do I need for my scRNA-Seq?

To detect the bulk of low-abundance transcripts, obtaining 1-2 million reads per cell is advised.

How many samples should I multiplex on a lane?

Once the number of reads required per sample has been determined, the number of samples per flowcell or lane is a simple calculation: divide the number of reads expected from the flowcell or lane by the number of reads required per sample.
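As a small worked example, the division looks like this in Python. The read numbers are placeholders chosen for illustration, not platform specifications; check the expected output of your own flowcell or lane.

```python
def samples_per_lane(expected_reads_per_lane: int, reads_per_sample: int) -> int:
    """Number of whole samples that fit on one lane (or flowcell)."""
    if reads_per_sample <= 0:
        raise ValueError("reads_per_sample must be positive")
    # Floor division: a partially covered sample does not count.
    return expected_reads_per_lane // reads_per_sample


# Hypothetical lane yielding 400 million reads, 50 million reads needed per sample.
print(samples_per_lane(400_000_000, 50_000_000))  # 8
```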

What files do I need for quantification of my scRNA-seq data?

  • The sample index: determines which sample the read came from (Added during library preparation - needs to be documented).
  • Cellular barcode: identifies which cell the read came from (Each library preparation method has a stock of cellular barcodes used during the library preparation).
  • The unique molecular identifier (UMI): identifies the transcript molecule from which the read originated (The UMI will be used to collapse PCR duplicates)
  • Read1: the Read1 sequence
  • Read2: the Read2 sequence

What are the steps to generate a count matrix for my scRNA-seq data?

This procedure consists of four steps:
  • Filtering and formatting noisy cellular barcodes
  • Sample demultiplexing
  • Transcriptome mapping
  • UMI collapsing and read quantification

What are some protocols I can use for scRNA sequencing?

  • SMART-seq2 (Picelli et al. 2013)
  • CEL-seq (Hashimshony et al. 2012)
  • Drop-seq (Macosko et al. 2015)
  • InDrop-seq (Klein et al. 2015)
  • MARS-seq (Jaitin et al. 2014)
  • SCRB-seq (Soumillon et al. 2014)
  • Seq-well (Gierahn et al. 2017)
  • STRT-seq (Islam et al. 2013)
Commercial platforms available:
  • Fluidigm C1
  • Wafergen ICELL8
  • 10X Genomics Chromium

What are some tools I can use to analyze my scRNA data?

Several analysis tools are available:
  • Bioconductor is an open-source, open-development software project for the analysis of high-throughput genomics data, including packages for the analysis of single-cell data.
  • Seurat is an R package used for QC, analysis, and exploration of single cell RNA-seq data.
  • scanpy is a Python-based tool for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, and differential expression testing.
  • Falco is a single-cell RNA-seq processing framework that runs on the cloud.
  • ASAP (Automated Single-cell Analysis Pipeline) is an interactive web-based platform for single-cell analysis.


What is metagenomics?

Metagenomics is commonly used when researching microbial communities in which it is impossible to separate one microbe from another. For example, two bacteria may grow together, so when you sequence their DNA, you obtain the DNA sequence of both bacteria. Metagenomics provides a way to distinguish between the sequences. It permits the study of all microorganisms, regardless of whether they can be cultivated or not, by analyzing genomic data extracted directly from an environmental sample, revealing the species present and permitting the extraction of information about microbial functionality.

Is there a distinction between genomics and metagenomics?

The nature of the sample is the primary distinction between genomics and metagenomics. Genomics examines the complete genetic information of a single organism, whereas metagenomics examines a mixture of DNA from multiple organisms and entities, such as viruses, viroids, and free DNA.

What are some of the benefits of metagenomics research?

Metagenomics eliminates the requirement to culture organisms in the lab, removing the biases associated with classic cultivation-based approaches such as plate counts.

What software can I use for metagenomics data analysis?

Mothur, DADA2, and QIIME are very popular.

What are the reference databases I can use to map my metagenomics data?

SILVA, Greengenes, RDP, and UNITE.

What is the difference between 16S rRNA, 18S rRNA, and ITS?

16S rDNA is the DNA sequence that encodes the small subunit rRNA of prokaryotes. It has a length of approximately 1542 bp, a moderate molecular size, and a low mutation rate, and it is the most commonly used marker in the study of bacterial systematics. The 16S rDNA sequence is made up of 10 conserved regions and 9 variable regions. Conserved region sequences reflect genetic relationships between species, whereas variable region sequences reflect species differences. 16S rRNA sequencing is primarily used to investigate the diversity of bacteria and archaea.
18S rDNA is the DNA sequence that encodes the small subunit rRNA of eukaryotic ribosomes. Like 16S rDNA, it is made up of conserved and variable regions (V1-V9, with the exception of V6). Among the variable regions, V4 has the most complete database information and gives the best classification results, which makes it the most commonly used and best option for 18S rRNA gene analysis. 18S rDNA sequencing reflects the species differences among the eukaryotic organisms in a given sample.
ITS (Internal Transcribed Spacer) is a non-coding region of the fungal rRNA gene cluster. ITS1 and ITS2 are the ITS sequences most commonly used for fungal identification. The 5.8S, 18S, and 28S rRNA genes are highly conserved in fungi, whereas ITS is under low natural selection pressure, tolerates greater variation during evolution, and exhibits wide sequence polymorphism in most eukaryotes. At the same time, ITS sequences are relatively conserved within species and differ clearly between species. ITS sequence fragments are short (350 bp to 400 bp) and simple to analyze, and they have been used extensively in fungal phylogenetic analysis.

What information can be obtained from metagenomics analysis?

  • Sequence variation
  • Species classification
  • Species abundance
  • Population structure
  • Phylogenetic evolution
  • Community comparison

What are OTUs?

OTUs (Operational Taxonomic Units) are used to group closely related individuals. In general, if the similarity of two rDNA sequences exceeds 97 percent, those sequences can be assigned to the same OTU. Each OTU is represented by a distinct rDNA sequence and is typically treated as an approximation of a single species. The microbial diversity and the abundance of different microorganisms in the sample can be determined using OTU analysis.
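The 97 percent rule can be illustrated with a deliberately simplified greedy clustering sketch. It assumes the sequences are already aligned and of equal length, so identity is just the fraction of matching positions; real tools such as those listed above compute alignments and use more refined algorithms. All names and sequences here are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def cluster_otus(sequences, threshold=0.97):
    """Greedy clustering: a sequence joins the first OTU whose representative it matches."""
    otus = []  # list of (representative_sequence, member_sequences)
    for seq in sequences:
        for representative, members in otus:
            if identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            # No existing OTU is similar enough, so this sequence founds a new one.
            otus.append((seq, [seq]))
    return otus


# Hypothetical 100 bp toy sequences.
seq_a = "A" * 100
seq_b = "A" * 98 + "CC"        # 98% identical to seq_a
seq_c = "A" * 90 + "C" * 10    # 90% identical to seq_a
otus = cluster_otus([seq_a, seq_b, seq_c, seq_c])
print([len(members) for _, members in otus])  # [2, 2]
```

The number of sequences in each OTU gives a rough measure of abundance, and the number of OTUs gives a rough measure of diversity.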

How much DNA is required for 16S metagenomics library preparation?

The standard protocol for the Zymo kit calls for 5-20 ng of total DNA. The Swift kit recommends 1 ng, but sufficient yields can be obtained with anywhere from 10 pg to 50 ng depending on sample type.

What is the difference between shotgun and targeted metagenomics?

After collecting samples from the environment, you must prepare libraries so that they can be sequenced and analyzed. The two main methods for creating libraries are:

Targeted metagenomics: sequencing focuses on a specific region of the genome (for example, the 16S rRNA or 18S rRNA gene) that is shared by multiple organisms and samples. It provides more precise and detailed data for the targeted region, but amplification of that region may be unequal across organisms.

Shotgun metagenomics: everything in a sample is sequenced. It is suitable for all organisms and provides higher resolution of genetic content (particularly DNA), but results in extremely complex datasets.

Shotgun Metagenomics Analysis

When dealing with metagenomic datasets, there are three main approaches:
Marker gene analysis: sequences are compared to databases of taxonomically or phylogenetically informative sequences known as marker genes, their similarity is examined, and the sequences are taxonomically annotated. The most commonly utilized marker genes are single-copy ribosomal RNA (ribonucleic acid) genes found in all microbial genomes.
Binning: sequences are divided into groups of similar sequences that correspond to taxonomic categories such as species, genus, and higher levels.
Assembly: all the short sequences in a sample are put together to generate much longer sequences that represent genomes.

Targeted Metagenomics Analysis

To perform targeted metagenomics, genetic material is extracted from samples, and the genes of interest are PCR amplified. The 16S ribosomal RNA gene is the most commonly used gene for this purpose and is referred to as the "universal phylogenetic marker." It is present in all bacteria and archaea, although its copy number varies between genomes. The resulting data is then processed and analyzed with various tools. From there, we can identify operational taxonomic units (OTUs), aspects of community structure, and functional roles in microbial communities.

Eukaryotes Assembly

What is eukaryotic assembly?

Genome assembly is the process of putting nucleotide sequences in the correct order. Assembly is required because sequence read lengths are much shorter than most genomes or even most genes. Eukaryotes are organisms with eukaryotic cells, and eukaryotic assembly refers to the genome assembly of eukaryotic organisms. Assembly entails taking a large number of DNA reads, looking for areas where they overlap, and then gradually piecing together the 'jigsaw' puzzle in an attempt to rebuild the original genome.
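The overlap idea can be illustrated with a toy greedy assembler. This is only a sketch on made-up, error-free reads: it repeatedly merges the pair of sequences with the longest suffix-prefix overlap. Real assemblers use far more sophisticated graph-based methods and must handle sequencing errors, repeats, and both DNA strands.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_length)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length: int = 3):
    """Repeatedly merge the two sequences with the longest overlap until none remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_length, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            return contigs  # nothing left to merge
        # Join the pair, keeping the overlapping bases only once.
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads sampled from the toy sequence ATGCGTACGTTAGC.
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```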

What are eukaryotic organisms?

Eukaryotes are organisms with a nucleus and other membrane-bound organelles in their cells. They include all animals, plants, fungi, and protists, as well as the majority of algae. Eukaryotes can be multicellular or single-celled.

When it comes to alignment and assembly, what's the difference?

The terms "mapping" and "alignment" are interchangeable: your reads are compared to a known reference. Assembly is the process of constructing contigs based entirely on the reads themselves.

What is a good coverage for genome assembly?

According to our findings, the optimal read depth for assembling these genomes using almost all assemblers is 50X. Furthermore, de novo assembly from 50X read data only requires 6–40 GB RAM. To put this in context, once a human genome has been fully sequenced, we will have approximately 100 gigabases (100,000,000,000 bases) of sequence data.

What do we mean by coverage?

Coverage refers to the number of times each nucleotide in the genome is sequenced. Sequencing each base multiple times helps to account for potential errors. For example, 30-fold (30X) coverage means that each base is sequenced 30 times on average.
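To make the idea concrete, here is a minimal sketch that counts how many reads cover each base of a tiny made-up genome and then averages the result. Reads are given as hypothetical (start, length) pairs with 0-based start positions.

```python
def per_base_depth(genome_length: int, reads):
    """Return a list with the number of reads covering each genome position.

    reads: iterable of (start, length) pairs, with 0-based start positions
    """
    depth = [0] * genome_length
    for start, length in reads:
        # Clip the read to the genome boundaries before counting.
        for position in range(max(start, 0), min(start + length, genome_length)):
            depth[position] += 1
    return depth


# Three hypothetical 4 bp reads on an 8 bp genome.
reads = [(0, 4), (2, 4), (4, 4)]
depth = per_base_depth(8, reads)
mean_coverage = sum(depth) / len(depth)

print(depth)          # [1, 1, 2, 2, 2, 2, 1, 1]
print(mean_coverage)  # 1.5
```

The mean of 1.5 matches (read length) x (number of reads) / (genome size) = 4 x 3 / 8, while the per-base list shows that individual positions can sit above or below the average.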

What do I need, paired-end or single-end reads for genome assembly?

The primary advantage of paired-end reads is that scientists can determine the distance between the two ends. This makes assembling them into a continuous DNA sequence easier. Paired-end reads are especially useful when constructing a de novo sequence because they provide long-range information that would otherwise be unavailable in the absence of a gene map.

What are single-end reads?

Single reads are sequences that cover only one end or the entire length of a DNA fragment. These sequences can then be joined together to form the full DNA sequence by locating overlapping regions in the sequence.

What are paired-end reads?

Paired-end reads are produced when both ends of a DNA fragment are sequenced. The distance between the paired reads can range from 200 to several thousand base pairs.

What software is available for eukaryotic assembly?

Velvet, Trans-ABySS, Trinity, and SOAPdenovo2.

What is de novo assembly?

De novo sequencing is the process of sequencing an organism's genome for the first time. In de novo assembly, there is no existing reference genome sequence for that species to use as a template for assembling its genome.

Prokaryotes Assembly

What is prokaryotic assembly?

Prokaryotes are organisms made up of only one prokaryotic cell. In contrast, plants, mammals, fungi, and protists all have eukaryotic cells, which have a diameter of 10-100 µm and house their DNA in a membrane-bound nucleus. Prokaryotic assembly differs from eukaryotic assembly only in the software tools used to perform the analysis.

What software is available for prokaryotic assembly?

SOAPdenovo2, SPAdes, and PGAP.

What's the difference between genome assembly and annotation?

De novo assembly of short-read fragments is used in genomic analysis to reconstruct full-length base sequences without relying on a reference genome sequence. The annotation step then identifies gene sites within the base sequences and determines the structures and functions of these genes.

What is the process of gene annotation?

A simple approach to gene annotation uses homology-based search tools, such as BLAST, to look for homologous genes in specific databases, and then uses that information to annotate genes and genomes.