Prokaryotes: De Novo Genome Assembly and Annotation

How Prokaryote Assembly and Annotation Works

Genome assembly is the process of taking many individual, disconnected pieces of DNA (reads) produced by a sequencer and putting them back together to reconstruct the original genome sequence. A high-quality assembly and annotation of a genome has become a vital instrument for improving biological understanding of any species. The biggest challenge in genome assembly is assembly error, which occurs for a variety of reasons: pieces are frequently discarded incorrectly as mistakes or repeats, while others are joined in the wrong places or orientations. Using long, high-quality reads for your analysis addresses many of these issues.

Single and paired reads are used to assemble a genome. Single reads are simply short sequenced fragments that can be joined together through overlapping regions to form a continuous sequence known as a 'contig'. Paired reads are roughly the same length as single reads, but the two reads of a pair come from opposite ends of the same DNA fragment. Depending on the sequencer used, the distance between the two reads of a pair can range from 200 base pairs to several tens of kilobases. Knowing that paired reads were generated from the same piece of DNA helps to order and orient contigs into 'scaffolds'. Paired-read data can also be used to estimate the size of repetitive regions.
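
To make the idea of joining reads through overlapping regions concrete, here is a minimal Python sketch. It is an illustration only, not the pipeline's actual implementation: the function names `overlap_length` and `merge_reads` and the `min_overlap` threshold are hypothetical, and the sketch assumes error-free reads on the same strand (real assemblers also handle sequencing errors and reverse complements).

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is also a
    prefix of `right`, or 0 if it is shorter than `min_overlap`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads into one contig if they overlap; otherwise return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[size:]


if __name__ == "__main__":
    print(merge_reads("ATGCGTAC", "GTACCTGA"))
```

Here the two toy reads share the four bases `GTAC`, so they merge into a single 12-base contig. Reads with no sufficiently long overlap are left unmerged.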

A genome assembly is considered good quality on the basis of:

  • The number of scaffolds and contigs that represent the genome
  • The proportion of reads that are assembled
  • The absolute length of contigs and scaffolds
  • The length of contigs and scaffolds relative to the size of the genome
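
The criteria above can be summarised from a list of contig lengths and a few counts. The following Python sketch is one hedged way to do so; the function name `assembly_summary` and all of its parameters are hypothetical, and the expected genome size and read counts are assumed to be supplied by the user (for example from a related reference genome and a read-mapping step).

```python
def assembly_summary(contig_lengths, genome_size, reads_assembled, reads_total):
    """Summarise basic assembly quality criteria.

    contig_lengths  -- lengths (bp) of the contigs or scaffolds
    genome_size     -- expected genome size in bp (an assumed input)
    reads_assembled -- number of reads placed in the assembly (assumed input)
    reads_total     -- total number of reads sequenced (assumed input)
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total_length = sum(contig_lengths)
    longest = max(contig_lengths)
    return {
        "num_contigs": len(contig_lengths),
        "reads_assembled_fraction": reads_assembled / reads_total,
        "total_length": total_length,
        "longest_contig": longest,
        "longest_fraction_of_genome": longest / genome_size,
        "total_fraction_of_genome": total_length / genome_size,
    }


if __name__ == "__main__":
    summary = assembly_summary(
        [3_000_000, 1_500_000, 500_000],
        genome_size=5_000_000,
        reads_assembled=950,
        reads_total=1000,
    )
    for key, value in summary.items():
        print(key, value)
```

Fewer, longer contigs that together span close to the expected genome size, with most reads assembled, generally indicate a better assembly.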

The most commonly used metric to evaluate a new genome assembly is N50: the length of the shortest scaffold or contig such that scaffolds or contigs of that length or longer together represent at least 50% of the assembly.

Our de novo assembly pipeline for prokaryotes consists of two major parts:

Pre-assembly analysis:
  • Raw subreads overlapping for error correction
  • Preassembly and error correction
  • Overlap detection among the error-corrected reads
  • Overlap filtering
  • Construct graph from overlaps
  • Construct contig from graph
  • Construct scaffolds or chromosomes
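
The overlap detection, overlap filtering, graph construction and contig construction steps above can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the pipeline's actual algorithm: the function names and the `min_len` threshold are hypothetical, the reads are assumed to be error-corrected and on the same strand, and the contig is built with a simple greedy walk rather than the graph simplification a production assembler would perform.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len):
    """Longest suffix of `a` matching a prefix of `b`, or 0 if below min_len."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def build_overlap_graph(reads, min_len):
    """Detect overlaps between all ordered read pairs and keep only those of
    at least `min_len` bases (the filtering step). Returns {(i, j): overlap}."""
    graph = {}
    for i, j in permutations(range(len(reads)), 2):
        size = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if size:
            graph[(i, j)] = size
    return graph


def contig_from_graph(reads, graph):
    """Greedy walk: start at a read with no incoming edge and repeatedly
    follow the largest overlap to an unvisited read."""
    has_incoming = {j for (_, j) in graph}
    starts = [i for i in range(len(reads)) if i not in has_incoming]
    current = starts[0] if starts else 0
    contig = reads[current]
    visited = {current}
    while True:
        candidates = [
            (size, j)
            for (i, j), size in graph.items()
            if i == current and j not in visited
        ]
        if not candidates:
            return contig
        size, nxt = max(candidates)
        contig += reads[nxt][size:]  # append only the non-overlapping tail
        visited.add(nxt)
        current = nxt


if __name__ == "__main__":
    reads = ["ATGCGT", "GCGTAC", "TACCTG"]
    graph = build_overlap_graph(reads, min_len=3)
    print(contig_from_graph(reads, graph))
```

With these three toy reads the graph has two edges (read 0 to read 1, read 1 to read 2), and the walk produces a single contig covering all three reads.
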
Post-assembly analysis:
  • Step 1 - Assembly QC assessment
  • Step 2 - Gene Prediction
  • Step 3 - Protein-coding region identification
  • Step 4 - Functional annotation
  • Step 5 - Results as tables, mapping BAM files, and summary statistics
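
As an illustration of the kind of summary statistic reported during assembly QC (Step 1), the N50 metric described earlier can be computed from contig or scaffold lengths alone. The sketch below is a minimal example; the function name `n50` is hypothetical and this is not necessarily how the pipeline computes it.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths:
    the length of the shortest sequence such that sequences of that
    length or longer cover at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    print(n50([80, 70, 50, 40, 30, 20]))
```

For the toy lengths above, the total is 290, and the two longest contigs (80 + 70 = 150) already cover more than half of it, so the N50 is 70.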