Genome assembly is the process of taking many short DNA sequences (reads), which are produced independently of one another, and putting them back together to reconstruct the original genome. High-quality genome assembly and annotation have become vital instruments for improving biological understanding of all species. The biggest challenge in genome assembly is assembly error, which occurs for a variety of reasons: pieces are frequently discarded incorrectly as mistakes or repeats, while others are joined in the wrong places or orientations. Using long, high-quality reads addresses many of these issues.
Both single and paired reads are used to assemble a genome. Single reads are simply short sequenced fragments that can be joined through overlapping regions to form a continuous sequence known as a 'contig'. Paired reads are roughly the same length as single reads, but the two reads of a pair come from opposite ends of the same DNA fragment. Depending on the sequencer used, the distance between the two reads of a pair can range from 200 base pairs to several tens of kilobases. Knowing that two reads were generated from the same piece of DNA, a roughly known distance apart, helps to order and orient contigs into 'scaffolds'. Paired-read data can also be used to estimate the size of repetitive regions.
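The idea of joining single reads through their overlapping regions can be illustrated with a minimal Python sketch. It is a toy greedy merge, not the method of any particular assembler: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all assumptions made for illustration. Real assemblers also have to handle sequencing errors, reverse-complemented reads, and repeats, which this sketch ignores.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is also a
    prefix of `b`, or 0 if no such overlap of at least `min_len` exists."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest exact
    overlap until no overlap of at least `min_len` remains.
    The sequences left at the end are the toy 'contigs'."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical reads, each sharing 4 bases with the next one.
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

Reads that share no sufficiently long overlap stay as separate contigs. Ordering and orienting such contigs relative to one another is where paired-read information comes in.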
A genome assembly is considered good quality on the basis of
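The most commonly used metric for evaluating a new genome assembly is N50: the length of the shortest scaffold or contig such that all scaffolds or contigs of that length or longer together make up at least 50% of the assembly. For example, for contigs of 80, 70, 50, 40, 30 and 20 kb (290 kb in total), the N50 is 70 kb, because the two longest contigs (80 + 70 = 150 kb) are the first to cover half of the total. Our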