Report on Assembly Recommendations

Version 1 May 2023 - confirmed after Lausanne EBP Phase 2 meeting

Version 2 October 2023 - added Workflows section

Version 3.0 - June 2024


To accompany the Recommendations, the EBP provides Sequencing and Assembly Standards.

Our previous document provided standards to be achieved for reference genomes. The key elements are >1 Mb contiguity (contig NG50), >90% of sequence assigned to chromosomal scaffolds, and an error rate of <1 in 10,000 bases (Q40), summarised together as 6.C.Q40.  In addition, there are further requirements including identifying and separating out organellar genomes, contaminants, sex chromosomes, etc.
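As a minimal sketch of how the three headline thresholds above could be checked, the following Python computes contig NG50 against an estimated genome size, converts an error rate to a Phred-scaled QV, and tests the chromosomal fraction. The function names and the input layout are illustrative assumptions, not part of the EBP standards document.

```python
import math


def contig_ng50(contig_lengths, genome_size):
    """Largest length L such that contigs of length >= L sum to at least
    half of the (estimated) genome size. Returns 0 if they never do."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered * 2 >= genome_size:
            return length
    return 0


def error_rate_to_qv(error_rate):
    """Phred scale: QV = -10 * log10(error rate), so 1 in 10,000 is Q40."""
    return -10.0 * math.log10(error_rate)


def headline_checks(contig_lengths, genome_size, chromosomal_bases,
                    assembled_bases, error_rate):
    """Evaluate the three thresholds quoted in the text (hypothetical helper)."""
    return {
        "contig_ng50_gt_1Mb": contig_ng50(contig_lengths, genome_size) > 1_000_000,
        "chromosomal_gt_90pct": chromosomal_bases / assembled_bases > 0.90,
        "qv_gt_40": error_rate_to_qv(error_rate) > 40.0,
    }
```

Note that NG50 is measured against the estimated genome size rather than the assembly length, so an incomplete assembly cannot inflate the value.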

The standards document intentionally did not provide requirements for primary data or software to be used. There are competitive alternatives at all steps. Furthermore, technologies change, both for sequencing and computational methods, and different combinations of data can be used to achieve these goals.  However, it is useful to provide concrete recommendations based on the experience of EBP major projects, at least as an exemplar to help others plan and get going.


It is our experience that to achieve these standards it is necessary to use both:

  1. Long read data for the primary assembly. Current options (mid-2024) are:

  • Pacific Biosciences CCS: very high accuracy, typically 10-20 kb length

  • Oxford Nanopore (ONT): moderate to high accuracy using the R10.4 chemistry, length from 10-1000 kb with yield decreasing with higher mean length. Accuracy of ONT reads can now be considerably increased up to the approximate level of CCS using the duplex process, or by error correction e.g. with herro.


  2. Some form of long range scaffolding data.  Primary options are:

  • Hi-C data (short read pairs from proximity ligation). Widely used commercial kits for assembly scaffolding by EBP projects include Dovetail Omni-C, Arima Hi-C and Phase Genomics Hi-C, but other Hi-C library making options are possible.  A possible alternative is long read Pore-C data.

  • Ultra-long reads (currently from ONT), although these may still fall short of producing whole chromosomes.

If the read based assembly delivers chromosomal contigs, as can happen in some cases when ultra-long reads are included, then additional scaffolding is not required.


Additional information that can be very valuable includes:

1. WGS data from the parents, if available, for essentially complete haplotype separation and validation.  Short read data are frequently used for this.

2. Deep, high-accuracy paired-end short-read WGS data, for example for base-pair accuracy “polishing”.

Less easily available and less used these days are:

3. BioNano optical restriction maps.

4. Strand-Seq data for contig orientation in scaffolding, where available.  We note that this requires a cell line to make the library.

5. Genetic maps built from crosses with reasonable marker density.  Excellent for chromosomal (formally linkage group) assignment and long range ordering, but comparatively low resolution.

6. Fluorescence in situ hybridisation (FISH) data. Low throughput but high information. Good for resolving complex medium- to long-range structural assignment, and confirming breakpoints.

7. Identification of telomeric repeat arrays at the end of presumed "complete chromosomal" scaffolds or contigs. We note that not all taxa (e.g. drosophilid Diptera) have short tandem repeats as telomeric arrays, and some taxa (e.g. Saccharomyces yeasts) have arrays that contain a number of related short sequences (i.e. are not strictly tandem repeats of identical units).
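To illustrate item 7, here is a rough sketch of checking the two ends of a scaffold for a tandem telomeric array. It assumes the vertebrate-type motif TTAGGG and illustrative window and repeat-count thresholds; as noted above, many taxa use other motifs or non-identical units, and dedicated tools such as tidk are more appropriate in practice.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def longest_tandem_run(window, motif):
    """Largest number of back-to-back copies of motif found in window."""
    best = 0
    for start in range(len(window) - len(motif) + 1):
        run, pos = 0, start
        while window.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best


def telomere_at_ends(seq, motif="TTAGGG", window=2000, min_units=10):
    """Return (start_has_array, end_has_array) for one scaffold.

    motif, window and min_units are assumed defaults for illustration only.
    Both strands of the motif are checked at each end.
    """
    seq = seq.upper()
    motifs = (motif.upper(), reverse_complement(motif))
    start_run = max(longest_tandem_run(seq[:window], m) for m in motifs)
    end_run = max(longest_tandem_run(seq[-window:], m) for m in motifs)
    return start_run >= min_units, end_run >= min_units
```

A scaffold for which both values are true is a candidate telomere-to-telomere chromosome, subject to the caveats about taxa without simple tandem arrays.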


We provide three recommended specific sequencing recipes based on experience as of May 2024.  These are typically adequate to achieve the standards, but won’t work 100% of the time for 100% of the organisms.  First for assembly:

A. CCS-based

a. 15x per haplotype PacBio CCS (HiFi) data.  For a diploid organism this means 30x.  For higher ploidy correspondingly higher depth is required. 

B. ONT-based-1

a. >20x ONT data per haplotype with read N50 ~50 kb and at least 5x coverage >100 kb.  Error correction of this data using herro.

b. >25x per haplotype paired end short read data for polishing.

C. ONT-based-2

a. >15x per haplotype ONT duplex data.

b. >25x per haplotype paired end short read data for polishing.

Whichever of these is used, you should start from a single organism of your species for all the long read data and the short read polishing data. If the species is known to have sex chromosomes, an individual of the heterogametic sex (e.g. ZW or XY, or the UV stage) should be chosen where available. The polishing data can come from a linked read library, or even a non-restriction-enzyme based Hi-C library such as Dovetail’s Omni-C, but in either of these cases the data must be sequenced to greater depth to compensate for greater unevenness in coverage.

In addition to the assembly data (A, B or C) we recommend obtaining >50x Hi-C data for scaffolding a diploid.  Ideally this will also be from the same individual, but if that is not possible, e.g. because the specimen was too small, it is possible to scaffold with Hi-C data from another organism of the same species. Ideally, this specimen should be of the same sex if the species has genotypic differences between the sexes (e.g., differentiated sex chromosomes). Omni-C data can double up for scaffolding and polishing, but remember that you then need greater depth (because of uneven coverage), e.g. 100x for a diploid; if the sample is polyploid you will need correspondingly greater depth, i.e. 50x per haplotype.
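The per-haplotype depths in recipes A to C translate into minimum sequencing yields once the haploid genome size and ploidy are known. The sketch below works through that arithmetic. The dictionary layout and function name are illustrative, and expressing the >50x diploid Hi-C recommendation as 25x per haplotype is an assumption made here so that it scales with ploidy.

```python
# Minimum depth per haplotype for each recipe, taken from the text above.
RECIPES = {
    "A": {"long_read": 15, "short_read": 0},   # PacBio CCS (HiFi)
    "B": {"long_read": 20, "short_read": 25},  # ONT with herro correction
    "C": {"long_read": 15, "short_read": 25},  # ONT duplex
}

# >50x Hi-C for a diploid, written per haplotype (assumption: scales with ploidy).
HIC_DEPTH_PER_HAPLOTYPE = 25


def minimum_yield_gb(haploid_size_gb, ploidy, recipe):
    """Minimum yield in Gb for each data type under the chosen recipe."""
    depths = RECIPES[recipe]
    gb_per_x = haploid_size_gb * ploidy  # Gb needed for 1x of every haplotype
    return {
        "long_read_gb": depths["long_read"] * gb_per_x,
        "short_read_gb": depths["short_read"] * gb_per_x,
        "hic_gb": HIC_DEPTH_PER_HAPLOTYPE * gb_per_x,
    }
```

For example, a diploid with a 1 Gb haploid genome under recipe A needs at least 30 Gb of CCS data and 50 Gb of Hi-C data; these are lower bounds, since the recipes state the depths as minima.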

Next we give an outline of a recommended assembly pipeline, based on experience as of February 2022 with subsequent updates in May 2023 and May 2024, with options for the different data types.

  1. Register your sample for a tolid (Tree of Life ID) at id.tol.sanger.ac.uk and name your assembly after this tolid.

  2. Build k-mer tables for CCS and short read data, and use them to evaluate genome size, heterozygosity, mean coverage and ploidy.  Keep the k-mer tables - they are useful later. There are multiple tool-chains for this, using incompatible file formats (sadly):

    a. FastK, followed by GENESCOPE.FK.  Also MERQURY.FK Smudge if you suspect polyploidy. 

    b. Meryl with GenomeScope2 and Merqury.

    c. KMC (a possible alternative).

  3. Assemble the long reads into contigs.  We now strongly recommend an assembler that separates the haplotypes, e.g. HiFiAsm or Verkko2. If you have parental Illumina data then use trio-binning mode.

  4. These assemblers create either hap1/hap2 pairs of pseudo-chromosomes if scaffolding data are included in the assembly, or primary and alternate sets of contigs.  Even when they do this it may be necessary to remove haplotypic duplicates from the primary contigs using purge_dups, before scaffolding.

  5. Also remove contaminants and co-bionts.  Software such as BlobToolkit, FCS-GX and Kraken2 can help identify and mark these for removal.

  6. Identify and separate out contigs corresponding to organelles (mitochondrion, plastid for plants, potentially others for some organisms).  For animal mitochondria, reassemble with MitoHiFi (for CCS data) or MitoVGP (for ONT or CLR data).  Non-animal mitochondria and plastids are more complex, with long repeats that support recombination and variable topologies; Oatk is specifically designed to address them.

  7. Scaffold with Hi-C data using YaHS (or SALSA2 or HiRise for Omni-C, or 3D-DNA).

  8. If the contigs were made with ONT data (options B or C) you then should polish with short read data using bwa (to map), FreeBayes or DeepVariant (to variant call), Merfin (to filter calls) and bcftools (to apply corrections).  If the contigs were made with CCS data then their accuracy will be much higher to start with, but it can potentially be improved by remapping the CCS reads with WinnowMap, calling variants with DeepVariant, and then correcting as above. Correction with short reads may also improve accuracy, particularly for indels, but care must be taken to avoid inserting errors because of less certainty of short read mapping. It is important to map the polishing reads to all the assembly material (primary, alternate, contamination, organelles) to avoid mismapping and false calls that occur if part of the DNA is missing.
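To make step 2 of the outline more concrete, the following sketch shows the basic idea behind estimating genome size from a k-mer count histogram: divide the total number of k-mer occurrences by the depth of the main peak. This is a deliberately crude approximation under stated assumptions, not what GENESCOPE.FK or GenomeScope2 do (they fit a mixture model), and the function name and min_depth cutoff are hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Crude genome size estimate from a k-mer spectrum.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    K-mers below min_depth are treated as sequencing errors and ignored.
    Returns (estimated size in k-mers, depth of the tallest remaining peak).
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=lambda depth: solid[depth])
    total_occurrences = sum(depth * n for depth, n in solid.items())
    return total_occurrences / peak_depth, peak_depth
```

In a highly heterozygous diploid the tallest peak may be the heterozygous (half-depth) peak, in which case this estimate approaches the total diploid content rather than the haploid size; that ambiguity is one reason to inspect the fitted model from a dedicated tool.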


This process will generate assemblies, but it is then necessary to carry out a curation and QC step:

  1. Recheck for contamination using FCS-GX and remove contaminated contigs/regions.

  2. Re-map the Hi-C data, and review/edit scaffolds in PretextView or JuiceBox to generate a set of large scaffolds that you trust to represent chromosomes.  Workflows for this are available (GRIT Rapid Curation, TreeVal). Some small contigs/scaffolds will remain in the primary assembly, and can be submitted as unlocalised, but the goal is for these to represent less than 10% of the sequence.

  3. Assign chromosome numbers to chromosomal scaffolds, using genetic data from the species if available to connect to established chromosome/linkage group nomenclature (taking care to conform to the established orientation), or numerically in size order, taking into account any karyotype information that is available. If there is complete one-to-one orthology to the chromosome set of a closely related species with an established chromosome nomenclature, then it is acceptable to adopt that nomenclature. 

  4. Confirm unique single copy material representation using GENESCOPE.FK.

  5. Estimate base pair accuracy using MERQURY.FK or Merqury or yak.

  6. Estimate single copy gene representation completeness with BUSCO.
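For step 5, the k-mer based accuracy estimate can be illustrated with the formula described in the Merqury paper: k-mers found in the assembly but absent from the reads are treated as errors, and the per-base error rate is derived from their fraction. The sketch below assumes the two k-mer counts have already been obtained from one of the tools listed; the function name is illustrative.

```python
import math


def kmer_qv(assembly_only_kmers, total_assembly_kmers, k):
    """Consensus QV from k-mer counts.

    assembly_only_kmers: k-mers present in the assembly but not in the reads.
    total_assembly_kmers: all k-mers in the assembly.
    A base is taken as correct with probability (1 - fraction) ** (1 / k).
    """
    if assembly_only_kmers == 0:
        return math.inf  # no unsupported k-mers observed
    fraction = assembly_only_kmers / total_assembly_kmers
    p_base_correct = (1.0 - fraction) ** (1.0 / k)
    return -10.0 * math.log10(1.0 - p_base_correct)
```

With k = 21, roughly 21 unsupported k-mers per million corresponds to about one error per million bases (about Q60), since a single base error typically disrupts k consecutive k-mers.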

It is important that when you submit the finished assembly to the public INSDC databases you also submit the primary data used for the assembly process, and link it via a shared BioProject identifier.  This permits a variety of scientific downstream analyses unbiased by the assembly, as well as independent generation of summary metrics and potential third party evaluation of assembly queries. 


Telomere to Telomere (T2T) assembly:

The recommendations above are designed to achieve the EBP assembly quality target.  While this standard ensures good reference genomes, and in some species some chromosomes may be contiguous, such assemblies typically include a number of gaps within chromosomes (frequently hundreds, for some species thousands), and some assembled contigs that cannot be placed in the chromosomal scaffolds.  Gaps typically occur at highly repetitive sequence such as centromeres, ribosomal DNA clusters or long segmental repeats, or where there is read drop-out because of sequence composition.  For example, PacBio CCS as of the end of 2021 could lose coverage in very GA-rich regions.
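A simple way to see how far a scaffolded assembly is from T2T is to count the gaps in each chromosomal scaffold. The sketch below assumes gaps are represented as runs of N in the scaffold sequence, which is the usual FASTA convention; the function names and the min_gap parameter are illustrative.

```python
import re


def count_gaps(sequence, min_gap=1):
    """Number of runs of N (either case) at least min_gap bases long."""
    return sum(1 for match in re.finditer(r"[Nn]+", sequence)
               if len(match.group()) >= min_gap)


def gaps_per_scaffold(scaffolds, min_gap=1):
    """Map scaffold name -> gap count for a dict of name -> sequence."""
    return {name: count_gaps(seq, min_gap) for name, seq in scaffolds.items()}
```

A scaffold with zero gaps and telomeric arrays at both ends is a reasonable candidate for a complete chromosome, although collapsed repeats can still hide missing sequence.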

In 2022, a complete assembly of a human genome (the CHM13 double haploid cell line) was published, which includes all chromosomal sequence without gaps. There are efforts to develop a standard pipeline to attain this goal for other species, including for heterozygous diploid samples. Current recommendations are to generate >60x PacBio CCS sequence and >40x ultralong (>100 kb) ONT sequence, for a diploid, in addition to Hi-C for confirmation. A dedicated assembler for this configuration is Verkko, and recent versions of HiFiAsm also support integration of CCS, ultralong ONT and Hi-C data. There are also investigations into an ONT-only recipe, using APK, ULK and Pore-C/Hi-C.


Workflows

We support EBP projects using workflows registered in https://workflowhub.eu/.

Workflows developed using the approaches and tools listed above (and most likely others) are available under these WorkflowHub collections


General references

  1. Rhie, Arang, Shane A. McCarthy, Olivier Fedrigo, Joana Damas, Giulio Formenti, Sergey Koren, Marcela Uliano-Silva, et al. 2021. “Towards Complete and Error-Free Genome Assemblies of All Vertebrate Species.” Nature 592 (7856): 737–46.

    - older CLR pipeline available at https://github.com/VGP/vgp-assembly

  2. Larivière, Delphine, Linelle Abueg, Nadolina Brajuka, Cristóbal Gallardo-Alba, Bjorn Grüning, Byung June Ko, Alex Ostrovsky, et al. 2023. “Scalable, Accessible, and Reproducible Reference Genome Assembly and Evaluation in Galaxy.” bioRxiv. https://doi.org/10.1101/2023.06.28.546576.

    - Galaxy pipeline tutorial available at https://training.galaxyproject.org/training-material/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html

  3. Kerstin Howe, William Chow, Joanna Collins, Sarah Pelan, Damon-Lee Pointon, Ying Sims, James Torrance, Alan Tracey, Jonathan Wood, “Significantly improving the quality of genome assemblies through curation”, GigaScience, Volume 10, Issue 1, January 2021, giaa153, https://doi.org/10.1093/gigascience/giaa153

  4. Heng Li and Richard Durbin “Genome assembly in the telomere-to-telomere era”, Nat Rev Genet. 2024 doi:10.1038/s41576-024-00718-w


Tool/workflow references

  5. HiFiAsm Cheng, Haoyu, Gregory T. Concepcion, Xiaowen Feng, Haowen Zhang, and Heng Li. 2021. “Haplotype-Resolved de Novo Assembly Using Phased Assembly Graphs with Hifiasm.” Nature Methods 18 (2): 170–75.

    - https://github.com/chhylp123/hifiasm

  6. HiCanu Nurk, Sergey, Brian P. Walenz, Arang Rhie, Mitchell R. Vollger, Glennis A. Logsdon, Robert Grothe, Karen H. Miga, Evan E. Eichler, Adam M. Phillippy, and Sergey Koren. 2020. “HiCanu: Accurate Assembly of Segmental Duplications, Satellites, and Allelic Variants from High-Fidelity Long Reads.” Genome Research 30 (9): 1291–1305.

  7. Verkko Rautiainen, Mikko, Sergey Nurk, Brian P. Walenz, Glennis A. Logsdon, David Porubsky, Arang Rhie, Evan E. Eichler, Adam M. Phillippy, and Sergey Koren. 2023. “Telomere-to-Telomere Assembly of Diploid Chromosomes with Verkko.” Nature Biotechnology, February. https://doi.org/10.1038/s41587-023-01662-6.

    - https://github.com/marbl/verkko

  8. Oatk

    - https://github.com/c-zhou/oatk

  9. FASTK

    - https://github.com/thegenemyers/FASTK

    - https://github.com/thegenemyers/GENESCOPE.FK

    - https://github.com/thegenemyers/MERQURY.FK

  10. KMC Marek Kokot, Maciej Długosz, Sebastian Deorowicz. 2017. “KMC 3: counting and manipulating k-mer statistics” Bioinformatics 33:2759–2761. https://doi.org/10.1093/bioinformatics/btx304

    - https://github.com/refresh-bio/KMC

  11. GenomeScope Vurture, Gregory W., Fritz J. Sedlazeck, Maria Nattestad, Charles J. Underwood, Han Fang, James Gurtowski, and Michael C. Schatz. 2017. “GenomeScope: Fast Reference-Free Genome Profiling from Short Reads.” Bioinformatics 33 (14): 2202–4.

    - https://github.com/schatzlab/genomescope

  12. Meryl/Merqury Rhie, A., Walenz, B.P., Koren, S. et al. 2020. “Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies.” Genome Biol 21:245. https://doi.org/10.1186/s13059-020-02134-9

    - https://github.com/marbl/meryl

  13. Shasta Shafin, K., Pesout, T., Lorig-Roach, R. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol 38, 1044–1053 (2020). https://doi.org/10.1038/s41587-020-0503-6

    - https://github.com/paoloshasta/shasta 

  14. Flye Mikhail Kolmogorov, Derek M. Bickhart, Bahar Behsaz, Alexey Gurevich, Mikhail Rayko, Sung Bong Shin, Kristen Kuhn, Jeffrey Yuan, Evgeny Polevikov, Timothy P. L. Smith and Pavel A. Pevzner "metaFlye: scalable long-read metagenome assembly using repeat graphs", Nature Methods, 2020 https://doi.org/10.1038/s41592-020-00971-x

    - https://github.com/fenderglass/Flye

  15. Marvel

    - https://github.com/schloi/MARVE

  16. Falcon-unzip Chen-Shan Chin, Paul Peluso, Fritz J Sedlazeck, Maria Nattestad, Gregory T Concepcion, Alicia Clum, Christopher Dunn, Ronan O'Malley, Rosa Figueroa-Balderas, Abraham Morales-Cruz, Grant R Cramer, Massimo Delledonne, Chongyuan Luo, Joseph R Ecker, Dario Cantu, David R Rank, Michael C Schatz “Phased diploid genome assembly with single-molecule real-time sequencing”, Nat Methods 2016

    - https://github.com/PacificBiosciences/FALCON/wiki/FALCON-FALCON-Unzip-%22For-Phased-Diploid-Genome-Assembly-with-Single-Molecule-Real-Time-Sequencing%22

  17. wtdbg2 Ruan, J. and Li, H. “Fast and accurate long-read assembly with wtdbg2”. Nat Methods 2019

    - https://github.com/ruanjue/wtdbg2

  18. YaHS Chenxi Zhou, Shane A McCarthy, Richard Durbin “YaHS: yet another Hi-C scaffolding tool” Bioinformatics 2023

    - https://github.com/c-zhou/yahs

  19. SALSA2 Jay Ghurye, Arang Rhie, Brian P. Walenz, Anthony Schmitt, Siddarth Selvaraj, Mihai Pop, Adam M. Phillippy, Sergey Koren “Integrating Hi-C links with assembly graphs for chromosome-scale assembly” PLOS Computational Biology 2019

    - https://github.com/marbl/SALSA

  20. HiRise

    - https://github.com/DovetailGenomics/HiRise_July2015_GR

  21. 3D-DNA Dudchenko, O., Batra, S.S., Omer, A.D., Nyquist, S.K., Hoeger, M., Durand, N.C., Shamim, M.S., Machol, I., Lander, E.S., Aiden, A.P., et al. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. Apr 7; 356(6333):92-95. doi: https://doi.org/10.1126/science.aal3327. Epub 2017 Mar 23.

    - https://github.com/aidenlab/3d-dna

  22. purge_dups Dengfeng Guan, Shane A McCarthy, Jonathan Wood, Kerstin Howe, Yadong Wang, Richard Durbin (2020). Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, Volume 36, Issue 9, May 2020, Pages 2896–2898

    - https://github.com/dfguan/purge_dups

  23. BlobToolKit Laetsch DR, Blaxter ML (2017) BlobTools: Interrogation of genome assemblies. F1000Research https://f1000research.com/articles/6-1287/v1

    - https://github.com/blobtoolkit/blobtoolkit?tab=readme-ov-file

  24. Kraken2 Derrick E. Wood, Jennifer Lu & Ben Langmead (2019) Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257

    - https://github.com/DerrickWood/kraken2?tab=readme-ov-file

  25. MitoHiFi Marcela Uliano-Silva, João Gabriel R. N. Ferreira, Ksenia Krasheninnikova, Darwin Tree of Life Consortium, Giulio Formenti, Linelle Abueg, James Torrance, Eugene W. Myers, Richard Durbin, Mark Blaxter & Shane A. McCarthy (2023) MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinformatics volume 24, Article number: 288

    - https://github.com/marcelauliano/MitoHiFi

  26. MitoVGP Giulio Formenti, Arang Rhie, Jennifer Balacco, Bettina Haase, Jacquelyn Mountcastle, Olivier Fedrigo, Samara Brown, Marco Rosario Capodiferro, Farooq O. Al-Ajli, Roberto Ambrosini, Peter Houde, Sergey Koren, Karen Oliver, Michelle Smith, Jason Skelton, Emma Betteridge, Jale Dolucan, Craig Corton, Iliana Bista, James Torrance, Alan Tracey, Jonathan Wood, Marcela Uliano-Silva, Kerstin Howe, Shane McCarthy, Sylke Winkler, Woori Kwak, Jonas Korlach, Arkarachai Fungtammasan, Daniel Fordham, Vania Costa, Simon Mayes, Matteo Chiara, David S. Horner, Eugene Myers, Richard Durbin, Alessandro Achilli, Edward L. Braun, Adam M. Phillippy, Erich D. Jarvis & The Vertebrate Genomes Project Consortium (2021) Complete vertebrate mitogenomes reveal widespread repeats and gene duplications. Genome Biology volume 22, Article number: 120

    - https://github.com/gf777/mitoVGP

  27. Medaka

    - https://github.com/nanoporetech/medaka

  28. Arrow

    - https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v10.1.pdf

  29. bwa Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2

    - https://github.com/lh3/bwa

  30. DeepVariant

    - https://github.com/google/deepvariant

  31. Merfin Formenti, G., Rhie, A., Walenz, B.P. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods (2022). https://doi.org/10.1038/s41592-022-01445-y

    - https://github.com/arangrhie/merfin

  32. bcftools Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li (2021) Twelve years of SAMtools and BCFtools. GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

    - https://github.com/samtools/bcftools

  33. winnowmap Chirag Jain, Arang Rhie, Nancy Hansen, Sergey Koren and Adam Phillippy (2022) Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods, 19, pages 705–710

    - https://github.com/marbl/Winnowmap

  34. FCS-gx Astashyn A, Tvedte ES, Sweeney D, Sapojnikov V, Bouk N, Joukov V, Mozes E, Strope PK, Sylla PM, Wagner L, Bidwell SL, Brown LC, Clark K, Davis EW, Smith-White B, Hlavina W, Pruitt KD, Schneider VA, Murphy TD (2024) Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol. 2024 Feb 26;25(1):60.

    - https://github.com/ncbi/fcs-gx

  35. PretextView

    - https://github.com/sanger-tol/PretextView

  36. JuiceBox Neva C. Durand, James T. Robinson, Muhammad S. Shamim, Ido Machol, Jill P. Mesirov, Eric S. Lander, and Erez Lieberman Aiden (2016) Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 2016 Jul; 3(1): 99–101.

    - https://github.com/aidenlab/Juicebox

  37. yak

    - https://github.com/lh3/yak

  38. BUSCO Mathieu Seppey, Mosè Manni, Evgeny M Zdobnov (2019) BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods Mol Biol: 1962:227-245. doi: 10.1007/978-1-4939-9173-0_14.

    - https://busco.ezlab.org/

  39. herro Stanojevic, D., Lin, D., Florez De Sessions, P., & Sikic, M. (2024). Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads. bioRxiv, 2024-05. doi:10.1101/2024.05.18.594796

    - https://github.com/lbcb-sci/herro

  40. LJA Anton Bankevich, Andrey V. Bzikadze, Mikhail Kolmogorov, Dmitry Antipov & Pavel A. Pevzner (2022) Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology volume 40, pages 1075–1081

    - https://github.com/AntonBankevich/LJA

  41. tidk Brown, M., González De la Rosa, P. M. and Mark, B. (2023) ‘A Telomere Identification Toolkit’. Zenodo. doi: 10.5281/zenodo.10091385.

    - https://github.com/tolkit/telomeric-identifier

  42. GRIT Rapid Curation

    - https://gitlab.com/wtsi-grit/rapid-curation

  43. TreeVal

    - https://pipelines.tol.sanger.ac.uk/treeval


ABOUT THE SUBCOMMITTEE

This Report on Assembly Standards was developed by EBP’s Scientific Subcommittee for Sequencing and Assembly.