Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases

Tørresen, Ole K; Star, Bastiaan; Mier, Pablo; Andrade-Navarro, Miguel A; Bateman, Alex; Jarnot, Patryk; Gruca, Aleksandra; Grynberg, Marcin; Kajava, Andrey V; Promponas, Vasilis J; Anisimova, Maria; Jakobsen, Kjetill S; Linke, Dirk

doi:10.1093/nar/gkz841

Please use this identifier to cite or link to this resource: https://doi.org/10.21256/zhaw-18481

Publication type:	Article in scientific journal
Type of review:	Peer review (publication)
Title:	Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases
Authors:	Tørresen, Ole K; Star, Bastiaan; Mier, Pablo; Andrade-Navarro, Miguel A; Bateman, Alex; Jarnot, Patryk; Gruca, Aleksandra; Grynberg, Marcin; Kajava, Andrey V; Promponas, Vasilis J; Anisimova, Maria; Jakobsen, Kjetill S; Linke, Dirk
et al.:	No
DOI:	10.1093/nar/gkz841; 10.21256/zhaw-18481
Published in:	Nucleic Acids Research
Volume(Issue):	47
Issue:	21
Page(s):	10994
Pages to:	11006
Issue date:	4-Oct-2019
Publisher / Ed. Institution:	Oxford University Press
ISSN:	0305-1048; 1362-4962
Language:	English
Keywords:	Genomics; Bioinformatics
Subject (DDC):	572: Biochemistry
Abstract:	The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where misannotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
URI:	https://digitalcollection.zhaw.ch/handle/11475/18481
Fulltext version:	Published version
License (according to publishing contract):	CC BY 4.0: Attribution 4.0 International
Department:	Life Sciences and Facility Management
Organisational unit:	Institute of Computational Life Sciences (ICLS)
Published as part of the ZHAW project:	Discovering evolutionary innovations by assessing variation and natural selection in protein tandem repeats
Appears in collections:	Publications Life Sciences and Facility Management

Files in this item:

File	Description	Size	Format
2019Toerresen_tandem-repeats-lead-to-sequence-assembly-errors_NucleidAcidsResearch.pdf		916.98 kB	Adobe PDF

Tørresen, O. K., Star, B., Mier, P., Andrade-Navarro, M. A., Bateman, A., Jarnot, P., Gruca, A., Grynberg, M., Kajava, A. V., Promponas, V. J., Anisimova, M., Jakobsen, K. S., & Linke, D. (2019). Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Research, 47(21), 10994–11006. https://doi.org/10.1093/nar/gkz841

Tørresen, O.K. et al. (2019) ‘Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases’, Nucleic Acids Research, 47(21), pp. 10994–11006. Available at: https://doi.org/10.1093/nar/gkz841.

O. K. Tørresen et al., “Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases,” Nucleic Acids Research, vol. 47, no. 21, pp. 10994–11006, Oct. 2019, doi: 10.1093/nar/gkz841.

TØRRESEN, Ole K, Bastiaan STAR, Pablo MIER, Miguel A ANDRADE-NAVARRO, Alex BATEMAN, Patryk JARNOT, Aleksandra GRUCA, Marcin GRYNBERG, Andrey V KAJAVA, Vasilis J PROMPONAS, Maria ANISIMOVA, Kjetill S JAKOBSEN and Dirk LINKE, 2019. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Research. 4 October 2019. Vol. 47, no. 21, pp. 10994–11006. DOI 10.1093/nar/gkz841

Tørresen, Ole K, Bastiaan Star, Pablo Mier, Miguel A Andrade-Navarro, Alex Bateman, Patryk Jarnot, Aleksandra Gruca, et al. 2019. “Tandem Repeats Lead to Sequence Assembly Errors and Impose Multi-Level Challenges for Genome and Protein Databases.” Nucleic Acids Research 47 (21): 10994–1006. https://doi.org/10.1093/nar/gkz841.

Tørresen, Ole K., et al. “Tandem Repeats Lead to Sequence Assembly Errors and Impose Multi-Level Challenges for Genome and Protein Databases.” Nucleic Acids Research, vol. 47, no. 21, Oct. 2019, pp. 10994–1006, https://doi.org/10.1093/nar/gkz841.
