Brownotate, a Comprehensive Solution to Generate Protein Sequence Databases for Any Species.

Publication record

Publication date

January 2026

Journal

Proteomics

Authors

Identified members of Cancéropôle Est:
Dr CIANFERANI Sarah, Dr CARAPITO Christine

All authors:
Brown A, Burel A, Cianférani S, Carapito C, Bertile F

PubMed link

https://www.ncbi.nlm.nih.gov/pubmed/41493144

Abstract

Proteomics is strengthening research in biology, and the diversification of the model organisms studied is very promising for fully understanding the complexity of biological principles. However, the lack of protein sequence databases for many species is a major bottleneck. Existing computational solutions are usually incomplete and/or only usable by bioinformaticians. We have built an open-source, user-friendly pipeline, called Brownotate, which allows anyone to generate protein sequence databases for any species as long as sequencing information is available. The pipeline can extract already existing protein sequences, but can also automatically annotate any genome assembly or assemble and annotate any DNA sequence dataset. By testing the pipeline with numerous sequencing and assembly datasets covering a large part of the phylogenetic tree, we show that Brownotate generates fragmented but good-quality assemblies and good-quality annotations when compared to reference data. When protein databases generated by Brownotate or downloaded from NCBI are used to interpret proteomic data, the results are very comparable. The Brownotate pipeline is, therefore, an important new addition to the proteomics toolbox. The pipeline and its web interface are freely available at https://github.com/LSMBO/Brownotate and https://github.com/LSMBO/brownotate-app, respectively.

SUMMARY: This study evaluated the performance of a newly developed pipeline, Brownotate, for the assembly and annotation of sequencing data for multiple species, from prokaryotes to eukaryotes. We compared the resulting assemblies and annotations with reference data in terms of fragmentation level (assembly) and completeness based on evolutionary expectations of gene content, and we evaluated their overlap. Brownotate generated fragmented, slightly less complete assemblies. However, the overlap of predicted proteins was very good, despite an excess of small predicted sequences with Brownotate. In addition, the interpretation of proteomics data downloaded from the PRIDE repository for 27 species was found to lead to very similar results regardless of the origin of the protein sequence database used, whether it was generated by Brownotate or downloaded from NCBI. Brownotate, made available to the community, will, therefore, be a tool of choice to mitigate the lack of an appropriate protein sequence database for many species, and will allow proteomists to analyse samples from species for which only sequencing data are available without delay.