RAT_COMBINED database for proteomics data mapping

INTRODUCTION
------------

In proteomics, tandem mass spectra are typically annotated by searching
them against spectra generated in silico from a publicly available protein
database. For rat, such a database is derived from the reference genome
assembly of the Brown Norway (BN) strain. To create a sample-specific
database for MS peptide searching, we extended the existing RefSeq-based
peptide database with strain-specific peptides and predicted peptides.
De novo assembled transcriptome data were added to characterize the
transcriptomes at nucleotide resolution, which allowed us to score editing
events and splice isoforms.

SEQUENCE DATABASE COMPILATION
-----------------------------

We downloaded the annotated Ensembl rat protein FASTA (build 3.4.63),
derived from the genome assembly of the BN strain, as our foundation.
To build an in-house rat protein database that is more comprehensive and
more precise, we then modified and extended the original database with
information derived from DNA re-sequencing and RNA sequencing (RNA-Seq) of
the BN-Lx and SHR strains used in this study.

For strain-specific isoforms, each original Ensembl protein entry that
contained variants between BN-Lx and SHR was replaced by the two allelic
variants. In the vast majority of cases the BN-Lx allele (whose genetic
background is more similar to that of the reference strain) represents the
original Ensembl entry. The exceptions are the rare cases where the
reference assembly (estimated error rate of 1 per 100 kb) disagreed with
the whole-genome sequencing (WGS) data from both rat strains.

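The replacement rule described above can be illustrated with a minimal
Python sketch. It is not the pipeline used to build the database: the
function name, the dictionary inputs and the way the suffix is appended to
the protein ID are assumptions made for illustration only. The suffixes
themselves ("_SAME", "_BNLX", "_SHR") are the ones listed under DATABASE
CONTENT.

```python
def build_entries(reference, bnlx, shr):
    """Apply the allele replacement rule to a set of protein entries.

    All three arguments are dicts mapping a protein ID to its amino-acid
    sequence (hypothetical inputs). `bnlx` and `shr` only need to contain
    the proteins whose strain sequence differs from the reference.
    """
    entries = {}
    for protein_id, ref_seq in reference.items():
        # Fall back to the reference sequence when a strain has no variant.
        bnlx_seq = bnlx.get(protein_id, ref_seq)
        shr_seq = shr.get(protein_id, ref_seq)
        if bnlx_seq == shr_seq:
            # Both strains agree: keep a single invariant entry. If both
            # disagree with the reference, the shared strain sequence wins.
            entries[protein_id + "_SAME"] = bnlx_seq
        else:
            # Strains differ: replace the entry by the two allelic variants.
            entries[protein_id + "_BNLX"] = bnlx_seq
            entries[protein_id + "_SHR"] = shr_seq
    return entries


# Toy example with made-up IDs and sequences.
reference = {"P1": "MKT", "P2": "MAA", "P3": "MCC"}
bnlx = {"P3": "MCD"}
shr = {"P2": "MAV", "P3": "MCD"}
for entry_id, seq in sorted(build_entries(reference, bnlx, shr).items()):
    print(entry_id, seq)
```

In this toy input, P2 differs between the strains and yields two entries,
while P3 differs from the reference in both strains and yields a single
invariant entry carrying the strain sequence.
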
DATABASE CONTENT
----------------

The following datasets are represented in the RAT_COMBINED.fasta file
supplied in this archive:

  26,785  proteins invariant between BN-Lx and SHR rats  ( IDs with suffix "_SAME" )
   6,187  protein isoforms specific to BN-Lx rats        ( IDs with suffix "_BNLX" )
   6,187  protein isoforms specific to SHR rats          ( IDs with suffix "_SHR" )
  47,896  predicted proteins                             ( IDs starting with "GENSCAN" )
   1,755  canonical and non-canonical editing events     ( IDs with suffix "_EDIT" )
   2,545  splice isoforms                                ( IDs with suffix "_SPLC" )
     253  proteins from Control set                      ( IDs starting with "CON" )
  ======
  91,607  entries in total

REFERENCES AND RESOURCES
------------------------

1. Low TY, van Heesch S, van den Toorn H, Giansanti P, Cristobal A,
   Toonen P, Schafer S, Hübner N, van Breukelen B, Mohammed S, Cuppen E,
   Heck AJ, Guryev V.
   Quantitative and Qualitative Proteome Characteristics Extracted from
   In-Depth Integrated Genomics and Proteomics Analysis.
   Cell Rep. (2013) 5:1469-1478.

2. BN-Lx and SHR genome data
   The Sequence Read Archive accession numbers for the DNA data are:
     ERP001355 (BN-Lx genome)
     ERP001371 (SHR genome)
     ERP000510 (BN reference genome)

3. BN-Lx and SHR Liver RNA sequencing data
   Stored in ArrayExpress under the accession number E-MTAB-1666.

4. BN-Lx and SHR Liver Proteome
   The ProteomeXchange accession number for the MS data is PXD000131.

CONTACT
-------

Edwin Cuppen:  e.cuppen@hubrecht.eu
Albert Heck:   a.j.r.heck@uu.nl
Victor Guryev: v.guryev@umcg.nl