Format of Processed Ribosomal Protein Pseudogene Flat Files

The pseudogene annotation files are tab-delimited, multi-field text files. All information relating to the ribosomal proteins was downloaded from the Ribosomal Protein Gene Database (RPDB).


Data Description

Each file has a header line describing the content in each column. The columns from left to right are:

  1. ID: unique identifier for each processed pseudogene, in the format chr$a_$b.$c, where $a is the chromosome name, $b is the name of the ribosomal protein as designated in RPDB, and $c is the sequential numbering of the pseudogene that matches protein $b on chromosome $a. Example: chr10_RPL7.2
  2. Short_ID: short version of the pseudogene ID, in the format $a_$b, where $a is the Swissprot protein accession number and $b is the sequential numbering of the pseudogene that matches protein $a in the whole genome. Example: RPL7_18
  3. Chr: chromosome name
  4. Chrom_start: starting coordinate of the pseudogene on the chromosome, based on Release 36 of Ensembl.
  5. Chrom_strand: "-" or "+"
  6. Query_protein: name of the query ribosomal protein as named in RPDB
  7. Query_start: starting amino acid number on the query protein that the pseudogene matches.
  8. Query_end: end amino acid number on the query protein that the pseudogene matches.
  9. Query_len: sequence length of the query protein in column 6.
  10. Match_length: fractional length of the pseudogene compared to the query protein.
  11. E-value: expect value of the pseudogene in the TBLASTX search.
  12. AA_ident: amino acid sequence identity between the pseudogene and query protein.
  13. Polya: "0", "1", "2", or "3".
    • "0": no polyA tail (> 30 A in a 50 bp window) detected for the pseudogene
    • "1": has polyA tail and a polyadenylation signal within 50 bp of the beginning of the tail
    • "2": has polyA tail and a polyadenylation signal within 50-100 bp of the beginning of the tail
    • "3": has polyA tail but no polyadenylation signal detected
  14. Disable: "0", "d", or "D".
    • "0": no disablement
    • "d": disablement in a region of low sequence identity
    • "D": disablement in a region of high sequence identity
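
Reading the Files (Illustrative Sketch)

The column layout above can be read with a few lines of Python. The following is a minimal sketch, not an official parser. It assumes that the header line uses exactly the column names listed above (ID, Short_ID, ..., Polya, Disable), which the files themselves should be checked against. The helper names parse_id and read_pseudogenes, the added keys id_chrom, id_protein, id_index, polya_meaning, and disable_meaning, and the sample row are all hypothetical. Only the ID and Short_ID values in the sample come from the examples above; the other values are invented for illustration.

```python
import csv
import io
import re

# Meanings of the Polya (column 13) and Disable (column 14) codes,
# paraphrased from the column descriptions above.
POLYA_CODES = {
    "0": "no polyA tail",
    "1": "polyA tail, signal within 50 bp of tail start",
    "2": "polyA tail, signal within 50-100 bp of tail start",
    "3": "polyA tail, no polyadenylation signal",
}
DISABLE_CODES = {
    "0": "no disablement",
    "d": "disablement in low-identity region",
    "D": "disablement in high-identity region",
}

# ID format chr$a_$b.$c, e.g. chr10_RPL7.2
ID_PATTERN = re.compile(r"^chr(?P<chrom>[^_]+)_(?P<protein>.+)\.(?P<index>\d+)$")


def parse_id(pseudogene_id):
    """Split an ID such as chr10_RPL7.2 into (chromosome, protein, index)."""
    match = ID_PATTERN.match(pseudogene_id)
    if match is None:
        raise ValueError("unexpected ID format: %r" % pseudogene_id)
    return match.group("chrom"), match.group("protein"), int(match.group("index"))


def read_pseudogenes(handle):
    """Yield one dict per data line, keyed by the header-line column names."""
    # Assumption: the header names match the column names documented above.
    for row in csv.DictReader(handle, delimiter="\t"):
        chrom, protein, index = parse_id(row["ID"])
        row["id_chrom"] = chrom
        row["id_protein"] = protein
        row["id_index"] = index
        row["polya_meaning"] = POLYA_CODES.get(row["Polya"], "unknown code")
        row["disable_meaning"] = DISABLE_CODES.get(row["Disable"], "unknown code")
        yield row


# Invented sample data; only the ID and Short_ID values are taken from the text.
HEADER = ["ID", "Short_ID", "Chr", "Chrom_start", "Chrom_strand",
          "Query_protein", "Query_start", "Query_end", "Query_len",
          "Match_length", "E-value", "AA_ident", "Polya", "Disable"]
EXAMPLE = ["chr10_RPL7.2", "RPL7_18", "10", "1000", "+",
           "RPL7", "1", "100", "248", "0.40", "1e-20", "0.85", "1", "D"]
sample = "\t".join(HEADER) + "\n" + "\t".join(EXAMPLE) + "\n"

for record in read_pseudogenes(io.StringIO(sample)):
    print(record["ID"], record["id_protein"],
          record["polya_meaning"], record["disable_meaning"])
```

To read a real file, an open file handle can be passed in place of the io.StringIO object. All values are returned as strings, so numeric columns such as Chrom_start or E-value would still need converting with int() or float() before use.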