Refreshes the protein references for all peptide hits from an idXML file.
pot. predecessor tools | → PeptideIndexer → | pot. successor tools
IDFilter or any protein/peptide processing tool | | FalseDiscoveryRate
Each peptide hit is annotated with a target_decoy string, indicating whether the peptide sequence is found in a 'target' protein, a 'decoy' protein, or both ('target+decoy'). This information is crucial for the FalseDiscoveryRate and IDPosteriorErrorProbability tools.
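The annotation rule can be illustrated with a minimal sketch. It is not the tool's implementation: the function name `classify_target_decoy` is hypothetical, and it assumes that a protein counts as a decoy when its accession ends with (or, with `prefix=True`, starts with) the decoy string, mirroring the 'decoy_string' and 'prefix' options listed in the command line parameters.

```python
def classify_target_decoy(accessions, decoy_string="_rev", prefix=False):
    """Return 'target', 'decoy' or 'target+decoy' for one peptide hit.

    accessions: accessions of all proteins in which the peptide was found.
    Assumption: decoys are recognised purely by accession suffix/prefix.
    """
    def is_decoy(accession):
        if prefix:
            return accession.startswith(decoy_string)
        return accession.endswith(decoy_string)

    has_decoy = any(is_decoy(a) for a in accessions)
    has_target = any(not is_decoy(a) for a in accessions)

    if has_target and has_decoy:
        return "target+decoy"
    if has_decoy:
        return "decoy"
    if has_target:
        return "target"
    # An empty accession list corresponds to an unmatched peptide.
    raise ValueError("peptide has no protein references")
```

For example, a peptide found in both `P1` and `P2_rev` would be labelled 'target+decoy' under these assumptions.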
This tool supports relative database filenames, which (when not found in the current working directory) are looked up in the directories specified by 'OpenMS.ini:id_db_dir' (see TOPP for Advanced Users).
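The lookup order described above might look roughly like the following sketch. The function `resolve_database` and its parameters are hypothetical; in particular, `search_dirs` stands in for the directories configured under 'OpenMS.ini:id_db_dir', and how the tool reads that setting is not shown here.

```python
from pathlib import Path


def resolve_database(filename, search_dirs, cwd=None):
    """Return the first existing path for `filename`, or None if not found.

    The current working directory is tried first, then each directory in
    `search_dirs` in the order given (an assumption of this sketch).
    """
    path = Path(filename)
    if path.is_absolute():
        return path if path.is_file() else None

    base = Path(cwd) if cwd is not None else Path.cwd()
    for directory in [base, *map(Path, search_dirs)]:
        candidate = directory / path
        if candidate.is_file():
            return candidate
    return None
```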
By default, the tool fails if an unmatched peptide occurs, i.e., the database does not contain the corresponding protein. You can force the tool to return successfully in this case by setting the flag 'allow_unmatched'.
Some search engines (such as Mascot) replace ambiguous AAs ('B', 'Z', and 'X') in the protein database with unambiguous AAs in the reported peptides, e.g., exchange 'X' with 'H'. Such a peptide cannot be found by exactly matching its sequence against the database. However, these cases can be recovered by tolerant search (done automatically).
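Two search modes are available: exact search (the default, where peptides that remain unmatched are automatically retried with tolerant search) and tolerant search for all peptides (enabled by the flag 'full_tolerant_search').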
Regardless of whether exact or tolerant search is used, ambiguous AAs in the peptide sequence must match exactly in the protein DB (i.e., 'X' in a peptide only matches 'X' in the database). The exact mode is much faster (about 10x) and consumes less memory (about 2.5x less), but might fail to report a few protein hits with ambiguous AAs for some peptides. Usually these proteins are putative, however.
The exact mode also supports multiple threads (use the -threads option) to speed up computation even further, at the cost of some memory. This applies only to the exact search (Aho-Corasick). If tolerant searching needs to be done for unassigned peptides, the tolerant search will account for most of the runtime.
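To make the difference between the two modes concrete, here is a naive sketch of the matching rules. It uses a simple sliding-window scan, not the Aho-Corasick search mentioned above, and the function `find_hits` is hypothetical. It assumes the usual meaning of the ambiguity codes ('B' is D or N, 'Z' is E or Q, 'X' is any residue) and that only ambiguous protein residues resolved against a different peptide letter count towards 'aaa_max'.

```python
# Residues an ambiguous protein letter may stand for; 'X' matches anything.
AMBIGUOUS = {"B": frozenset("DN"), "Z": frozenset("EQ")}


def find_hits(peptide, protein, tolerant=False, aaa_max=4):
    """Return all start offsets at which `peptide` matches `protein`."""
    hits = []
    for start in range(len(protein) - len(peptide) + 1):
        used = 0  # ambiguous protein residues consumed by this alignment
        for pep_aa, prot_aa in zip(peptide, protein[start:]):
            if pep_aa == prot_aa:
                continue  # identical letters match, including 'X' vs 'X'
            if not tolerant or pep_aa in "BZX":
                break  # ambiguous peptide letters must match exactly
            if prot_aa == "X" or pep_aa in AMBIGUOUS.get(prot_aa, ()):
                used += 1
                if used > aaa_max:
                    break
            else:
                break
        else:
            hits.append(start)  # inner loop finished without a mismatch
    return hits
```

With these rules, the peptide `AHK` is not found in `MAXKR` by exact search, but tolerant search reports it at offset 1 because 'X' in the protein may stand for 'H'.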
Finding a peptide sequence in a protein sequence does not by itself make the hit valid; this is where enzyme specificity comes into play. By default, we demand that the peptide is fully tryptic (since the enzyme parameter is set to "trypsin" and specificity is "full"). So unless the peptide coincides with the C- and/or N-terminus of the protein, the peptide's cleavage pattern should fulfill the trypsin cleavage rule [KR][^P]. We make one exception for peptides which start at the second AA of the protein where the first AA of the protein is methionine (M), which is usually cleaved off in vivo, e.g., the two peptides AAAR and MAAAR would both match a protein starting with MAAAR.
You can relax the requirements further by choosing semi-tryptic (only one of two "internal" termini must match requirements) or none (essentially allowing all hits, no matter their context).
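The specificity check described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the helper names are hypothetical, only trypsin is modelled, and a terminus that coincides with a protein terminus (or satisfies the methionine exception) is treated as matching in "semi" mode as well.

```python
def is_cleavage_site(protein, pos):
    """True if trypsin may cut between protein[pos - 1] and protein[pos]."""
    return protein[pos - 1] in "KR" and protein[pos] != "P"  # [KR][^P]


def hit_is_valid(protein, start, length, specificity="full"):
    """Check the context of a peptide located at protein[start:start+length]."""
    end = start + length
    n_ok = (
        start == 0                                # protein N-terminus
        or (start == 1 and protein[0] == "M")     # initial Met cleaved off
        or is_cleavage_site(protein, start)
    )
    c_ok = end == len(protein) or is_cleavage_site(protein, end)

    if specificity == "full":
        return n_ok and c_ok
    if specificity == "semi":
        return n_ok or c_ok
    if specificity == "none":
        return True
    raise ValueError(f"unknown specificity: {specificity!r}")
```

For a protein `MAAARGGK`, both `MAAAR` (start 0) and `AAAR` (start 1) pass the "full" check, whereas `AAR` (start 2) passes only with "semi" or "none".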
The command line parameters of this tool are:
PeptideIndexer -- Refreshes the protein references for all peptide hits.
Version: 1.11.1 Nov 14 2013, 11:18:15, Revision: 11976

Usage:
  PeptideIndexer <options>

Options (mandatory options marked with '*'):
  -in <file>*                     Input idXML file containing the identifications. (valid formats: 'idXML')
  -fasta <file>*                  Input sequence database in FASTA format. Non-existing relative file-names
                                  are looked up via 'OpenMS.ini:id_db_dir' (valid formats: 'fasta')
  -out <file>*                    Output idXML file. (valid formats: 'idXML')
  -decoy_string <string>          String that was appended (or prepended - see 'prefix' flag below) to the
                                  accession of the protein database to indicate a decoy protein.
                                  (default: '_rev')
  -missing_decoy_action <action>  Action to take if NO peptide was assigned to a decoy protein (which
                                  indicates wrong database or decoy string): 'error' (exit with error, no
                                  output), 'warn' (exit with success, warning message)
                                  (default: 'error' valid: 'error', 'warn')

The enzyme determines valid cleavage-sites and the cleavage specificity set by the user determines how these are enforced.:
  -enzyme:name                    Enzyme which determines valid cleavage sites, e.g., for trypsin it should
                                  (unless at protein terminus) end on K or R and the AA-before should also
                                  be K or R, and not followed by proline.
                                  (default: 'Trypsin' valid: 'Trypsin')
  -enzyme:specificity             Specificity of the enzyme.
                                  'full': both internal cleavage-sites must match.
                                  'semi': one of two internal cleavage-sites must match.
                                  'none': allow all peptide hits no matter their context. Therefore, the
                                  enzyme chosen does not play a role here
                                  (default: 'full' valid: 'full', 'semi', 'none')

  -write_protein_sequence         If set, the protein sequences are stored as well.
  -prefix                         If set, the database has protein accessions with 'decoy_string' as prefix.
  -keep_unreferenced_proteins     If set, protein hits which are not referenced by any peptide are kept.
  -allow_unmatched                If set, unmatched peptide sequences are allowed. By default (i.e. this
                                  flag is not set) the program terminates with error status on unmatched
                                  peptides.
  -full_tolerant_search           If set, all peptide sequences are matched using tolerant search. Thus
                                  potentially more proteins (containing ambiguous AA's) are associated.
                                  This is much slower!
  -aaa_max <AA count>             Maximal number of ambiguous amino acids (AAA) allowed when matching to a
                                  protein DB with AAA's. AAA's are 'B', 'Z', and 'X' (default: '4' min: '0')

Common TOPP options:
  -ini <file>                     Use the given TOPP INI file
  -threads <n>                    Sets the number of threads allowed to be used by the TOPP tool
                                  (default: '1')
  -write_ini <file>               Writes the default configuration file
  --help                          Shows options
  --helphelp                      Shows all options (including advanced)
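As a usage illustration, the following sketch assembles a PeptideIndexer command line from the options listed above. The helper `build_command` and the file names in the example are hypothetical; only flags that appear in the help text are used, and nothing is executed here.

```python
def build_command(in_file, fasta, out_file, decoy_string="_rev",
                  specificity="full", allow_unmatched=False, threads=1):
    """Return the argument list for one PeptideIndexer invocation."""
    if specificity not in ("full", "semi", "none"):
        raise ValueError(f"invalid specificity: {specificity!r}")

    cmd = [
        "PeptideIndexer",
        "-in", in_file,          # input idXML
        "-fasta", fasta,         # target/decoy sequence database
        "-out", out_file,        # output idXML
        "-decoy_string", decoy_string,
        "-enzyme:specificity", specificity,
        "-threads", str(threads),
    ]
    if allow_unmatched:
        cmd.append("-allow_unmatched")  # do not fail on unmatched peptides
    return cmd
```

If the tool is installed and on the PATH, the resulting list could be passed to `subprocess.run(cmd, check=True)`, for instance with `build_command("ids.idXML", "db.fasta", "indexed.idXML", specificity="semi")`.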
INI file documentation of this tool:
OpenMS / TOPP release 1.11.1 | Documentation generated on Thu Nov 14 2013 11:19:24 using doxygen 1.8.5 |