Data Bases Classification
First read about Data Bases Classification
Basis of the Data Bases Classification
A database that contains many species is called here a general-use database. See for instance Extraction May 07 (TS extraction at the end of the page): the number of Type Strains indicates the resolving power, or specificity, of a database.
Choice of sequences databases
Summary
Databases are called "stringent" or "lax", depending on whether or not the names are nomenclature compliant. TS databases contain only type strain (TS) and complete genome sequences. SSU-rDNA 16S databases are of general use, but databases are also built for genes that are important for taxonomy or identification, such as gyrB, recA, sodA, rpoB, tmRNA, tuf (Bacteria) and groel2-hsp65 (Actinobacteria).
Stringent or Lax databases
The aim was to provide tools for identifying Bacteria and Archaea using phylogeny.
- Identification may be needed at the species level, so nomenclature stringent databases have been constructed.
- If we only want to look at closely related sequences, even when their identification is dubious or uncertain, the lax databases can be used.
Type strains only databases
Another important point is the confidence that we can place in a set of sequences bearing a name from the nomenclature. The presence of type strains may be useful to improve this confidence. A side effect of these TS-only databases is a reduced number of identical sequences. Type-strain-only databases are built for the SSU-rDNA banks.
SSU-rDNA 16S databases
These general databases are constructed for Bacteria and Archaea. They contain a large proportion of incompletely identified sequences (see the Extraction Ratio page), but their coverage of species is also the widest. These databases also contain plenty of identical sequences and are contaminated at a rather high level by erroneous identifications (note the importance of type strain sequences in this case).
Lax, Stringent and TS only databases are constructed
Besides these, a <PHYLUM>-level-only database can be helpful. It contains only one sequence for each bacterial phylum.
Specific genes databases
These databases are constructed only for Bacteria. The genes usually used for identification (or taxonomy) are gyrB, recA, sodA, rpoB, tmRNA, tuf and, for Actinobacteria, groel2-hsp65.
From GenBank to BibiLe
Every month, usually during the first week, the databases used by BibiLe are extracted from GenBank and compiled.
See Extraction Ratio and Extraction Evolution for stats
We use the ACNUC database at [1] as the source of sequences. For each bank, the extraction produces one or more (large) files in Bank Format.
Searching information
Usual identification databases
- From the Bank Format files we extract, for each nucleotide sequence in the bank:
- the sequence (minimum length 300 bp);
- the name of the species and any important information on the nature of the strain, such as pathovar, serovar, etc.;
- the Type Strain status of the strain (explicitly stated, or given as a T following the strain identifier (collection Id));
- the "specific host" information as well as an "endosymbiont" status.
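The extraction listed above can be pictured with a minimal Python sketch. It is not the actual extraction program: the function name `extract_record`, the returned dictionary keys, and the rule used to spot a trailing T are assumptions made for illustration, and the input is assumed to be one record in the usual GenBank flat-file layout.

```python
import re

MIN_LENGTH = 300  # minimum sequence length stated in the list above, in bp


def extract_record(flatfile_text):
    """Return the fields listed above as a dict, or None if the sequence is too short."""
    # Sequence: everything between ORIGIN and the closing '//', letters only.
    origin = re.search(r"^ORIGIN.*?\n(.*?)^//", flatfile_text, re.S | re.M)
    sequence = re.sub(r"[^a-zA-Z]", "", origin.group(1)).upper() if origin else ""
    if len(sequence) < MIN_LENGTH:
        return None

    def qualifier(name):
        # Feature qualifiers look like: /strain="XYZ 12345" (hypothetical value)
        match = re.search(r'/%s="([^"]*)"' % re.escape(name), flatfile_text)
        return match.group(1) if match else None

    organism = re.search(r"^\s+ORGANISM\s+(.+)$", flatfile_text, re.M)
    strain = qualifier("strain")
    # Assumption: a type strain is either quoted explicitly ("type strain")
    # or marked by a T directly after the collection number.
    is_type = bool(
        re.search(r"type strain", flatfile_text, re.I)
        or (strain and re.search(r"\dT$", strain.strip()))
    )
    return {
        "organism": organism.group(1).strip() if organism else None,
        "strain": strain,
        "variant": qualifier("serovar"),
        "host": qualifier("specific_host") or qualifier("host"),
        "type_strain": is_type,
        "sequence": sequence,
    }
```

Records shorter than 300 bp are dropped at the very first step, so the later checks only ever see sequences that can enter a database.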
Controlling information
- the species name
- The taxId is translated to the correct name, and trailing information (strain number, host, etc.) is removed.
- If the "ORIGIN" or "DEFINITION" species name differs from the name translated from the taxId, the indication "cid" ("corrected identification") is added at the end of the sequence description.
- Names are compared to the Nomenclature Data Base constructed from the DSMZ database [2] (the Excel file). If the name is validly published according to this Nomenclature Data Base, the strain is marked as nomenclature compliant. Otherwise the name is marked as not compliant with the nomenclature.
- The Type Strain status
- The Nomenclature Data Base constructed from [3] (the Excel file) gives us information on the Type Strain status of a given strain, but we also use information gathered in a local database constructed earlier from [4] by Gregory Devulder during his thesis and the construction of the previous BIBI version (see [5]). This database is regularly improved.
- Dubious identification (X flag)
- Some strains bear a name but are not phylogenetically related to the species or genus; these strains are blacklisted.
- this information is given by users;
- for groel2-hsp65 sequences of Actinobacteria, this blacklist is constructed from a survey of a global alignment and phylogenetic tree;
- T4Bi (Tree for Bacteria Identification) will be used (soon, we hope) to automate the process.
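As an illustration of how the three controls combine, here is a small Python sketch. The function `control_name`, its arguments, and the idea of keying the blacklist by accession number are assumptions chosen for the example; the flags `v`, `?` and `X` are the ones used in the sequence descriptions.

```python
def control_name(taxid_name, record_name, accession, valid_names, blacklisted_ids):
    """Apply the checks above to one sequence.

    Returns (name, flag, corrected), where flag is 'v', '?' or 'X' and
    corrected tells whether the "cid" indication should be added.
    """
    # The name translated from the taxId wins; note when the record disagreed.
    corrected = record_name.strip() != taxid_name.strip()
    if accession in blacklisted_ids:
        flag = "X"  # dubious identification (reported by users or tree survey)
    elif taxid_name in valid_names:
        flag = "v"  # validly published name
    else:
        flag = "?"  # not compliant with the nomenclature
    return taxid_name, flag, corrected
```

For instance, `control_name("Haemophilus influenzae", "H.influenzae", "AB000001", {"Haemophilus influenzae"}, set())` returns `("Haemophilus influenzae", "v", True)`; the accession number in this call is made up.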
Saving and compiling a BLAST base
Two Bibi Data Bases are automatically built for each gene:
- A nomenclature stringent Data Base contains only the sequences of nomenclature compliant strains;
- A nomenclature lax Data Base contains these strains and all the others.
For SSU-rDNA (16S), a Type Strains Data Base containing only type strains is also constructed.
Finally, a BLAST Data Base is automatically built for each of them; this is the last step.
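The split into stringent, lax and type-strain sets can be sketched as follows. This is only an illustration: the record layout (a dict with a `flag` and a `type_strain` key) and the choice to admit only `v`-flagged names into the stringent set are assumptions, and the BLAST formatting step itself is not shown.

```python
def split_databases(records):
    """Partition records into (stringent, lax, type_strains) lists.

    Each record is assumed to be a dict with a 'flag' key ('v', '?' or 'X')
    and a boolean 'type_strain' key.
    """
    stringent, lax, type_strains = [], [], []
    for record in records:
        lax.append(record)  # the lax set keeps every sequence
        if record["flag"] == "v":
            stringent.append(record)  # nomenclature compliant names only
        if record["type_strain"]:
            type_strains.append(record)  # only built for SSU-rDNA (16S)
    return stringent, lax, type_strains
```

By construction the stringent set is always a subset of the lax one, which is why the lax databases give the widest coverage.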
- The <PHYLUM> Bacteria sequence database contains only sequences from the type strain of the type species of the type genus ... of the phylum. If no sequence was found in GenBank for the given strain, a sequence of another strain of the species (or of another species of the genus...) was taken. In most cases the ends of the sequence were reconstructed from consensus sequences of the genus. This work was done for the PhyID-CD project [6] by Anna Laura Erbino and Sophie Mignard. PhyID-CD may be used for identification, but it was exploring a new concept of chimera detection.
BIBIle uses an "avatar" of the FASTA format, the "T4BiFasta-like" format
BIBIle uses a modified FASTA format, the "T4BiFasta-like" format, which condenses all the main descriptors of the sequence into the first (comment) line. Because the sequences used for bacterial identification are short (<2000 bp), the whole nucleotide sequence is written on the second line (this does not respect the 80-character rule).
You may see the characteristic ~ separators in some parts of the answers and technical annexes.
Formal description
>Genus_species[_subsp_subspecies]~[?/v/Xn]~[T/(N==i)]~GB-Id[=biovar, serovar...,[Ly]] [<specific host/{host of the endosymbiont bacteria}][Md[1...n]/[]]
nucleotide sequence
The species and subspecies name
The tilde ~ separates the name of the species or subspecies from the other information:
- this is a species
- >Corynebacterium_appendicis~v~T~AJ314919
- and a subspecies
- >Mycobacterium_avium_subsp._avium~v~T~AF126030
Of course the first descriptor line is followed by the sequence:
>Mycobacterium_avium_subsp._avium~v~T~AF126030
CGTGCTTAACACATGCAAGTCGGA...CTCGAGTGGCGAACGGGTGAGTAACACGTGG
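Because each record occupies exactly two lines, reading such a file is simpler than reading ordinary multi-line FASTA. A minimal sketch, assuming the file has been opened as text (the function name `read_two_line_fasta` is hypothetical):

```python
def read_two_line_fasta(lines):
    """Yield (descriptor, sequence) pairs from an iterable of text lines."""
    descriptor = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            descriptor = line[1:]  # drop the leading '>'
        elif descriptor is not None:
            yield descriptor, line  # the whole sequence sits on one line
            descriptor = None
```

It accepts any iterable of lines, so an open file object or a list of strings both work.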
In some cases the name may not be valid yet, but the TS status is declared in GenBank:
>Halomonas_alkantarctica~?~T
Nomenclature and phylogeny information
- [?/v/Xn]
- ? means unknown in the nomenclature like in >uncultured_Rhodococcus_sp.~?~
- v means valid name like in >Corynebacterium_ulcerans~v~ (this information is masked in the trees)
- X means that, even if the name follows the nomenclature, the denomination is doubtful; an example is >Corynebacterium_xerosis~X~
- Xn, where n is a numerical score, means that the denomination is doubtful, with n as a confidence index (this is not yet implemented)
Type strain status
- [T/(N==i)]
- T means that the strain is a Type Strain for instance >Propionimicrobium_lymphophilum~v~T~
- N means that the strain is an ordinary one, not a Type Strain (the N means "Not"), as in >Corynebacterium_diphtheriae~v~N~ (note that this tag is replaced by i in the trees to improve readability)
GenBank-Id
In the following line, AY489407 is the GenBank ID:
>Halomonas_anticariensis~v~N~AY489407
Specific host and endosymbionts
- <specific host: you will see something like >Mycobacterium_canettii~?~N~AJ749940<"human". Specific host is a GenBank feature and may be missing or vague.
- <{endosymbiont found in}: as in >Buchnera_aphidicola~v~C~AE013218<{"Schizaphis_graminum"}. The endosymbiont status of a given species is screened from GenBank (notes, definition ...); sometimes the species is missing and is noted {not in notes}.
Extra information
- [=Vbiovar, serovar...,[Ly]]
- V indicates a variant (serovar, pathovar etc.) like here >Curtobacterium_flaccumfaciens~v~N~AY273208=Vbeticola where beticola is the pathovar
- Ly means that the sequence has been deposited by our team (this is for internal controls, not important for the external user)
- [Md] is a separator: data concerning the renaming of the sequence follow it.
- [] indicates that the species name is not the original one; it has been changed to a (more) correct and nomenclature-friendly name: >Pseudomonas_coronafaciens~v~T~AB001440=VatropurpureaMd?
>Corynebacterium~?~N~AY581887=V2300500Md
- [1...n] indicates sequences under a common number (usually complete genome sequences), as here:
>Corynebacterium_diphtheriae~v~C~BX248360=VMd2
NB: the indications after Md are not stabilized yet; they may change.
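To make the descriptor layout concrete, here is a minimal Python sketch that splits one descriptor line into its fields. It is not part of BIBIle: the function `parse_header` and the dictionary keys are hypothetical, the Ly marker is not handled separately, and whatever follows Md is kept as raw text since that part may change.

```python
def parse_header(header):
    """Split one T4BiFasta-like descriptor line into its fields."""
    parts = header.strip().lstrip(">").split("~", 3)
    parts += [""] * (4 - len(parts))  # tolerate descriptors with missing fields
    name, nomenclature, status, tail = parts

    # Everything after the last "Md" concerns the renaming of the sequence.
    renamed = "Md" in tail
    renaming = ""
    if renamed:
        tail, _, renaming = tail.rpartition("Md")

    tail, _, host = tail.partition("<")  # specific host or endosymbiont host
    accession, _, variant = tail.partition("=")  # GB-Id, then "=V<variant>"
    variant = variant.strip()
    if variant.startswith("V"):
        variant = variant[1:]

    return {
        "name": name.replace("_", " "),
        "nomenclature": nomenclature or None,  # '?', 'v', 'X' or 'Xn'
        "status": status or None,  # type strain field, e.g. 'T' or 'N'
        "accession": accession.strip() or None,
        "variant": variant or None,
        "endosymbiont": host.strip().startswith("{"),
        "host": host.strip().strip('{}"') or None,
        "renamed": renamed,
        "renaming": renaming or None,
    }
```

Applied to `>Curtobacterium_flaccumfaciens~v~N~AY273208=Vbeticola`, the sketch yields the name `Curtobacterium flaccumfaciens`, the accession `AY273208` and the variant `beticola`. A variant or host that itself contains the letters "Md" would be split wrongly; a real parser would need a stricter rule.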
Strange names
During the extraction process, the program looks for relevant information in a mass of words, the GenBank description of the sequence. Unfortunately the descriptors are often weak, and we have to reconstruct the names and characteristics of the strains. Most erroneous identifications like
>H.influenzae
should now be corrected (thanks to the taxId translation) to
>Haemophilus_influenzae.
Cases like
>Plectonema_boryanum_UTEX_485~?~N~AY082652 where the name is polluted by a strain number, should hopefully no longer appear.
These erroneous sequence names are of course seen only in the "lax" Bibi Data Bases.
However, the extraction process is fully automated, so rare cases may be seen elsewhere if the situation is really unexpected.
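A quick way to spot the two kinds of strange names shown above (an abbreviated genus, or a strain designation stuck to the name) is a heuristic check such as the following Python sketch. It is only an illustration and not the rule used by the extraction program; the function name `looks_polluted` and both tests are assumptions.

```python
import re


def looks_polluted(name):
    """Heuristically flag a species name such as 'H.influenzae' or
    'Plectonema_boryanum_UTEX_485' (name = the part before the first ~)."""
    tokens = name.split("_")
    # Abbreviated genus: a single capital letter followed by a dot.
    if re.match(r"[A-Z]\.", tokens[0]):
        return True
    # After the binomial, digits or all-capital tokens suggest a strain number.
    return any(
        re.search(r"\d", token) or (len(token) > 1 and token.isupper())
        for token in tokens[2:]
    )
```

Names such as `Mycobacterium_avium_subsp._avium` or `uncultured_Rhodococcus_sp.` pass the check, while the two examples above are flagged.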
Results of extractions
See Extraction Ratio and Extraction Evolution for stats