Data Bases Classification

First read about Data Bases Classification

Basis of the DataBasesClassification

A database that contains many species is said here to be of general use. The number of Type Strains (TS) it holds is indicative of its resolving power, or specificity; see for instance Extraction May 07 (TS extraction at the end of the page).

Choice of sequences databases

Summary

Databases are called "stringent" or "lax", depending on whether or not the names comply with the nomenclature. TS databases contain only TS and complete genome sequences. SSU-rDNA 16S databases are of general use, but databases are also built for genes that are important for taxonomy or identification: gyrB, recA, sodA, rpoB, tmRNA, tuf (Bacteria) and groel2-hsp65 (Actinobacteria).

Stringent or Lax databases

The aim was to provide tools for identifying Bacteria and Archaea using phylogeny.

Type strains only databases

Another important point is how much confidence we can place in a set of sequences bearing a name from the nomenclature. The presence of type strains may be useful to improve that confidence. A side effect of these TS-only databases is to reduce the number of identical sequences. Type-strain-only databases are built for the SSU-rDNA banks.

SSU-rDNA 16S databases

These general databases are constructed for Bacteria and Archaea. They contain a large share of incompletely identified sequences (see the Extraction Ratio page), but their coverage of species is also the widest. These databases also contain plenty of identical sequences and are contaminated at a rather high level by erroneous identifications (note the importance of type strain sequences in this case).
Lax, Stringent and TS-only databases are constructed.
Besides these, a <PHYLUM>-level-only database can be helpful. It contains only one sequence for each bacterial phylum.

Specific genes databases

These databases are constructed only for Bacteria. The genes usually used for identification (or taxonomy) are gyrB, recA, sodA, rpoB, tmRNA, tuf, and groel2-hsp65 for Actinobacteria.

From GenBank to BibiLe

Every month, usually during the first week, the databases used by BibiLe are extracted from GenBank and compiled.
See Extraction Ratio and Extraction Evolution for stats.
We use the ACNUC database at [1] as the source of sequences. Each bank is extracted as one or more (large) files in Bank Format.

Searching information

Usual identification databases

Controlling information

Saving and compiling a BLAST base

Two Bibi Data Bases are automatically built for each gene :

And for SSU-rDNA (16S), a Type Strains Data Base containing only type strains is also constructed.

Finally, a BLAST Data Base is automatically built for each of them; this is the last step.
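As an illustration of the type-strains-only selection, here is a minimal sketch, assuming the records already carry the T4BiFasta-like headers described further down this page. The function name `keep_type_strains` and its `keep` parameter are hypothetical, the sequences are toy placeholders, and this is not the actual BIBI extraction code. Whether other status codes (such as the C seen in one example on this page) should also be kept is left to the caller.

```python
def keep_type_strains(records, keep=("T",)):
    """Return the (header, sequence) pairs whose third ~ field is in `keep`.

    Assumption: the third tilde-separated field holds the type strain status.
    """
    kept = []
    for header, sequence in records:
        fields = header.lstrip(">").split("~")
        if len(fields) >= 3 and fields[2] in keep:
            kept.append((header, sequence))
    return kept


# Headers are taken from this page; the sequences are toy placeholders.
records = [
    (">Mycobacterium_avium_subsp._avium~v~T~AF126030", "ACGT"),
    (">Halomonas_anticariensis~v~N~AY489407", "ACGT"),
    (">Halomonas_alkantarctica~?~T", "ACGT"),
]
type_strains = keep_type_strains(records)
for header, _ in type_strains:
    print(header)
```

Passing the accepted codes as a parameter keeps the sketch neutral about which status values a given bank should retain.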

BIBIle uses an "avatar" of the FASTA format, the "T4BiFasta-like" format

BIBIle uses a modified FASTA format, the "T4BiFasta-like" format, which condenses all the main descriptors of the sequence into the first (comment) line. Because the sequences used for bacterial identification are short (<2000 bp), the second line holds the whole nucleotide sequence; this does not respect the 80-character rule.

You may see the two characteristic ~ in some parts of the answers and technical annexes.
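The two-line layout can be obtained from a conventionally wrapped FASTA file by joining the sequence lines of each record. The sketch below only illustrates that idea: the function name `to_two_line_fasta`, the header `>Genus_species~v~N~XX000000` and the bases are all made up for the example, and nothing here is taken from the BIBI programs.

```python
def to_two_line_fasta(lines):
    """Yield (header, sequence) pairs, each sequence joined onto a single line."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical wrapped input: one record spread over three lines.
wrapped = [
    ">Genus_species~v~N~XX000000",
    "ACGTACGT",
    "TTGA",
]
two_line = list(to_two_line_fasta(wrapped))
for header, sequence in two_line:
    print(header)
    print(sequence)
```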

Formal description

>Genus_species[_subsp_subspecies]~[?/v/Xn]~[T/(N==i)]~GB-Id[=biovar, serovar...,[Ly]] [<specific host/{host of the endosymbiont bacteria}][Md[1...n]/[]]
nucleotide sequence

The species and subspecies name

The tilde ~ separates the name of the species or subspecies from the other information:

Of course the first descriptor line is followed by the sequence:

>Mycobacterium_avium_subsp._avium~v~T~AF126030
CGTGCTTAACACATGCAAGTCGGA...CTCGAGTGGCGAACGGGTGAGTAACACGTGG

In some cases the name may not be valid yet, but the TS status is declared in GenBank:
>Halomonas_alkantarctica~?~T

Nomenclature and phylogeny information

Type strain status

GenBank-Id

In the following line, AY489407 is the GenBank ID:
>Halomonas_anticariensis~v~N~AY489407
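The tilde-separated fields shown in the examples so far can be split mechanically. The following is a minimal sketch, not the BIBI parser: the `Header` type, its field names and `parse_header` are hypothetical, and everything after the GenBank ID (the part after `=`, the host, the Md indications) is kept as one uninterpreted string, since the formal description above does not pin those parts down precisely.

```python
import re
from typing import NamedTuple, Optional


class Header(NamedTuple):
    name: str                     # species or subspecies name, underscores as spaces
    nomenclature: Optional[str]   # second field, e.g. "v" or "?"
    strain_status: Optional[str]  # third field, e.g. "T" or "N"
    accession: Optional[str]      # GenBank ID, when present
    extra: Optional[str]          # whatever follows the GenBank ID, uninterpreted


def parse_header(line: str) -> Header:
    """Split a T4BiFasta-like description line into its main fields."""
    if not line.startswith(">"):
        raise ValueError("not a description line: %r" % line)
    fields = line[1:].strip().split("~", 3)
    fields += [None] * (4 - len(fields))  # pad headers that stop early
    name, nomenclature, status, rest = fields
    accession = extra = None
    if rest:
        # Assumption: the ID ends at the first "=" or whitespace.
        match = re.match(r"([^=\s]+)[=\s]*(.*)", rest)
        if match:
            accession = match.group(1)
            extra = match.group(2) or None
    return Header(name.replace("_", " "), nomenclature, status, accession, extra)


parsed = parse_header(">Halomonas_anticariensis~v~N~AY489407")
print(parsed.name, parsed.strain_status, parsed.accession)
```

A header that stops after the type strain field, such as `>Halomonas_alkantarctica~?~T`, simply yields `None` for the accession.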

Specific host and endosymbionts

Extra information

>Corynebacterium~?~N~AY581887=V2300500Md

>Corynebacterium_diphtheriae~v~C~BX248360=VMd2
NB: the indications that follow Md are not stabilized yet; they may change.

Strange names

During the extraction process, the program looks for relevant information in a mass of words: the GenBank description of the sequence. Unfortunately the descriptors are often weak, and we have to reconstruct the names and characteristics of the strains. Most erroneous names like
>H.influenzae
should now be corrected (thanks to the taxId translation) to
>Haemophilus_influenzae
And cases like
>Plectonema_boryanum_UTEX_485~?~N~AY082652
where the name is polluted by a strain number, should hopefully no longer appear.
These erroneous sequence names are of course seen only in the "lax" Bibi Data Bases.
But the extraction process is fully automated, so rare cases may be seen elsewhere if the situation is really unexpected.
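The two kinds of strange names mentioned above can be spotted with simple pattern checks. This is only a heuristic sketch under stated assumptions: the function `name_problems` and its two rules (a one-letter genus followed by a dot, and a name token containing digits) are hypothetical, and they do not reproduce the taxId translation used by the real extraction.

```python
import re

ABBREVIATED_GENUS = re.compile(r"^[A-Z]\.")  # e.g. "H.influenzae"
HAS_DIGIT = re.compile(r"\d")                # e.g. the "485" of "UTEX_485"


def name_problems(name):
    """Return a list of suspicious features found in an underscore-separated name."""
    problems = []
    if ABBREVIATED_GENUS.match(name):
        problems.append("abbreviated genus")
    if any(HAS_DIGIT.search(token) for token in name.split("_")):
        problems.append("token with digits (possible strain number)")
    return problems


for candidate in ("H.influenzae", "Haemophilus_influenzae", "Plectonema_boryanum_UTEX_485"):
    print(candidate, name_problems(candidate))
```

Such a check only flags candidates for review; it cannot by itself restore the correct name.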

Results of extractions

See Extraction Ratio and Extraction Evolution for stats

Sun Aug 17 11:33:56 2008
