|
Below is a list of all the data compiled and integrated into the Plant-Specific Database.
For further details about this work, see Gutierrez et al. (2004).
Description field from The Institute for Genomic Research (TIGR) Arabidopsis thaliana genome.
Comment field from The Institute for Genomic Research (TIGR) Arabidopsis thaliana genome.
Classification based on the pattern of sequence similarity to proteins in other
organisms.
For further details see the Introduction.
Pub locus identifier.
GenBank accession number.
BLASTCLUST systematically clusters protein or DNA sequences based on pairwise matches found using the BLAST algorithm.
- The number of genes included in a gene family depends on the criteria for inclusion.
For this "Plant Specific" database, we have arbitrarily defined a gene family as those members whose
proteins are clustered by BLASTCLUST using the parameters L(length)=0.6 and S(similarity)=0.8.
The list below shows the other Arabidopsis proteins that cluster with protein AtXgXXXXX under these parameters.
- If you would like to explore the gene family using less or more stringent clustering parameters,
click on the "Gene Family" link below. This will return a graphical display of how L and S parameters influence
the number of members of the family and will provide links to clusters that are formed with other parameters.
However, this analysis will require one minute or more.
- For more information on BLASTCLUST: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastclust.txt
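The family definition above can be illustrated with a minimal sketch of single-linkage clustering over pairwise matches. This is not the BLASTCLUST implementation: the function name `cluster_proteins`, the tuple layout of `hits`, and the example identifiers are hypothetical, and the way the real program computes length coverage and similarity from BLAST alignments is not reproduced here. Only the thresholds L=0.6 and S=0.8 come from the text.

```python
from collections import defaultdict


def cluster_proteins(ids, hits, min_coverage=0.6, min_similarity=0.8):
    """Group proteins by single linkage over pairwise hits that pass both cutoffs.

    ids  -- iterable of protein identifiers
    hits -- iterable of (id_a, id_b, coverage, similarity) tuples (assumed layout)
    """
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, coverage, similarity in hits:
        # A pair is linked only if it meets both the L and the S threshold.
        if coverage >= min_coverage and similarity >= min_similarity:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    # Largest clusters first, ties broken alphabetically, for stable output.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical pairwise matches among four proteins A-D.
example_ids = ["A", "B", "C", "D"]
example_hits = [
    ("A", "B", 0.90, 0.85),
    ("B", "C", 0.70, 0.80),
    ("C", "D", 0.50, 0.95),  # fails the coverage cutoff
]
families = cluster_proteins(example_ids, example_hits)
```

Because linkage is transitive, A and C end up in the same family even though no direct A-C match is listed. Raising either threshold can only split families, which is why the number of members depends on the parameters chosen.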
The classification indicated in this field is based on the work by scientists expert in the particular gene family.
Most of this information was obtained from the Arabidopsis Information Resource
(http://www.arabidopsis.org).
See the Gene Family Information
webpage for further details.
The gene families related to Lipid Metabolism were obtained from
the "Arabidopsis Lipid Gene Database".
A description of this database can be found in
Beisson et al. (2003).
The gene families related to RNA metabolism were kindly provided by Dr. Vivek Anantharaman and Dr. Eugene V. Koonin. For a description of
the study of proteins involved in RNA metabolism, see Anantharaman et al. (2002).
Protein molecular weight is based on the prediction in the TIGR genome.
Protein isoelectric point is based on the prediction in the TIGR genome.
Enzyme Commission number as annotated by the
Kyoto Encyclopedia of Genes and Genomes.
Links to external databases:
Predictions were performed using the TargetP program
(described in Emanuelsson et al., 2000),
available at the Center for Biological Sequence Analysis (http://www.cbs.dtu.dk/services/TargetP).
TargetP looks for N-terminal sorting signals by feeding the outputs from SignalP, ChloroP and an analogous mitochondrial predictor into
a neural network that makes the final choice between the different compartments.
It provides a score and a reliability class (a measure of the difference between the winner and runner-up models) to evaluate the significance of
the prediction. The TargetP web server size cutoff of 4000 aa precluded analysis of the complete sequence of four Arabidopsis protein-coding genes
(At1g48090.1, At1g67120.1, At3g02260.1 and At5g23110.1). In these cases, only the N-terminal portion of the protein was utilized for the prediction.
Caution should be exercised when interpreting individual predictions. The TargetP program can yield false positives and
false negatives (Emanuelsson et al., 2000).
TargetP correctly discriminates between chloroplast, mitochondrial, secretory pathway and other locations 85% of the time when analyzing
Arabidopsis proteins (Emanuelsson et al., 2000). To facilitate correct interpretation of individual predictions, we provide
the complete output of the programs.
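As a rough illustration of how a reliability class can be derived from the scores, the sketch below takes the difference between the highest and second-highest score and bins it. The cutoffs 0.8, 0.6, 0.4 and 0.2 are an assumption made for this example and should be checked against the TargetP documentation; the function names and the score labels are hypothetical. The 4000 aa limit comes from the text, but treating "the N-terminal portion" as exactly the first 4000 residues is also an assumption.

```python
MAX_LENGTH = 4000  # web server size cutoff stated in the text


def n_terminal_portion(sequence, max_length=MAX_LENGTH):
    """Keep only the N-terminal part of a sequence that exceeds the size limit."""
    return sequence[:max_length]


def reliability_class(scores, cutoffs=(0.8, 0.6, 0.4, 0.2)):
    """Return (predicted location, reliability class) from per-location scores.

    Class 1 is the most reliable (largest margin between winner and runner-up).
    The cutoffs are assumed values used for illustration only.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (winner, best), (_, runner_up) = ranked[0], ranked[1]
    margin = best - runner_up
    for rc, cutoff in enumerate(cutoffs, start=1):
        if margin > cutoff:
            return winner, rc
    return winner, len(cutoffs) + 1


# Hypothetical scores for two proteins.
confident = {"chloroplast": 0.95, "mitochondrion": 0.05,
             "secretory": 0.02, "other": 0.03}
ambiguous = {"chloroplast": 0.50, "mitochondrion": 0.45,
             "secretory": 0.02, "other": 0.03}
```

A prediction such as `ambiguous`, where the winner barely beats the runner-up, falls in the least reliable class and is the kind of case where the complete program output is worth inspecting.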
[Gene Expression in different plant organs]
EST data were prepared as described in Beisson et al. (2003).
The composition of the synthetic cDNA libraries used in this study is available from the
Arabidopsis Lipid Gene database.
To analyze the expression of genes in organs, we used a highly filtered dataset
prepared from the publicly available two-color microarray experiments performed by the Arabidopsis
Functional Genomics Consortium (AFGC). Briefly, all microarray hybridizations comparing gene expression in
organs were considered for the analysis. These include the following SMD identifiers: 7197, 7199, 7200, 7201,
7203, 7205, 21096, 21097, 21098, 21099, 2370, 2371. Spot quality parameters were applied to each hybridization
to filter out sub-optimal data points: (1) The sum of raw channel intensities had to be >= 1000. (2) Channel intensity values
could not be saturated in more than 1 channel per hybridization. (3) 50% or more of the pixels in the spot had
to be greater than 1.5 times the background (in at least one channel per hybridization). (4) Flag = 0. (5) We included
only spots that were printed with DNA from good PCR reactions (SMD codes 0, 5 and 7). The lowess method by sector was
then used to normalize each hybridization (Yang et al., 2002).
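The five spot quality criteria can be expressed as a single predicate. The sketch below is illustrative only: the `Spot` record and its field names are hypothetical and do not correspond to actual SMD column names, while the thresholds are the ones listed above. Normalization by lowess is not shown.

```python
from dataclasses import dataclass

GOOD_PCR_CODES = {0, 5, 7}  # SMD codes for good PCR reactions, per the text


@dataclass
class Spot:
    # All field names are assumptions made for this example.
    ch1_intensity: float      # raw intensity, channel 1
    ch2_intensity: float      # raw intensity, channel 2
    ch1_saturated: bool
    ch2_saturated: bool
    ch1_pct_above_bg: float   # % of pixels > 1.5 times background, channel 1
    ch2_pct_above_bg: float   # % of pixels > 1.5 times background, channel 2
    flag: int
    pcr_code: int


def passes_quality(spot):
    """Return True if the spot satisfies all five criteria."""
    enough_signal = spot.ch1_intensity + spot.ch2_intensity >= 1000
    # Saturation is tolerated in at most one channel.
    not_doubly_saturated = not (spot.ch1_saturated and spot.ch2_saturated)
    # At least one channel must have 50% or more pixels above 1.5 times background.
    above_background = max(spot.ch1_pct_above_bg, spot.ch2_pct_above_bg) >= 50
    unflagged = spot.flag == 0
    good_pcr = spot.pcr_code in GOOD_PCR_CODES
    return (enough_signal and not_doubly_saturated and above_background
            and unflagged and good_pcr)


# Hypothetical spots: one acceptable, one failing only the PCR criterion.
good_spot = Spot(800, 400, False, True, 30, 75, 0, 5)
bad_pcr_spot = Spot(800, 400, False, False, 90, 90, 0, 3)
```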
All the organ hybridizations used in this study passed slide quality parameters:
(1) Hybridizations did not have a strong gradient in the ratios after normalization
(Gutierrez et al., 2002).
(2) Data in replicate hybridizations were reproducible. Reproducibility was qualitatively assessed by scatter plots of
the replicates. EST clones that had been printed several times were averaged and a final data table was generated by
calculating the median of redundant EST clones (those that represent the same gene). The mean was calculated for the
replicates.
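One possible reading of the aggregation steps described above is sketched here: repeated prints of the same EST clone are averaged, the median is taken across clones that represent the same gene, and the mean is taken across replicate hybridizations. The order of operations, the function names, and the `(gene, clone, value)` layout are assumptions for illustration, not a description of the original pipeline.

```python
from collections import defaultdict
from statistics import mean, median


def gene_values(spots):
    """Collapse spots from one hybridization to a single value per gene.

    spots -- iterable of (gene, clone, value) tuples (assumed layout)
    """
    by_clone = defaultdict(list)
    for gene, clone, value in spots:
        by_clone[(gene, clone)].append(value)

    by_gene = defaultdict(list)
    for (gene, _clone), values in by_clone.items():
        # Average clones that were printed several times.
        by_gene[gene].append(mean(values))

    # Median across redundant clones representing the same gene.
    return {gene: median(values) for gene, values in by_gene.items()}


def average_replicates(tables):
    """Mean per gene across replicate hybridizations (list of gene->value dicts)."""
    collected = defaultdict(list)
    for table in tables:
        for gene, value in table.items():
            collected[gene].append(value)
    return {gene: mean(values) for gene, values in collected.items()}


# Hypothetical data: clone c1 is printed twice in the first replicate.
replicate_1 = [("geneA", "c1", 1.0), ("geneA", "c1", 3.0),
               ("geneA", "c2", 4.0), ("geneA", "c3", 10.0)]
replicate_2 = [("geneA", "c1", 2.0)]
final_table = average_replicates([gene_values(replicate_1),
                                  gene_values(replicate_2)])
```

Using the median across clones limits the influence of a single outlying clone (c3 in the example), whereas the mean across replicates uses every hybridization equally.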
Putative transmembrane helices were predicted using TMHMM
(Krogh et al., 2001) through the web server available at http://www.cbs.dtu.dk/services/TMHMM/.
TMHMM uses a hidden Markov model to predict transmembrane helices from the primary protein structure.
Caution should be exercised when interpreting predictions for individual proteins. The TMHMM program can yield false positives and
false negatives (Krogh et al., 2001).
The success rate of TMHMM in discriminating soluble from membrane proteins is reported to
be higher than 99% for proteins without a signal peptide
(Krogh et al., 2001). To facilitate
correct interpretation of individual predictions, we provide the complete output of the programs.
The automatic functional assignments for the Arabidopsis genes represented in PLASdb were obtained from the MIPS Arabidopsis thaliana database (http://mips.gsf.de/proj/thal/db/index.html).
Ontology assignments were obtained from the TIGR database.
Protein domain analysis for the Arabidopsis proteins in PLASdb was obtained from the TIGR Arabidopsis thaliana database.
SignalP analysis for the Arabidopsis proteins in PLASdb was obtained from the TIGR Arabidopsis thaliana database.
References:
- Anantharaman V, Koonin EV, Aravind L (2002) Comparative genomics and evolution of proteins involved in RNA metabolism. Nucleic Acids Res 30: 1427-1464
- Beisson F, Koo AJ, Ruuska S, Schwender J, Pollard M, Thelen JJ, Paddock T, Salas JJ, Savage L, Milcamps A, Mhaske VB, Cho Y, Ohlrogge JB (2003) Arabidopsis genes involved in acyl lipid metabolism. A 2003 census of the candidates, a study of the distribution of expressed sequence tags in organs, and a web-based database. Plant Physiol 132: 681-697
- Emanuelsson O, Nielsen H, Brunak S, von Heijne G (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300: 1005-1016
- Gutierrez RA, Ewing RM, Cherry JM, Green PJ (2002) Identification of unstable transcripts in Arabidopsis by cDNA microarray analysis: rapid decay is associated with a group of touch- and specific clock-controlled genes. Proc Natl Acad Sci U S A 99: 11513-11518
- Gutierrez RA, Larson MD, Wilkerson C (2004) The Plant-Specific Database: classification of Arabidopsis proteins based on their phylogenetic profile. Plant Physiol, submitted
- Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305: 567-580
- Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30: e15
|