Bioinformatics Vol. 19 no. 7 2003
Pages 811-817
© 2003 Oxford University Press
SamCluster: an integrated scheme for automatic discovery of sample classes using gene expression profile
1 Beijing Institute of Basic Medical Sciences,
PO Box 130(3), Beijing 100850, People's Republic of China
2 Human Genetics Center, School of Public Health, University of
Texas-Houston Health Science Center, TX 77225, USA
Received on April 17, 2002; revised on October 27, 2002 and November 11, 2002; accepted on November 13, 2002
Motivation: Feature (gene) selection can dramatically improve the accuracy of sample class prediction based on gene expression profiles. Many statistical methods for feature (gene) selection, such as stepwise optimization and Monte Carlo simulation, have been developed for tissue sample classification. In contrast to class prediction, few statistical and computational feature selection methods have been applied to clustering algorithms for pattern discovery.
Results: An integrated scheme and corresponding program SamCluster for
automatic discovery of sample classes based on gene expression
profile is presented in this report. The scheme incorporates
feature selection algorithms based on the CV
(coefficient of variation) and the t-test into hierarchical
clustering and proceeds as follows. First, the genes whose
CV exceeds a pre-specified threshold are selected
for cluster analysis, which yields two putative sample
classes. Then, genes that are significantly differentially
expressed between the two putative sample classes, with t-test
p-values below 0.01, 0.05, or 0.1, are selected for further
cluster analysis. These steps are iterated until
two stable sample classes are found. Finally, the consensus
sample classes are constructed from the putative classes
derived from the different CV thresholds, and the best
putative sample classes, those at the minimum distance from
the consensus classes, are identified.
To evaluate the performance of the feature selection for
cluster analysis, the proposed scheme was applied to four
expression datasets: COLON, LEUKEMIA72, LEUKEMIA38, and OVARIAN.
The results show that only 5, 1, 0, and 0 samples,
respectively, were misclassified. We conclude that
the proposed scheme, SamCluster, is an efficient method for
discovery of sample classes using gene expression profiles.
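The iterative part of the scheme described above (CV filter, two-class hierarchical clustering, t-test reselection, repeat until stable) can be illustrated with a minimal Python sketch. This is not the SamCluster program itself: the function names, the use of NumPy and SciPy, average linkage with Euclidean distance, the default thresholds, and the iteration cap are all assumptions made for illustration, since the abstract does not specify them. The consensus step across different CV thresholds is not covered here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import ttest_ind


def two_classes(expr):
    """Cut a hierarchical tree of the samples (columns) into two groups.

    Average linkage and Euclidean distance are assumptions of this sketch.
    """
    tree = linkage(expr.T, method="average", metric="euclidean")
    return fcluster(tree, t=2, criterion="maxclust")  # labels are 1 or 2


def same_partition(a, b):
    """True if two 2-class labelings agree up to swapping the labels."""
    return np.array_equal(a, b) or np.array_equal(a, 3 - b)


def sketch_samcluster(expr, cv_threshold=0.5, p_threshold=0.05, max_iter=20):
    """expr: genes x samples array of positive expression values.

    Returns one putative class label (1 or 2) per sample.
    """
    # Step 1: keep genes whose coefficient of variation exceeds the threshold.
    cv = expr.std(axis=1, ddof=1) / expr.mean(axis=1)
    high_cv = cv > cv_threshold
    if high_cv.sum() < 2:
        raise ValueError("CV threshold leaves fewer than two genes")
    labels = two_classes(expr[high_cv])

    # Step 2: reselect genes by t-test between the putative classes and
    # recluster, until the two classes stop changing (or max_iter is hit).
    for _ in range(max_iter):
        _, p = ttest_ind(expr[:, labels == 1], expr[:, labels == 2], axis=1)
        selected = p < p_threshold  # NaN p-values compare as False
        if selected.sum() < 2:
            break
        new_labels = two_classes(expr[selected])
        if same_partition(labels, new_labels):
            break
        labels = new_labels
    return labels
```

A usage example on synthetic data (hypothetical, not one of the four datasets named above): build a 60-gene by 12-sample matrix in which the first 20 genes are shifted upward in the first six samples, then call `sketch_samcluster(expr, cv_threshold=0.1)`. Running the sketch once per CV threshold would give the set of putative classes from which a consensus could then be built.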
Availability: The related program SamCluster is available upon request or from the web page http://www.sph.uth.tmc.edu:8052/hgc/Downloads.asp
Contact: wujuli{at}yahoo.com