Labfans是一个针对大学生、工程师和科研工作者的技术社区。 | 论坛首页 | 联系我们(Contact Us) |
![]() |
|
![]() |
#1 |
游客
帖子: n/a
|
![]()
Using Genetic Algorithms to Select a Subset of Predictive Variables from a High-Dimensional Microarray Dataset
by Sam Roberts Because they enable the parallel measurement of large numbers of genes (as many as 25,000), microarrays are used extensively for basic and pharmaceutical research in a variety of applications, including drug discovery, toxicology, and oncology. The exploding amount of data generated by microarrays and other research methods, however, poses problems for experimental design, visualization, and statistical analysis and interpretation. What is a Microarray? A microarray consists of small lengths of DNA attached to a surface, typically glass or coated quartz. The DNA can either be spotted onto the surface using robotic technology or chemically synthesized in situ. Nucleic acids from an experimental sample are labeled chemically and hybridized to the DNA on the microarray. The microarray is then scanned to produce an image like Figure 1. We can display these images using commands from the Bioinformatics Toolbox. ma=affyread('0029_1209_H95Av2_KF0077.CEL'); maimage(ma); ![]() Feature Selection Many recent articles in the microarray literature have discussed the application of feature selection methods to high-dimensional datasets. Researchers can use these methods to select smaller subsets of interesting genes, aiding the interpretation of statistical models while retaining the highest possible degree of the accuracy of models developed on the full dataset. Sometimes, by allowing statistical learning algorithms to focus only on highly predictive genes while ignoring redundant variables and irrelevant noise, feature selection methods can even improve the accuracy of statistical models. Feature selection methods are often separated into two groups: filter and wrapper methods. Filter methods generally rank each gene individually by some quality criterion (for example, the p-value of t-test comparing two populations of interest with regard to the expression levels of the gene in the populations), and then select the subset of genes with the n highest quality criteria. Wrapper methods use a search algorithm to evaluate subsets of the variables as a group, rather than individually. An exhaustive search through all subsets is clearly not possible—there could be around 225,000 variable subsets to consider. Therefore, these search algorithms use heuristics to direct their search towards promising candidates. One such search method is the genetic algorithm. What is a Genetic Algorithm? The name “genetic algorithm” is not connected to the fact that we are analyzing gene expression data. A genetic algorithm is a general-purpose method for searching through a large number of objects to find one that optimizes a quantity of interest, using methods derived from Darwinian evolution. The algorithm randomly selects an initial population of objects and their “fitness” is evaluated by calculating the quantity of interest. The best ones are selected to become “parents” of a “child” population, by exchanging properties (“crossover”), and occasionally mutating to create new properties. This process is iteratively repeated, converging on a population that optimizes the quantity of interest. In the analysis described in this article, the objects are subsets of genes, and the quantity to be optimized is how well the subset can predict lymph node status. This article discusses the use of the Genetic Algorithm and Direct Search Toolbox to perform variable selection on a microarray dataset. Although our focus is on the analysis of microarray data, variable selection using genetic algorithms is a method applicable to many areas involving high-dimensional datasets: spectroscopy, quantitative structure-activity relationships, and many others. Data We will be working with the preprocessed microarray dataset. The original data is publicly available and described in “Gene Expression Predictors of Breast Cancer Outcomes” (Huang et al (2003) [1]). The authors obtained microarray data from samples of primary breast tumors from 89 patients, as well as information concerning each patient's clinical state (lymph node status). Invasion of a tumor into axillary lymph nodes is a significant prognostic factor for breast cancer. Currently, the best method of classifying patients into subgroups of severity is the investigation by a pathologist of lymph node biopsies. However, this method is invasive, and also imperfectly predictive (22–33% of patients classified as low-risk go on to develop breast cancer). The collection of gene expression data may add predictive value to the current clinical indicators, as it can provide new information that is believed to be relevant in sub-classifying tumors. The authors originally used cluster analysis and principal component analysis to construct a small number of “metagenes” with which they developed a predictive model of patient status. In this article, we demonstrate the use of genetic algorithms to select a subset of the genes that is highly predictive of lymph node status. Importing the Dataset We downloaded our raw data directly from the authors' Website. load HuangData numSamples = size(signal,1); numVariables = size(signal,2); whos Name Size Bytes Class call 89x12625 1123625 logical array descriptions 1x12625 4049096 cell array geneNames 1x12625 965230 cell array lnNodeStatus 89x1 712 double array numSamples 1x1 8 double array numVariables 1x1 8 double array sampleNames 89x1 9918 cell array signal 89x12625 8989000 double array Grand total is 4024632 elements using 15137597 bytes For ease of exposition, we preprocessed the data a little. In particular, the lymph node status has been re-ordered so as to match the order of the samples, and the microarray signal, absent/present call (see below), gene names, and gene descriptions have been extracted from the raw dataset into their own separate variables. Dimensionality Reduction We first remove all genes that are detected imperfectly or only at very low levels. The microarrays used in this study, which are from Affymetrix, give rise to an absent/present call for each gene (an indicator of whether the gene has been detected at significant levels). We include only genes that have been detected as present. for i=1:numVariables calledPresent(i) = all(call(:,i)); end geneNames = geneNames(calledPresent); descriptions = descriptions(calledPresent); signal = signal(:,calledPresent); call = call(:,calledPresent); numVariables = size(signal,2); whos Name Size Bytes Class call 89x1447 128783 logical array calledPresent 1x12625 12625 logical array descriptions 1x1447 468102 cell array geneNames 1x1447 110798 cell array lnNodeStatus 89x1 712 double array numSamples 1x1 8 double array numVariables 1x1 8 double array sampleNames 89x1 9918 cell array signal 89x1447 1030264 double array Grand total is 478185 elements using 1761226 bytes Setting up the Genetic Algorithm The default options for genetic algorithms in the Genetic Algorithm and Direct Search toolbox are most suitable for searching though continuous spaces. In this application, we are searching through a discrete space (subsets of variables), so we need to customize them. First, we write a custom fitness function. type varselect_gafit.m function errorRate = varselect_gafit(variableIndices,... data,groups,numSamples) % Create ten-fold cross-validation indices crossValidationIndices = mod(1:numSamples,10); % Initialise a count of the errors so far errorsSoFar = 0; % Repeat for each of the ten cross-validations for i = 0:9 % Create a test and training set testIndices = (crossValidationIndices == i); trainIndices = ~testIndices; % Generate a classifier to discriminate groups in the data classes = classify(... data(testIndices,variableIndices),... data(trainIndices,variableIndices),... groups(trainIndices,:)); % Find classification errors and update the errors so far errors = sum(classes ~= groups(testIndices,:)); errorsSoFar = errorsSoFar+errors; end % Extract the final error rate errorRate = errorsSoFar/numSamples; This function uses the classify tool from the Statistics Toolbox to discriminate between two groups (lymph node status negative and positive), using the subset of variables currently being evaluated. The error rate of the classification model generated is estimated using 10-fold cross-validation. The estimated error rate is the fitness function minimized by the genetic algorithm. We also need to customize the function used to generate an initial population for the genetic algorithm. type varselect_gacreate.m function population = varselect_gacreate(numSelectedVariables,FitnessFcn,... options,numVariables) % Repeat for each member of the population for i = 1:options.PopulationSize % Initialise a flag to indicate that all selected variables are % distinct from each other allVariablesUnique = 0; % Loop continuously until all selected variables are unique while ~allVariablesUnique % Randomly select a subset of variable indices variables = floor(rand(numSelectedVariables,1)*numVariables)+1; % Check that all variables are different. If so, set the flag % to break out of the loop, otherwise loop again if size(variables) == size(unique(variables)) allVariablesUnique = 1; end end % Add the current variable subset to the population population(i,:) = variables; end This function randomly selects a subset of variables from the dataset. If the variables are all different, the subset is included in the initial population for the genetic algorithm. If not, the function generates another random subset until an initial population of the desired size has been created. Lastly, we need to make small changes to the default crossover and mutation functions to ensure that they always return subsets of variables that do not contain repeated variables. To access these functions, enter the following commands: type varselect_crossoverscattered.m type varselect_mutationuniform.m Setting Algorithm Options We can set various options regarding the performance of the genetic algorithm, for example, the population size, the proportion of each new generation that is generated by crossover and by mutation, the criteria used to halt the algorithm (in terms of number of generations or time), and the command-line and graphical outputs of the algorithm. options = gaoptimset('CreationFcn',{@varselect_gacreate,numVariables},... 'CrossoverFcn',@varselect_crossoverscattered,... 'MutationFcn',@varselect_mutationuniform,... 'PopulationSize',100,... 'PopInitRange',[1;numVariables],... 'FitnessLimit', 0,... 'StallGenLimit',100,... 'StallTimeLimit',600,... 'TimeLimit',600,... 'Generations',100,... 'CrossoverFraction',0.5,... 'Display','iter',... 'PlotFcn',@gaplotbestf); Running the Genetic Algorithm We can now run the genetic algorithm to search for the subset of genes that best predicts patient lymph node status. numSelectedVars = 10; FitnessFcn = {@varselect_gafit,signal,lnNodeStatus,numSamples}; [selectedVars, errorRate,reason,output] = ga(FitnessFcn, ... numSelectedVars,options); selectedVars = Columns 1 through 5 1149 868 929 920 1170 Columns 6 through 10 792 1050 556 680 458 errorRate = 0.0225 The algorithm takes approximately 1,000 seconds to complete on a Pentium M processor with 1GB of RAM. Figure 2 shows the graphical output of the algorithm. Note that, as the genetic algorithm involves a random component, the results will not always exactly match this output. ![]() Evaluating and Interpreting the Results The selected subset of 10 genes predicts the lymph node status of patients with only a 2.25% error rate. However, this is probably an underestimate of the true error rate, as the subset of variables has been selected by searching through thousands of possible subsets, and it is likely that some will prove accurate predictors just by chance. To provide a more accurate estimate of the likely error rate of the model on unseen cases, we would ideally want to validate the predictions on a test set of unseen data. The genes also require biological validation to confirm that they are possible candidates for biomarkers of lymph node status. We can display the descriptions of the selected genes to get some idea of their function. selectedDescriptions = descriptions(selectedVars)'; for i=1:10 disp(selectedDescriptions{i}(15:80)) end AL109682:Homo sapiens mRNA full length insert cDNA clone EUROIMAGE J02940:Human platelet glycoprotein Ib alpha chain mRNA, complete c AF052145:Homo sapiens clone 24400 mRNA sequence /cds=UNKNOWN /gb=A AF007833:Homo sapiens kruppel-related zinc finger protein hcKrox m H55746:140A6 Homo sapiens cDNA /gb=H55746 /gi=1004390 /ug=Hs.28704 X97249:H.sapiens mRNA for leucine-rich primary response protein 1 W29040:55d6 Homo sapiens cDNA /gb=W29040 /gi=1308997 /ug=Hs.199320 AF054177:Homo sapiens chromodomain-helicase-DNA-binding protein mR AA079018:zm94e12.s1 Homo sapiens cDNA, 3 end /clone=IMAGE-545614 J04755:Human ferritin H processed pseudogene, complete cds /cds=UN To investigate further, we can submit the selected genes as a batch query to NetAffx, a publicly accessible database made available by Affymetrix that contains details of the genes on the microarrays. (Access to this Website requires a free registration with Affymetrix.) We create a batch query file, suitable for submission to the NetAffx Website, containing the names of the genes. selectedGeneNames=geneNames(selectedVars)'; fid = fopen('query.txt','w'); for i = 1:numSelectedVars fprintf(fid, '%s\n',selectedGeneNames{i,:}); end fclose(fid); type('query.txt'); 34538_at 33063_at 33601_at 33592_at 34559_at 32987_at 34081_at 31914_at 32396_f_at 31697_s_at We submit the file query.txt directly to the NetAffx Website as a batch query. Figure 3 shows the query results. ![]() We can also obtain more details on each gene by submitting individual queries on them to NetAffx (you must be logged in to NetAffx for this to work). Figure 4 shows a sample result from a query. for i = 1:numSelectedVars query=sprintf('https://www.affymetrix.com /analysis/netaffx/fullrecord.affx?pk=HG-U95AV2%%3A%s',selectedGeneNames{i,:}); web(query, '-new') end ![]() Many areas of science now generate huge volumes of data that present visualization, modeling, and interpretation challenges. Methods for selecting small but highly relevant and predictive data subsets are therefore receiving much attention. In this article, we demonstrated the use of genetic algorithms, from the Genetic Algorithms and Direct Search Toolbox, coupled with predictive modeling methods from the Statistics Toolbox, to select potential biomarkers of breast tumor severity from a microarray dataset. 更多... |
![]() |