Using Genetic Algorithms to Select a Subset of Predictive Variables from a High-Dimensional Microarray Dataset
by Sam Roberts
Because they enable the parallel measurement of large numbers of genes (as many as 25,000), microarrays are used extensively for basic and pharmaceutical research in a variety of applications, including drug discovery, toxicology, and oncology. The exploding amount of data generated by microarrays and other research methods, however, poses problems for experimental design, visualization, and statistical analysis and interpretation.
 
What is a Microarray?
  A microarray consists of small lengths of DNA attached to a surface, typically glass or coated quartz. The DNA can either be spotted onto the surface using robotic technology or chemically synthesized 
in situ. Nucleic acids from an experimental sample are labeled chemically and hybridized to the DNA on the microarray. The microarray is then scanned to produce an image like Figure 1.
 We can display these images using commands from the 
Bioinformatics Toolbox. 
 ma=affyread('0029_1209_H95Av2_KF0077.CEL');
maimage(ma); 
        
 Figure 1. Microarray image. The brightness of each spot is a measure of the level of expression of the gene it represents. A high level of expression indicates that the gene was active in the experimental sample. Click on image to see enlarged view.    
Feature Selection
       Figure 1. Microarray image. The brightness of each spot is a measure of the level of expression of the gene it represents. A high level of expression indicates that the gene was active in the experimental sample. Click on image to see enlarged view.    
Feature Selection
 Many recent articles in the microarray literature have discussed the application of 
feature selection  methods to high-dimensional datasets. Researchers can use these methods to select smaller subsets of interesting genes, aiding the interpretation of statistical models while retaining the highest possible degree of the accuracy of models developed on the full dataset. Sometimes, by allowing statistical learning algorithms to focus only on highly predictive genes while ignoring redundant variables and irrelevant noise, feature selection methods can even 
improve  the accuracy of statistical models.
 Feature selection methods are often separated into two groups: 
filter  and 
wrapper  methods. Filter methods generally rank each gene individually by some quality criterion (for example, the 
p-value of 
t-test comparing two populations of interest with regard to the expression levels of the gene in the populations), and then select the subset of genes with the 
n  highest quality criteria. Wrapper methods use a 
search algorithm  to evaluate subsets of the variables as a group, rather than individually. An exhaustive search through all subsets is clearly not possible—there could be around 225,000 variable subsets to consider. Therefore, these search algorithms use heuristics to direct their search towards promising candidates. One such search method is the genetic algorithm. 
 
What is a Genetic Algorithm?
 The name “genetic algorithm” is not connected to the fact that we are analyzing gene expression data. A genetic algorithm is a general-purpose method for searching through a large number of objects to find one that optimizes a quantity of interest, using methods derived from Darwinian evolution. The algorithm randomly selects an initial population of objects and their “fitness” is evaluated by calculating the quantity of interest. The best ones are selected to become “parents” of a “child” population, by exchanging properties (“crossover”), and occasionally mutating to create new properties. This process is iteratively repeated, converging on a population that optimizes the quantity of interest. In the analysis described in this article, the objects are subsets of genes, and the quantity to be optimized is how well the subset can predict lymph node status.
 This article discusses the use of the 
Genetic Algorithm and Direct Search Toolbox to perform variable selection on a microarray dataset. Although our focus is on the analysis of microarray data, variable selection using genetic algorithms is a method applicable to many areas involving high-dimensional datasets: spectroscopy, quantitative structure-activity relationships, and many others.
 
Data
 We will be working with the 
preprocessed microarray dataset. The original data is publicly available and described in “Gene Expression Predictors of Breast Cancer Outcomes” (
Huang et al (2003)  [1]). The authors obtained microarray data from samples of primary breast tumors from 89 patients, as well as information concerning each patient's clinical state (lymph node status).
 Invasion of a tumor into axillary lymph nodes is a significant prognostic factor for breast cancer. Currently, the best method of classifying patients into subgroups of severity is the investigation by a pathologist of lymph node biopsies. However, this method is invasive, and also imperfectly predictive (22–33% of patients classified as low-risk go on to develop breast cancer). 
 The collection of gene expression data may add predictive value to the current clinical indicators, as it can provide new information that is believed to be relevant in sub-classifying tumors. 
 The authors originally used cluster analysis and principal component analysis to construct a small number of “metagenes” with which they developed a predictive model of patient status. In this article, we demonstrate the use of genetic algorithms to select a subset of the genes that is highly predictive of lymph node status. 
 
Importing the Dataset
 We downloaded our raw data directly from the authors' Website.
 load HuangData
numSamples = size(signal,1);
numVariables = size(signal,2);
whos            Name       Size       Bytes       Class             call        89x12625     1123625     logical array           descriptions
     1x12625     4049096     cell array            geneNames     1x12625     965230     cell array            lnNodeStatus        89x1     712     double array            numSamples        1x1     8     double array           numVariables
     1x1     8     double array            sampleNames        89x1     9918     cell array            signal       89x12625     8989000      double array    
Grand total is 4024632 elements using 15137597 bytes   For ease of exposition, we preprocessed the data a little. In particular, the lymph node status has been re-ordered so as to match the order of the samples, and the microarray signal, absent/present call (see below), gene names, and gene descriptions have been extracted from the raw dataset into their own separate variables. 
 
Dimensionality Reduction
  We first remove all genes that are detected imperfectly or only at very low levels. The microarrays used in this study, which are from 
Affymetrix, give rise to an absent/present call for each gene (an indicator of whether the gene has been detected at significant levels). We include only genes that have been detected as present.
    for i=1:numVariables      calledPresent(i) = all(call(:,i)); end
geneNames = geneNames(calledPresent);
descriptions = descriptions(calledPresent);
signal = signal(:,calledPresent);
call = call(:,calledPresent);
numVariables = size(signal,2);
whos            Name       Size       Bytes       Class             call         89x1447       128783      logical array            calledPresent 
     1x12625      12625       logical array             descriptions      1x1447       468102      cell array            geneNames      1x1447       110798      cell array 
            lnNodeStatus      89x1       712       double array             numSamples 
     1x1     8     double array            numVariables     1x1     8     double array            sampleNames       89x1       9918       cell array             signal        89x1447       1030264      double array     
Grand total is 478185 elements using 1761226 bytes  
Setting up the Genetic Algorithm
 The default options for genetic algorithms in the Genetic Algorithm and Direct Search toolbox are most suitable for searching though continuous spaces. In this application, we are searching through a discrete space (subsets of variables), so we need to customize them. First, we write a custom 
fitness function.
   type varselect_gafit.m
  function errorRate = varselect_gafit(variableIndices,...
    data,groups,numSamples)
% Create ten-fold cross-validation indices
crossValidationIndices = mod(1:numSamples,10);
% Initialise a count of the errors so far
errorsSoFar = 0;
% Repeat for each of the ten cross-validations
for i = 0:9
    % Create a test and training set
    testIndices = (crossValidationIndices == i); 
    trainIndices = ~testIndices;
    % Generate a classifier to discriminate groups in the data
    classes = classify(...
        data(testIndices,variableIndices),...
        data(trainIndices,variableIndices),...
        groups(trainIndices,:));
    % Find classification errors and update the errors so far
    errors = sum(classes ~= groups(testIndices,:));
    errorsSoFar = errorsSoFar+errors;
end
% Extract the final error rate
errorRate = errorsSoFar/numSamples;  
This function uses the classify tool from the 
Statistics Toolbox to discriminate between two groups (lymph node status negative and positive), using the subset of variables currently being evaluated. The error rate of the classification model generated is estimated using 10-fold cross-validation. The estimated error rate is the fitness function minimized by the genetic algorithm. 
 We also need to customize the function used to generate an initial population for the genetic algorithm. 
 type varselect_gacreate.m
 function population = varselect_gacreate(numSelectedVariables,FitnessFcn,...
    options,numVariables)
% Repeat for each member of the population
for i = 1:options.PopulationSize
    % Initialise a flag to indicate that all selected variables are
    % distinct from each other
    allVariablesUnique = 0;
    % Loop continuously until all selected variables are unique
    while ~allVariablesUnique
        % Randomly select a subset of variable indices
        variables = floor(rand(numSelectedVariables,1)*numVariables)+1;
        % Check that all variables are different. If so, set the flag
        % to break out of the loop, otherwise loop again
        if size(variables) == size(unique(variables))
            allVariablesUnique = 1;
        end
    end
    % Add the current variable subset to the population
    population(i,:) = variables;
end  This function randomly selects a subset of variables from the dataset. If the variables are all different, the subset is included in the initial population for the genetic algorithm. If not, the function generates another random subset until an initial population of the desired size has been created. 
 Lastly, we need to make small changes to the default crossover and mutation functions to ensure that they always return subsets of variables that do not contain repeated variables. To access these functions, enter the following commands: 
 type varselect_crossoverscattered.m
type varselect_mutationuniform.m 
 
Setting Algorithm Options
 We can set various options regarding the performance of the genetic algorithm, for example, the population size, the proportion of each new generation that is generated by crossover and by mutation, the criteria used to halt the algorithm (in terms of number of generations or time), and the command-line and graphical outputs of the algorithm. 
   options = gaoptimset('CreationFcn',{@varselect_gacreate,numVariables},...
                         'CrossoverFcn',@varselect_crossoverscattered,...
            'MutationFcn',@varselect_mutationuniform,...
            'PopulationSize',100,...
            'PopInitRange',[1;numVariables],...
            'FitnessLimit', 0,...
            'StallGenLimit',100,...
            'StallTimeLimit',600,...
            'TimeLimit',600,...
            'Generations',100,...
                         'CrossoverFraction',0.5,...
            'Display','iter',...
            'PlotFcn',@gaplotbestf);   
Running the Genetic Algorithm 
 We can now run the genetic algorithm to search for the subset of genes that best predicts patient lymph node status. 
 numSelectedVars = 10;
FitnessFcn = {@varselect_gafit,signal,lnNodeStatus,numSamples};
[selectedVars, errorRate,reason,output] = ga(FitnessFcn, ...
        numSelectedVars,options); 
 selectedVars =    Columns 1 through 5 
        1149    868    929    920    1170
  Columns 6 through 10 
         792    1050    556    680    458
 errorRate =
0.0225        The algorithm takes approximately 1,000 seconds to complete on a Pentium M processor with 1GB of RAM. 
 Figure 2 shows the graphical output of the algorithm. Note that, as the genetic algorithm involves a random component, the results will not always exactly match this output. 
         
 Figure 2. The minimum and mean error rate of variable subsets in the current population, by generation. Click on image to see enlarged view.    
Evaluating and Interpreting the Results
       Figure 2. The minimum and mean error rate of variable subsets in the current population, by generation. Click on image to see enlarged view.    
Evaluating and Interpreting the Results 
 The selected subset of 10 genes predicts the lymph node status of patients with only a 2.25% error rate. However, this is probably an underestimate of the true error rate, as the subset of variables has been selected by searching through thousands of possible subsets, and it is likely that some will prove accurate predictors just by chance. To provide a more accurate estimate of the likely error rate of the model on unseen cases, we would ideally want to validate the predictions on a test set of unseen data. 
 The genes also require biological validation to confirm that they are possible candidates for biomarkers of lymph node status. We can display the descriptions of the selected genes to get some idea of their function. 
 selectedDescriptions = descriptions(selectedVars)';
for i=1:10
 disp(selectedDescriptions{i}(15:80))  end 
 AL109682:Homo sapiens mRNA full length insert cDNA clone EUROIMAGE
J02940:Human platelet glycoprotein Ib alpha chain mRNA, complete c
AF052145:Homo sapiens clone 24400 mRNA sequence /cds=UNKNOWN /gb=A
AF007833:Homo sapiens kruppel-related zinc finger protein hcKrox m
H55746:140A6 Homo sapiens cDNA /gb=H55746 /gi=1004390 /ug=Hs.28704
X97249:H.sapiens mRNA for leucine-rich primary response protein 1 
W29040:55d6 Homo sapiens cDNA /gb=W29040 /gi=1308997 /ug=Hs.199320
AF054177:Homo sapiens chromodomain-helicase-DNA-binding protein mR
AA079018:zm94e12.s1 Homo sapiens cDNA, 3  end /clone=IMAGE-545614 
J04755:Human ferritin H processed pseudogene, complete cds /cds=UN  To investigate further, we can submit the selected genes as a batch query to 
NetAffx, a publicly accessible database made available by Affymetrix that contains details of the genes on the microarrays. (Access to this Website requires a free registration with Affymetrix.) We create a batch query file, suitable for submission to the NetAffx Website, containing the names of the genes. 
  selectedGeneNames=geneNames(selectedVars)';
fid = fopen('query.txt','w');
for i = 1:numSelectedVars  fprintf(fid, '%s\n',selectedGeneNames{i,:});  end
fclose(fid);
type('query.txt'); 
 34538_at
33063_at
33601_at
33592_at
34559_at
32987_at
34081_at
31914_at
32396_f_at
31697_s_at  We submit the file query.txt directly to the NetAffx Website as a batch query. Figure 3 shows the query results. 
         
 Figure 3. Results of a batch query to NetAffx on the selected genes. Click on image to see enlarged view.
       Figure 3. Results of a batch query to NetAffx on the selected genes. Click on image to see enlarged view.    
 We can also obtain more details on each gene by submitting individual queries on them to NetAffx (you must be logged in to NetAffx for this to work). Figure 4 shows a sample result from a query. 
  for i = 1:numSelectedVars  query=sprintf('https://www.affymetrix.com
/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2%%3A%s',selectedGeneNames{i,:});
web(query, '-new')  end          
 Figure 4. Example details of a selected gene. Click on image to see enlarged view.
       Figure 4. Example details of a selected gene. Click on image to see enlarged view.    
 Many areas of science now generate huge volumes of data that present visualization, modeling, and interpretation challenges. Methods for selecting small but highly relevant and predictive data subsets are therefore receiving much attention. In this article, we demonstrated the use of genetic algorithms, from the Genetic Algorithms and Direct Search Toolbox, coupled with predictive modeling methods from the Statistics Toolbox, to select potential biomarkers of breast tumor severity from a microarray dataset.
更多...