The choice of an algorithm for classification is in many ways the easiest part of developing a scheme for object classification. The discussion above demonstrates that there are several ``off-the-shelf'' approaches available (though there is obviously still room for improvement). There are two major hurdles to be faced before these methods can be used, though: a training set must be constructed for which the true classifications of the objects are known, and a set of object parameters must be chosen that are powerful discriminators for classification. Once a possible classifier has been identified, it is necessary to measure its accuracy.
A training set must contain a list of objects with known classifications. Ideally the training set should contain many examples (typically thousands of objects) so that it includes both common and rare types of objects. Creating a training set requires a source of true object classifications; such classifications are usually difficult to generate, even for human experts, if they must rely on the same data used by the classifier.
To construct a training set for the star-galaxy classification problem, the best approach we have found is to use a more sensitive, higher-resolution image to get true classifications of the objects seen in the survey data. In some cases (e.g. for the Sloan Digital Sky Survey), it may be quite difficult and time-consuming to acquire images that are significantly better than the survey images; in that case it may be possible to use instead simulated data (where the true classifications are known). A risk in using simulated images is that they may not include all the effects that make classification difficult in the real data.
Adding many irrelevant parameters makes classification harder for all methods, not just the nearest neighbor methods. Training classifiers is an optimization problem in a many-dimensional space. Increasing the dimensionality of the space by adding more parameters makes the optimization harder (and the difficulty grows exponentially with the number of parameters). It is always better to give the algorithm only the necessary parameters rather than expecting it to learn to ignore the irrelevant parameters.
Do not ask the classifier to rediscover everything you already know about the data. Not only should irrelevant parameters be omitted, but highly correlated parameters should be combined when possible to produce a few powerful features. For example, if you expect the shapes of images of a particular class of object to be similar, include a brightness-independent shape parameter rather than simply giving the classifier raw pixel values and expecting it to figure out how to extract shape information from the pixel values.
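As a hypothetical illustration of such a combined feature, the sketch below computes a concentration ratio from a small pixel cutout. The cutout format, the use of square apertures, and the aperture sizes are all assumptions made for this example; the point is only that the ratio is unchanged when every pixel is scaled by a constant, so it measures shape rather than brightness.

```python
# Illustrative sketch (not the method of any particular survey pipeline):
# a brightness-independent "concentration" feature computed from a small
# image cutout, in place of handing the classifier raw pixel values.

def concentration(cutout, inner=1, outer=3):
    """Ratio of the flux in an inner box to the flux in an outer box,
    both centred on the brightest pixel. Multiplying every pixel by a
    constant cancels in the ratio, so the feature depends on shape,
    not on overall brightness."""
    ny, nx = len(cutout), len(cutout[0])
    # locate the brightest pixel
    cy, cx = max(((y, x) for y in range(ny) for x in range(nx)),
                 key=lambda p: cutout[p[0]][p[1]])

    def box_flux(r):
        return sum(cutout[y][x]
                   for y in range(max(0, cy - r), min(ny, cy + r + 1))
                   for x in range(max(0, cx - r), min(nx, cx + r + 1)))

    return box_flux(inner) / box_flux(outer)
```

A point-like source concentrates its flux near the peak and scores near 1, while an extended source scores lower; that single number is far easier for a classifier to exploit than the raw pixels it summarizes.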
If the training process does not require too much computation, a useful approach to identifying the best parameters is to train many times on subsets of the features. We have used this method in two ways: starting from the complete list of features and reducing it by removing parameters, and starting from a minimal list and augmenting it by adding parameters. Both methods have proven effective at pruning unnecessary parameters [SCF 95]. This procedure can be very fast if axis-parallel trees are used for the exploration. (Note that OC1 can construct either axis-parallel or oblique trees.)
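The two search directions can be sketched generically. The code below assumes a caller-supplied score(features) function that trains a classifier on the given feature subset and returns its measured accuracy; the greedy one-feature-at-a-time loops and the stopping rules are our own illustrative choices, not a description of the procedure used in [SCF 95].

```python
# Sketch of feature-subset search in both directions, assuming a
# score(features) callable that trains and evaluates a classifier on
# the named features. Greedy strategy and stopping rules are
# illustrative assumptions.

def forward_select(all_features, score, min_gain=0.0):
    """Start from a minimal (empty) list and add the single most
    helpful feature each round, stopping when accuracy stops rising."""
    selected, best = [], float("-inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            return selected
        f, s = max(((f, score(selected + [f])) for f in candidates),
                   key=lambda t: t[1])
        if selected and s - best <= min_gain:
            return selected
        selected.append(f)
        best = s

def backward_prune(all_features, score):
    """Start from the complete list and drop the least useful feature
    each round, for as long as accuracy does not fall."""
    selected = list(all_features)
    best = score(selected)
    while len(selected) > 1:
        f, s = max(((f, score([g for g in selected if g != f]))
                    for f in selected), key=lambda t: t[1])
        if s < best:
            return selected
        selected.remove(f)
        best = s
    return selected
```

With a well-behaved scoring function both searches converge on the same informative subset, which is the behavior reported for the pruning experiments above.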
Another useful approach with the decision tree methods is to examine directly the weights assigned to the various features. Important features are given a high weight, while unimportant features may not be used at all. This information can be used as a guide for pruning experiments.
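For an axis-parallel tree, one simple way to turn the fitted tree into such a guide is to tally how often each feature appears in a split, weighted by the number of training objects reaching that split. The nested-dict tree representation below is a hypothetical format chosen for illustration; it is not the OC1 file format.

```python
# Hypothetical sketch: given a fitted axis-parallel tree stored as
# nested dicts, weight each feature by the number of training objects
# passing through the splits that test it. The dict layout is an
# assumption for illustration only.

def feature_usage(node, counts=None):
    """Return {feature: total objects routed through splits on it}."""
    if counts is None:
        counts = {}
    if node.get("leaf"):
        return counts
    counts[node["feature"]] = counts.get(node["feature"], 0) + node["n_samples"]
    feature_usage(node["left"], counts)
    feature_usage(node["right"], counts)
    return counts
```

Features near the root accumulate large weights; features that never appear in the tally are candidates for removal in the pruning experiments described above.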
Once a potentially useful classifier has been constructed, the accuracy of the classifier must be measured. Knowledge of the accuracy is necessary both in the application of the classifier and also in comparison of different classifiers.
The accuracy can be determined by applying the classifier to an independent test set of objects with known classifications. This is sometimes trickier than it sounds. Since training sets are usually difficult to assemble, one rarely has the resources to construct yet another set of objects with known classifications purely for testing. One must avoid the temptation to train and test on the same set of objects, though. Once an object has been used for training, any test using it is necessarily biased.
We normally use five-fold cross-validation to measure the accuracy of our classifiers. The training set is divided into five randomly selected subsets having roughly equal numbers of objects. The classifier is then trained five times, excluding a single subset each time. The resulting classifier is tested on the excluded subset. Note that each training session must be completely independent of the excluded subset of objects; one cannot, for example, use the results of an earlier training session as a starting point.
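The procedure can be sketched as follows. The train(pairs) and classify(clf, obj) callables are placeholders for whatever learner is being evaluated; the fold construction mirrors the description above, and each fold's classifier is trained from scratch with no knowledge of the held-out objects.

```python
# Minimal sketch of k-fold cross-validation. train() returns a fitted
# classifier from (object, label) pairs; classify() applies it to one
# object. Both are caller-supplied placeholders.
import random

def cross_validate(objects, labels, train, classify, k=5, seed=0):
    """Return the fraction of correct held-out predictions over k folds."""
    idx = list(range(len(objects)))
    random.Random(seed).shuffle(idx)      # random, roughly equal folds
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held = set(fold)
        # train from scratch, completely independent of the held-out fold
        clf = train([(objects[i], labels[i]) for i in idx if i not in held])
        correct += sum(classify(clf, objects[i]) == labels[i] for i in fold)
    return correct / len(objects)
```

Because every object is held out exactly once, the returned accuracy is an average over the whole training set rather than over one arbitrary split.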
The advantage of cross-validation is that all objects in the training set get used both as test objects and as training objects. This ensures that the classifier is tested on both rare and common types of objects. The cost of cross-validation is that the training process must be repeated many times, adding to the computational cost of the training. In most applications, though, the computer time necessary to repeat the training is more readily available than is the human expert time required to generate completely independent test and training sets.