Any classification method uses a set of features or parameters to characterize each object, where these features should be relevant to the task at hand. We consider here methods for supervised classification, meaning that a human expert has both determined the classes into which objects may be categorized and provided a set of sample objects with known classes. This set of known objects is called the training set because it is used by the classification programs to learn how to classify objects. There are two phases to constructing a classifier. In the training phase, the training set is used to decide how the parameters ought to be weighted and combined in order to separate the various classes of objects. In the application phase, the weights determined during training are applied to a set of objects with unknown classes in order to determine what their classes are likely to be.
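As a concrete illustration of the two phases, the sketch below trains a deliberately trivial one-parameter threshold classifier on a labeled training set and then applies it to an object of unknown class. The feature values, class names, and threshold rule are hypothetical and chosen only to make the train/apply split explicit.

```python
# Minimal illustration of the two phases of supervised classification.
# The "classifier" is deliberately trivial: it learns a single threshold
# on one (hypothetical) feature that best separates the two classes in
# the training set.

def train(training_set):
    """Training phase: pick the threshold that misclassifies the fewest
    training objects."""
    values = sorted(value for value, _ in training_set)
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, None
    for t in candidates:
        errors = sum(1 for value, label in training_set
                     if (value > t) != (label == "galaxy"))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def classify(threshold, value):
    """Application phase: assign a class to an object of unknown class."""
    return "galaxy" if value > threshold else "star"

# Hypothetical training set of (feature value, known class) pairs.
training_set = [(0.2, "star"), (0.3, "star"), (0.7, "galaxy"), (0.9, "galaxy")]
threshold = train(training_set)
print(classify(threshold, 0.8))   # -> galaxy
```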
If a problem has only a few (two or three) important parameters, classification is usually easy. For example, with two parameters one can often simply make a scatter plot of the feature values and determine graphically how to divide the plane into homogeneous regions where the objects belong to the same class. The classification problem becomes very hard, though, when there are many parameters to consider. Not only is the resulting high-dimensional space difficult to visualize, but there are so many different combinations of parameters that techniques based on exhaustive searches of the parameter space rapidly become computationally infeasible. Practical methods for classification therefore always involve a heuristic approach intended to find a ``good-enough'' solution to the optimization problem.
There are a number of standard classification methods in use. Neural network methods are probably the most widely known. Odewahn et al. [OSP92] applied neural networks to the star-galaxy classification problem on digitized photographic plates and obtained good results for objects in a limited brightness range. The biggest advantage of neural network methods is their generality: they can handle problems with very many parameters, and they are able to classify objects well even when the distribution of objects in the N-dimensional parameter space is very complex. The disadvantage of neural networks is that they are notoriously slow, especially in the training phase but also in the application phase. Another significant disadvantage is that it is very difficult to determine how the network is making its decision. Consequently, it is hard to determine which of the image features being used are important and useful for classification and which are worthless. As I discuss below (§ 3), the choice of the best features is an important part of developing a good classifier, and neural nets do not give much help in this process.
A very simple classifier can be based on a nearest-neighbor approach. In this method, one simply finds the object in the training set that is closest, in the N-dimensional feature space, to the object being classified. Since this neighbor is nearby, it is likely to be similar to the object being classified and so is likely to belong to the same class. Nearest-neighbor methods have the advantage that they are easy to implement. They can also give quite good results if the features are chosen carefully (and if they are weighted carefully in the computation of the distance). There are several serious disadvantages to nearest-neighbor methods. First, they (like neural networks) do not simplify the distribution of objects in parameter space to a comprehensible set of parameters. Instead, the training set is retained in its entirety as a description of the object distribution. (There are thinning methods that can be applied to the training set, but the result still does not usually constitute a compact description of the object distribution.) The method is also rather slow if the training set has many examples. The most serious shortcoming of nearest-neighbor methods is that they are very sensitive to the presence of irrelevant parameters. Adding a single parameter that has a random value for all objects (so that it does not separate the classes) can cause these methods to fail miserably.
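A minimal sketch of the nearest-neighbor rule follows, assuming the features have already been scaled so that a plain Euclidean distance is a sensible measure of similarity (the careful weighting mentioned above). The feature values and classes are hypothetical.

```python
import math

def nearest_neighbor_class(training_set, features):
    """Return the class of the training object closest to `features`
    in Euclidean distance.  `training_set` is a list of
    (feature_vector, class_label) pairs."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(training_set,
                   key=lambda item: distance(item[0], features))
    return label

# Hypothetical, already-scaled training set.
training_set = [((0.1, 0.9), "star"),
                ((0.2, 0.8), "star"),
                ((0.8, 0.3), "galaxy"),
                ((0.9, 0.1), "galaxy")]
print(nearest_neighbor_class(training_set, (0.75, 0.2)))   # -> galaxy
```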
Decision tree methods have also been used for star-galaxy classification problems [FWD93, DWF94, Wei94]. In axis-parallel decision tree methods, a binary tree is constructed in which at each node a single parameter is compared to some constant. If the feature value is greater than the threshold, the right branch of the tree is taken; if the value is smaller, the left branch is followed. After a series of these tests, one reaches a leaf node of the tree where all the objects are labeled as belonging to a particular class. These are called axis-parallel trees because they correspond to partitioning the parameter space with a set of hyperplanes that are parallel to all of the feature axes except for the one being tested.
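The node test described above is easy to express directly. The sketch below walks a small hand-built axis-parallel tree; the feature indices, thresholds, and class labels are invented purely for illustration and do not correspond to any published tree.

```python
# A node is either a leaf (a class label) or a dict holding the index of
# the feature to test, the threshold, and the two subtrees.
def classify_with_tree(node, features):
    """Follow the axis-parallel tests down to a leaf and return its label."""
    while isinstance(node, dict):
        if features[node["feature"]] > node["threshold"]:
            node = node["right"]
        else:
            node = node["left"]
    return node

# Hypothetical two-level tree over features (area, ellipticity).
tree = {"feature": 0, "threshold": 5.0,
        "left": "star",
        "right": {"feature": 1, "threshold": 0.3,
                  "left": "star", "right": "galaxy"}}

print(classify_with_tree(tree, (8.0, 0.5)))   # -> galaxy
```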
Axis-parallel decision trees are usually much faster in the construction (training) phase than neural network methods, and they also tend to be faster during the application phase. Their disadvantage is that they are not as flexible as either neural networks or nearest-neighbor methods at modeling complex distributions in parameter space. In fact, even simple shapes can cause these methods difficulties. For example, consider a simple 2-parameter, 2-class distribution of points with parameters x, y that are all of type 1 when x>y and of type 2 when x<y (Fig. 2). To classify these objects with an axis-parallel tree, it is necessary to approximate the straight diagonal line that separates the classes with a series of stair-steps. If the density of points is high, very many steps may be required. Consequently, axis-parallel trees tend to be rather elaborate, with many nodes, for realistic problems.
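The stair-step behaviour is easy to reproduce. The sketch below, which assumes scikit-learn and NumPy are available (not the OC1 software discussed later), fits an axis-parallel tree to points drawn from the x>y / x<y problem and reports how many nodes it needs to approximate a boundary that a single oblique split would capture exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Simulated version of the two-class problem of Fig. 2: type 1 when x > y,
# type 2 when x < y, with points drawn uniformly in the unit square.
rng = np.random.default_rng(0)
X = rng.random((2000, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

# Fit an axis-parallel decision tree to the training points.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The diagonal boundary must be approximated by many axis-parallel
# stair-steps, so the tree ends up with a large number of nodes even
# though one oblique split (x - y > 0) would separate the classes exactly.
print("number of nodes:", tree.tree_.node_count)
```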
Figure: Sample classification problem with only two features. Solid line: oblique decision tree. Dashed line: axis-parallel decision tree. The true dividing line for the simulated data is the square's diagonal. The oblique decision tree is both simpler and more accurate than the axis-parallel tree.
Oblique decision trees attempt to overcome the disadvantage of axis-parallel trees by allowing the hyperplanes at each node of the tree to have any orientation in parameter space [HKS93, MKS94]. Mathematically, this means that at each node a linear combination of some or all of the parameters is computed (using a set of feature weights specific to that node) and the sum is compared with a constant. The subsequent branching until a leaf node is reached is just like that used for axis-parallel trees.
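In code, the only change from the axis-parallel case is the node test itself: each internal node stores a weight vector and compares the weighted sum of the features with a constant. The sketch below illustrates this test with hypothetical weights; it is not the splitting procedure used by OC1.

```python
def oblique_node_test(weights, threshold, features):
    """Return True (take the right branch) when the linear combination of
    the features exceeds the node's constant, False for the left branch."""
    return sum(w * f for w, f in zip(weights, features)) > threshold

# A single oblique node with weights (1, -1) and threshold 0 implements the
# split x - y > 0, which captures the diagonal boundary of Fig. 2 exactly.
print(oblique_node_test((1.0, -1.0), 0.0, (0.7, 0.4)))   # -> True  (type 1)
print(oblique_node_test((1.0, -1.0), 0.0, (0.2, 0.6)))   # -> False (type 2)
```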
Oblique decision trees are considerably more difficult to construct than axis-parallel trees because there are so many more possible planes to consider at each tree node. As a result, the training process is slower; however, oblique trees are still usually much faster to construct than neural networks. They have one major advantage over all the other methods: they often produce very simple structures that use only a few parameters to classify the objects. By examining an oblique decision tree, it is straightforward to determine which parameters were most important in helping to classify the objects and which were not used.
Most of our work has concentrated on oblique decision trees using the freely available software OC1 [MKS94], though we have used all the methods mentioned above. We find that oblique decision trees represent a good compromise between the demands of computational efficiency, classification accuracy, and analytical value of the results.
Noise is common in astronomical data. In the presence of noise, classification becomes a statistical estimation problem. Objects should not be classified as either stars or galaxies, but should be assigned probabilities of being in one or the other class.
All of the above methods can be modified to give a probabilistic estimate of class rather than a simple yes-or-no answer. Neural nets can have an output parameter that represents a probability. It is also possible to provide the noise in the parameters to a neural network (or to the other methods, for that matter) as additional object features. It is hard to know what the neural network actually does with this information, though: this approach asks the classifier to discover on its own that a particular parameter measures the noise on another parameter. Why should the classifier be burdened with deducing this when we already know what the parameters mean?
The nearest-neighbor method can be generalized to use the K nearest neighbors to an object, which then vote on the object's class. If the vote is not unanimous, it provides an estimate of the uncertainty of the object's majority classification. The same approach can be applied to decision trees: if the search algorithm used to construct the tree has a randomized component (which is typical for optimization searches in high-dimensional spaces and is in fact the case for OC1), one can construct a number of different decision trees and let them vote on an object's class.
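A sketch of the K-nearest-neighbor vote, again with hypothetical, pre-scaled data: the fraction of the K neighbors in each class serves as a rough probability estimate.

```python
import math
from collections import Counter

def knn_class_probabilities(training_set, features, k=3):
    """Let the k nearest training objects vote; return the fraction of the
    vote received by each class as a rough probability estimate."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(training_set,
                       key=lambda item: distance(item[0], features))[:k]
    votes = Counter(label for _, label in neighbors)
    return {label: count / k for label, count in votes.items()}

# Hypothetical, already-scaled training set.
training_set = [((0.1, 0.9), "star"), ((0.2, 0.8), "star"),
                ((0.6, 0.4), "galaxy"), ((0.8, 0.3), "galaxy"),
                ((0.9, 0.1), "galaxy")]
print(knn_class_probabilities(training_set, (0.4, 0.6), k=3))
# -> approximately {'star': 0.67, 'galaxy': 0.33}
```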
For decision trees, it is also possible to use known noise estimates for the parameters to determine, at each node, the probability that an object belongs on the left branch or the right branch. These probabilities, multiplied down to the leaf nodes and summed over all the leaves belonging to a particular class, give the probability that the object belongs to that class. We are just beginning to explore this approach, which has great intuitive appeal and appears to be a promising avenue for further work.
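A sketch of this idea for an axis-parallel tree, assuming Gaussian measurement errors on the tested feature: at each node the probability of taking the right branch is the probability that the true value exceeds the threshold, the branch probabilities are multiplied down to the leaves, and the totals are summed per class. The tree structure and numbers are hypothetical, and this is not the exact scheme implemented in OC1.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def class_probabilities(node, features, sigmas, prob=1.0, totals=None):
    """Propagate the branch probabilities down the tree and accumulate the
    probability mass reaching the leaves of each class."""
    if totals is None:
        totals = {}
    if not isinstance(node, dict):           # leaf: add this path's probability
        totals[node] = totals.get(node, 0.0) + prob
        return totals
    i, t = node["feature"], node["threshold"]
    # Probability that the true value exceeds the threshold, given the
    # measured value features[i] with Gaussian error sigmas[i].
    p_right = normal_cdf((features[i] - t) / sigmas[i])
    class_probabilities(node["right"], features, sigmas, prob * p_right, totals)
    class_probabilities(node["left"], features, sigmas, prob * (1.0 - p_right), totals)
    return totals

# Hypothetical tree and per-feature measurement errors.
tree = {"feature": 0, "threshold": 5.0,
        "left": "star",
        "right": {"feature": 1, "threshold": 0.3,
                  "left": "star", "right": "galaxy"}}
print(class_probabilities(tree, features=(5.5, 0.35), sigmas=(1.0, 0.1)))
# -> probabilities for "star" and "galaxy" that sum to 1
```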