The combination of ever-larger detector formats, powerful computers capable of processing vast amounts of data, and on-line access to large databases has made an ocean of data available to astronomers. Because individual databases are large and there are many of them, automated techniques must be used to search a single database, or several databases combined, for particular classes of interesting or rare objects. Moreover, automated methods are required to construct the databases themselves from the raw data collected at observatories.
In this paper I focus mainly on the problem of star-galaxy separation. For our purposes, the main difference between stars and galaxies is that stars are completely unresolved and so have sharp images, while galaxies look fuzzy. Stars are not intrinsically fuzzy, but images taken through telescopes have finite resolution, so stellar images still have non-zero sizes. Figure 1 shows a small section from a digitized photographic plate with galaxies marked. Distinguishing stars from galaxies becomes difficult when both are faint and the galaxies are small; it can also be difficult for very bright objects, because bright stars can saturate the detector and no longer look compact. All galaxies are centrally concentrated, and a significant fraction of galaxies have activity near their centers and so appear to have a stellar core surrounded by faint fuzz. Such objects are especially difficult to classify correctly.
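As a rough illustration of the sharp-versus-fuzzy distinction, the following Python sketch compares a simple concentration index for a simulated star and a simulated small galaxy. The Gaussian profiles, the widths, the aperture radii, and the function names are all assumptions made for this illustration; they are not the features developed in this paper.

```python
import numpy as np

def gaussian_image(half_width, sigma):
    """Circular Gaussian on a (2*half_width+1)^2 pixel grid, normalized to unit flux."""
    y, x = np.mgrid[-half_width:half_width + 1, -half_width:half_width + 1]
    img = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return img / img.sum()

def concentration(img, r_inner=2.0, r_outer=6.0):
    """Fraction of the flux inside r_outer that falls inside r_inner (radii in pixels)."""
    half_width = img.shape[0] // 2
    y, x = np.mgrid[-half_width:half_width + 1, -half_width:half_width + 1]
    r = np.hypot(x, y)
    return img[r <= r_inner].sum() / img[r <= r_outer].sum()

PSF_SIGMA = 1.0      # assumed width of the telescope resolution, in pixels
GALAXY_SIGMA = 2.5   # assumed intrinsic width of a small galaxy, in pixels

# A star is unresolved, so its image is just the resolution profile.
star = gaussian_image(15, PSF_SIGMA)
# A Gaussian galaxy blurred by a Gaussian resolution profile is again Gaussian,
# with the two widths added in quadrature.
galaxy = gaussian_image(15, np.hypot(PSF_SIGMA, GALAXY_SIGMA))

c_star = concentration(star)
c_galaxy = concentration(galaxy)
print(f"star: {c_star:.2f}  galaxy: {c_galaxy:.2f}")
```

In this noise-free toy case the star is clearly more concentrated than the galaxy. For faint objects, noise in the pixel values blurs the separation, and a galaxy whose intrinsic width is small compared with the resolution becomes nearly indistinguishable from a star.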
Figure 1: 530 × 300 pixel section of a digitized Palomar Sky Survey II plate. Pixels are one arcsecond. The contrast has been enhanced to make visible both faint objects and the noise in the sky. Objects brighter than V=20.5 from Postman's deep CCD images are marked (squares are stars, circles are galaxies). Our goal is to develop a classification algorithm that can distinguish the stars from the galaxies. There are more than 3 million image patches this size across the entire sky.
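The patch count quoted in the caption can be checked with a short calculation. This sketch assumes the patch is 530 by 300 pixels at one arcsecond per pixel, as the caption states, and that the patches tile the whole sphere without overlap; the variable names are chosen for this illustration only.

```python
import math

ARCSEC_PER_DEGREE = 3600.0

# Area of the full sphere: 4*pi steradians converted to square degrees.
sky_sq_deg = 4.0 * math.pi * (180.0 / math.pi) ** 2
sky_sq_arcsec = sky_sq_deg * ARCSEC_PER_DEGREE ** 2

# Patch area, assuming 530 x 300 pixels at one arcsecond per pixel.
patch_sq_arcsec = 530 * 300 * 1.0 ** 2

n_patches = sky_sq_arcsec / patch_sq_arcsec
print(f"full sky: {sky_sq_deg:.0f} square degrees")
print(f"patches needed: {n_patches:.2e}")
```

The result is a little over 3 million non-overlapping patches, consistent with the figure in the caption.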
Two examples will suffice to demonstrate the scope of the problem. The Sloan Digital Sky Survey, scheduled to begin in 1997, will survey 10,000 square degrees (about 1/4 of the sky) in 5 colors using an array of 30 CCD detectors [GK92]. The terabytes of raw pixel values generated by this survey will be processed to produce a catalog of objects with positions, brightness in 5 colors, and classifications as stars or galaxies. The classification will be based both on the objects' morphological parameters (shape, central concentration, etc.) and on their colors. Accurate classification of this vast number of objects down to the detection limit of the survey is a daunting task that will require new classification methods.
The Space Telescope Science Institute constructed a catalog of guide stars for pointing the Hubble Space Telescope [LSM 90]. The original catalog contained objects down to a brightness of 15th magnitude and was constructed from digitized photographic plates taken in a single color at the Palomar and UK Schmidt telescopes. A second-generation catalog is now under construction [LMJ 95]; it will have nearly one billion objects and will be based on digitized photographic data taken in a variety of colors and at several different epochs, so it will include both color information and the proper motions of objects in the catalog. This project also needs an accurate, automatic star-galaxy classifier.
Although these projects may sound similar, they present rather different classification problems: the guide star catalogs are built from photographic plates, whose emulsions respond non-linearly to light, whereas the Sloan survey uses CCD detectors.
The next section (§ 2) briefly describes some of the common methods used for classification, with some attention to ways of handling noise in the data. The steps required to develop a classifier for a particular problem are then discussed (§ 3). A new method of constructing more robust features by using ranks is described (§ 4), with examples from the Guide Star Catalog II star/galaxy discrimination problem. The concluding section (§ 5) discusses promising areas for future work.