Description
"Anomaly detection" covers a broad range of problems and settings. In some instances, it is framed as finding "rare objects", i.e. objects lying in a low-density region of the feature space. However, this task quickly becomes difficult, particularly for high-dimensional, noisy or complex (non-rectangular) data, where reliable density estimation is non-trivial.
Additionally, not all low-density points are interesting anomalies, and not all interesting anomalies lie in low-density regions. In many cases, we are interested in specific types of anomalies: points that diverge along specific axes, either from our expectations (if we have reliable models) or from otherwise similar data points (e.g. those in the same region of the feature space). In other words, we might be looking for objects that seem normal in every other regard but are "weird" in some direction, given their other features. This, however, requires a model that tells us what is "normal" conditioned on the context features. If we do not have such a model (or if it is computationally expensive), we can use supervised Machine Learning methods to perform the search in a data-driven way, without requiring labelled examples of anomalies.
We present a contextual anomaly detection pipeline for mid-infrared (mid-IR) excess in FGK stars. Mid-IR excess is a tracer of events such as planetary collisions and Extreme Debris Disks, which appear to be relatively rare. To identify outlier candidates, we combine the prediction errors from a set of Random Forests with statistics computed from the prediction errors of similar, neighbouring points. Our pipeline bypasses the need for accurate stellar modelling while providing the high detection sensitivity that is crucial in the mid-IR. This allows us to scale our search to an unprecedented data set of 4.9 million stars, in which we identify 53 mid-IR excess candidates.
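
The general idea of combining a regressor's prediction errors with the errors of neighbouring points can be illustrated with a minimal sketch. This is not the pipeline presented in the talk: the function name `contextual_anomaly_scores`, the synthetic data, the single forest (rather than a set of them), the use of out-of-fold predictions, and the choice of a median/MAD statistic over the `n_neighbors` nearest points in context space are all assumptions made for illustration, using scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import NearestNeighbors


def contextual_anomaly_scores(context, target, n_neighbors=20, random_state=0):
    """Score each point by its prediction error relative to the errors of similar points.

    Hypothetical helper: large positive scores mean the target is higher than
    expected given the context features (an "excess").
    """
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    # Out-of-fold predictions: no point is predicted by a forest trained on it.
    predicted = cross_val_predict(model, context, target, cv=5)
    residual = target - predicted

    # Nearest neighbours in context space. The first neighbour returned is
    # normally the query point itself, so it is dropped.
    finder = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(context)
    _, neighbour_index = finder.kneighbors(context)
    neighbour_residual = residual[neighbour_index[:, 1:]]

    # Robust local location and scale of the neighbours' prediction errors.
    local_median = np.median(neighbour_residual, axis=1)
    local_mad = np.median(np.abs(neighbour_residual - local_median[:, None]), axis=1)
    local_scale = 1.4826 * local_mad + 1e-12  # MAD rescaled to a Gaussian sigma

    return (residual - local_median) / local_scale


# Synthetic stand-in data (assumption): two context features and one target
# that depends smoothly on them, plus a few injected "excess" points.
rng = np.random.default_rng(0)
n_points = 2000
context = rng.uniform(0.0, 1.0, size=(n_points, 2))
target = np.sin(3.0 * context[:, 0]) + context[:, 1] ** 2
target += rng.normal(0.0, 0.02, n_points)
injected = np.arange(5)
target[injected] += 1.0

scores = contextual_anomaly_scores(context, target)
candidates = np.argsort(scores)[::-1][:10]  # ten highest-scoring points
print(candidates)
```

Normalising each error by the spread of its neighbours' errors means that a point is only flagged when it stands out among similar objects, so regions where the forest is systematically less accurate do not dominate the candidate list. The signed score is kept because only an excess, not a deficit, is of interest here.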