The purpose of the ilodids project is to build and maintain an incremental (evolutionary) and local (adaptable) anomaly detection mechanism based on the work by Pokrajac et al. (2007). It is developed as part of the HELICOPTER project.

Current features are:

  • anomaly detection, that is, to analyze a stream of timestamped multidimensional data records and, when queried, report back which of these data records are anomalous. Anomalies are discovered by analyzing the data records with respect to a normal model based on prior data.
  • incremental anomaly detection, that is, the normal model is maintained while anomaly detection is taking place. It can be viewed as if the normal model evolves with the situation.
  • local anomaly detection, that is, the normal model can adapt to the specific situation.
  • operating system independence due to the use of Java
  • the underlying technique is proven correct (Pokrajac et al. 2007)
  • configurable, but rudimentary, visual debugging (to be improved)

For more information concerning anomaly detection, the interested reader can consult the excellent surveys by Chandola et al. (2009a), Chandola et al. (2012) and Chandola et al. (2009b). Briefly, the technique is based on clustering, and there are claims of O(n log n) algorithmic time complexity if proper indexing is employed. This is still work in progress.

References

Chandola, V, Banerjee, A & Kumar, V 2009a, ‘Anomaly detection: A survey’, ACM Comput. Surv., vol. 41, no. 3, pp. 1–58.

Chandola, V, Cheboli, D & Kumar, V 2009b, Detecting Anomalies in a Time Series Database, Technical report, Department of Computer Science and Engineering, University of Minnesota, Minneapolis.

Chandola, V, Banerjee, A & Kumar, V 2012, ‘Anomaly Detection for Discrete Sequences: A Survey’, IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 5, pp. 823–839.

Pokrajac, D, Lazarevic, A & Latecki, LJ 2007, ‘Incremental Local Outlier Detection for Data Streams’, Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), pp. 504–515.

To download and install the package, download the latest Java archive (jar file) and put it in your favorite location for Java archives (e.g., a directory, a server etc.). Then ensure that the Java virtual machine can reach the Java archive. Carefully read the instructions for the environment you are using. For example, in Eclipse you are discouraged from using the CLASSPATH environment variable to specify where Java archive files are located.
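For instance, when running from the command line, the archive can be placed on the classpath explicitly. The archive name ilodids.jar and the class name MyAnalysis below are hypothetical placeholders (on Windows, use ';' instead of ':' as the path separator):

javac -cp ilodids.jar MyAnalysis.java
java -cp ilodids.jar:. MyAnalysis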

For more information, please consider, for example, the following web sites for placing the jar file in the proper location:

  • Documentation (by Oracle) on how the Java virtual machine locates Java archive files.
  • Documentation (at eclipse.org) on how to specify this in Eclipse.

In the following sections you can find a brief tutorial for setting up and using ilodids.

Overview

There are five main issues to consider (a combined sketch follows the list):

  1. To set up data streams for analysis. A data stream is the main class in ilodids that performs the actual analysis of data records with respect to the normal model.
  2. To preprocess data in such a way that it is possible to apply anomaly detection (e.g., values should be normalized to the range 0.0 to 1.0).
  3. To insert data into the data stream.
  4. To check for anomalies (i.e., outliers).
  5. To perform maintenance tasks by removing outdated data records.
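The following minimal sketch shows how the five steps fit together, assuming the classes and variables introduced in the tutorial sections below:

// combined sketch of the five steps (each is detailed in the sections below)
DataStream dataStream=new DataStream(10,dataRecordType);            // 1. set up a data stream
EuclidDouble normalized=new EuclidDoorFrequency(rawDoorFrequency);  // 2. preprocess/normalize
dataStream.insertDataRecord(mdp);                                   // 3. insert a data record
Set<MultiDimensionalPointView> outliers=dataStream.outlierSet();    // 4. check for outliers
dataStream.removeDataRecord(expiredSet);                            // 5. remove outdated records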

Setting up a data stream

To set up a data stream, create a multidimensional data type and use it to create the data stream object. The following is an example of setting up a data stream based on a door passage sensor and time of day. The '10' is a configurable value that is used to determine which data records belong to the same cluster. The larger the value, the longer the analysis takes; the lower the value, the less resilient the anomaly detection is to noise.

// define a data record type with one target and one contextual parameter
DataRecordType dataRecordType=new MultiDimensionalPointType();
dataRecordType.addParameter(new Parameter("Door passage",EuclidDouble.class));
dataRecordType.addParameter(new Parameter("Time of day",EuclidDate.class));
// '10' determines which data records belong to the same cluster
DataStream dataStream=new DataStream(10,dataRecordType);

It is recommended that you only check one target variable with respect to one or more contextual variables. The major reason is that clustering techniques have problems with high dimensionality, that is, there are too many parameters to compare and contrast in the analysis. Further, it is hard to determine the cause of an anomaly if more parameters are mixed. It is better to create a data stream object for each analysis. For example, if it is desirable to analyze the door passage frequency with respect to time of day, time of week and both combined, then this might be what you need:

DataRecordType doorDateType=new MultiDimensionalPointType();
doorDateType.addParameter(new Parameter("Door passage",EuclidDouble.class));
doorDateType.addParameter(new Parameter("Time of day",EuclidDate.class));

DataStream doorDateStream=new DataStream(10,doorDateType);

DataRecordType doorWeekType=new MultiDimensionalPointType();
doorWeekType.addParameter(new Parameter("Door passage",EuclidDouble.class));
doorWeekType.addParameter(new Parameter("Time of week",EuclidWeek.class));
DataStream doorWeekStream=new DataStream(10,doorWeekType);

DataRecordType doorWeekDateType=new MultiDimensionalPointType();
doorWeekDateType.addParameter(new Parameter("Door passage",EuclidDouble.class));
doorWeekDateType.addParameter(new Parameter("Time of week",EuclidWeek.class));
doorWeekDateType.addParameter(new Parameter("Time of day",EuclidDate.class));
DataStream doorWeekDateStream=new DataStream(10,doorWeekDateType);


Preprocessing data

To feed data to the anomaly detection, it is typically necessary to preprocess it, for example, to compute the frequency of events over a time period, the trend over a time period (e.g., using linear regression) or a usage check. Further, to avoid problems with different scales, it is recommended to normalize the data to a floating point scale from 0.0 to 1.0. The latter part is supported by the Euclid class hierarchy that is part of the ilodids package. That is, it is possible either to use existing classes (EuclidDate, EuclidWeek) or to define new classes for the particular data in question.
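As an illustration of the first kind of preprocessing, the following minimal sketch computes the average frequency over the most recent time window. The method name averageDoorFrequencyOverTimeWindow matches the hypothetical helper used in the insertion example later in this tutorial; the sampling scheme of the log is an assumption:

// Hypothetical helper: average the values in the most recent time window.
// Assumes partialLog holds one frequency sample per sampling interval.
static double averageDoorFrequencyOverTimeWindow(double[] partialLog) {
  if (partialLog.length==0) {
    return 0.0;
  }
  double sum=0.0;
  for (double v: partialLog) {
    sum+=v;
  }
  return sum/partialLog.length;
}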

For example, to add a new class for the door frequency, we can consider how frequently the door can be passed. Let us assume that 1 person per 5 seconds can move through the door, that is, 12 persons per minute and 720 persons per hour. If we preprocess by computing the frequency in a moving time window over the last hour, then we can normalize by dividing by 720. A door frequency class can look like the following:

class EuclidDoorFrequency extends EuclidDouble {
  public EuclidDoorFrequency(double value) {
    super(value); // assumes EuclidDouble provides a matching constructor
  }
  // normalize to the 0.0-1.0 scale: at most 720 passages per hour
  public double normalize(double value) {
    return value/720.0;
  }
}
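Assuming the constructor above, a raw count from the last hour can then be wrapped like this:

// e.g., 360 passages during the last hour; normalize(360.0) yields 0.5
EuclidDoorFrequency frequency=new EuclidDoorFrequency(360.0);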

Inserting data into the data stream

To insert data into the data stream and perform processing, the insertDataRecord methods are used. The method is overloaded and accepts either a single data record or a set of data records. In the following example, the single data record interface is employed.

double[] doorFrequencyPartialLog; // the last time window of values (e.g., last 8 hours)

Date ts=...; // e.g., the current time

// create a vector with values; the ordering must match the parameter order of the type
Vector<EuclidDouble> parVec=new Vector<EuclidDouble>();
parVec.add(new EuclidDoorFrequency(averageDoorFrequencyOverTimeWindow(doorFrequencyPartialLog)));
parVec.add(new EuclidWeek(ts));
parVec.add(new EuclidDate(ts));

// create the data record object of the given type and at the time defined by the timestamp
MultiDimensionalPoint mdp=new MultiDimensionalPoint(doorWeekDateType,ts,parVec);

// insert the data record into the stream; anomaly detection and incremental/local
// processing take place here
doorWeekDateStream.insertDataRecord(mdp);
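The set-based overload mentioned above can be used to insert a batch of data records in one call; here the batch contains just the single record from the example:

// insert a batch of data records using the set-based overload
Set<MultiDimensionalPoint> batch=new HashSet<MultiDimensionalPoint>();
batch.add(mdp);
doorWeekDateStream.insertDataRecord(batch);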


Checking for outliers

To check for outliers, it is only necessary to query the data stream object like this:

Set<MultiDimensionalPointView> outlierSet=dataStream.outlierSet();

for (MultiDimensionalPointView outlierMDPV: outlierSet) {
  final EuclidDouble dfSensorValue=outlierMDPV.getDataRecord().getParameter("Door passage");
  final Set<MultiDimensionalPointView> neighbours=outlierMDPV.getkNearestNeighbours();
  if (dfSensorValue>upperThreshold(neighbours)) {
    // report increase
  } else if (dfSensorValue<lowerThreshold(neighbours)) {
    // report decrease
  } else {
    // part of the context has changed?
  }
}

The upper and lower thresholds can be the upper and lower medians of the neighbours. An outlier's nearest neighbours may not have the outlier among their own nearest neighbours, and if, and only if, it is considered an outlier by the algorithm, then it does not have neighbours in a cluster of its own. Instead, the outlier's nearest neighbours belong to other clusters or are outliers themselves. Note that if a set of data records forms a separate (possibly sparsely distributed) cluster, then they are not outliers. So, it is safe to compare the target variable with the target variables of its nearest neighbours to figure out whether the anomaly is due to an increase, a decrease or a change in the context.
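A minimal sketch of threshold helpers in this spirit is shown below. It assumes the neighbours' target values have already been extracted into a plain double array (the exact accessor on EuclidDouble is not shown in this tutorial), and it reads 'upper and lower median' as the medians of the upper and lower halves of the sorted values:

// Hypothetical helper: median of the upper half of the sorted neighbour values
static double upperThreshold(double[] neighbourValues) {
  double[] sorted=neighbourValues.clone();
  java.util.Arrays.sort(sorted);
  return sorted[sorted.length-1-sorted.length/4];
}

// Hypothetical helper: median of the lower half of the sorted neighbour values
static double lowerThreshold(double[] neighbourValues) {
  double[] sorted=neighbourValues.clone();
  java.util.Arrays.sort(sorted);
  return sorted[sorted.length/4];
}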

Maintenance

To maintain the normal model so that it stays incremental and local, it is necessary to remove data records that have expired. Currently there is no configurable mechanism for this in the package, but one is planned. Essentially, the user of the package must currently remove data records manually.

Set<MultiDimensionalPoint> removalSet=new HashSet<MultiDimensionalPoint>();
removalSet.add(...); // add MultiDimensionalPoint objects that are too old
dataStream.removeDataRecord(removalSet);
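For example, the removal set can be built from records older than a cutoff. The collection allInsertedPoints and the accessor getTimestamp below are hypothetical; the package's actual way to enumerate stored records and read a record's timestamp may differ:

Date cutoff=...; // e.g., the oldest timestamp that should be kept

Set<MultiDimensionalPoint> removalSet=new HashSet<MultiDimensionalPoint>();
for (MultiDimensionalPoint mdp: allInsertedPoints) { // records retained by the caller
  if (mdp.getTimestamp().before(cutoff)) {           // hypothetical timestamp accessor
    removalSet.add(mdp);
  }
}
dataStream.removeDataRecord(removalSet);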


In terms of algorithmic time complexity, Pokrajac et al. (2007) claim that the algorithms can run in O(n log n) time if proper indexing is employed and the number of dimensions is kept low. The current implementation here does not use indexing and hence is closer to an O(n^2) algorithm (depending on the value of k). There is ongoing work to explore different kinds of indexing and compare them to the sources.