The purpose of the ilodids project is to build and maintain an incremental (evolutionary) and local (adaptive) anomaly detection mechanism based on the work by Pokrajac et al. (2007). It is developed as part of the HELICOPTER project.
Current features are:
For more information concerning anomaly detection, the interested reader can consult the excellent surveys by Chandola et al. (2009a), Chandola et al. (2012) and Chandola et al. (2009b). Briefly, the technique is based on clustering, and there are claims of O(n log n) algorithmic time complexity if proper indexing is employed. This is still a work in progress.
Chandola, V, Banerjee, A & Kumar, V 2009a, ‘Anomaly detection: A survey’, ACM Comput. Surv., vol. 41, no. 3, pp. 1–58.
To download and install the package, download the latest Java archive (jar file) and put it in your favorite location for Java archives (e.g., a directory, a server, etc.). Then ensure that the Java virtual machine can reach the Java archive. Carefully read the instructions for the environment you are using. For example, in Eclipse you are discouraged from using the CLASSPATH environment variable to specify where the Java archive files are located.
For more information on placing the jar file in the proper location, consider, for example, the following web sites:
Under the following tabs you can find a brief tutorial for setting up and using ilodids.
There are five main issues to consider:
To set up a data stream, create a multidimensional data type and use it to create the data stream object. The following is an example of setting up a data stream based on a door passage sensor and the time of day. The '10' is a configurable value that determines which data records belong to the same cluster. The larger the value, the longer the analysis takes; the lower the value, the less resilient the anomaly detection is to noise.
```java
DataRecordType dataRecordType=new MultiDimensionalPointType();
dataRecordType.addParameter(new Parameter("Door passage",EuclidDouble.class));
dataRecordType.addParameter(new Parameter("Time of day",EuclidDate.class));
DataStream dataStream=new DataStream(10,dataRecordType);
```
It is recommended that you check only one target variable with respect to one or more contextual variables. The major reason is that clustering techniques have problems with high dimensionality, that is, too many parameters to compare and contrast in the analysis. Further, it is hard to determine the cause of an anomaly if more parameters are mixed. It is better to create a data stream object for each analysis. For example, if it is desirable to analyze the door passage frequency with respect to the time of day, the time of week, and both, then this might be what you need:
```java
DataRecordType doorDateType=new MultiDimensionalPointType();
doorDateType.addParameter(new Parameter("Door passage",EuclidDouble.class));
doorDateType.addParameter(new Parameter("Time of day",EuclidDate.class));
DataStream doorDateStream=new DataStream(10,doorDateType);

DataRecordType doorWeekType=new MultiDimensionalPointType();
doorWeekType.addParameter(new Parameter("Door passage",EuclidDouble.class));
doorWeekType.addParameter(new Parameter("Time of week",EuclidWeek.class));
DataStream doorWeekStream=new DataStream(10,doorWeekType);

DataRecordType doorWeekDateType=new MultiDimensionalPointType();
doorWeekDateType.addParameter(new Parameter("Door passage",EuclidDouble.class));
doorWeekDateType.addParameter(new Parameter("Time of week",EuclidWeek.class));
doorWeekDateType.addParameter(new Parameter("Time of day",EuclidDate.class));
DataStream doorWeekDateStream=new DataStream(10,doorWeekDateType);
```
To feed data to the anomaly detection, it is typically necessary to preprocess it: for example, to compute the frequency of events over a time period, to compute the trend over a time period (e.g., using linear regression), or to check usage. Further, to avoid problems with different scales, it is recommended to normalize the data to a floating point scale from 0.0 to 1.0. The latter part is supported by the Euclid class hierarchy that is part of the ilodids package. That is, it is possible either to use existing classes (EuclidDate, EuclidWeek) or to define new classes for the particular data in question.
For example, to add a new class for the door frequency, we can consider how frequently the door can be passed. Let us assume that 1 person per 5 seconds can move through the door, which is 12 persons per minute and 720 persons per hour. If we preprocess by computing the frequency in a moving time window over the last hour, then we can normalize by dividing by 720. A door frequency class can look like the following:
```java
class EuclidDoorFrequency extends EuclidDouble {
    public double normalize(double value) {
        return value/720.0;
    }
}
```
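Besides frequencies, the preprocessing step above also mentions computing a trend with linear regression. A minimal sketch of a least-squares slope over a window of evenly spaced samples follows; the class and method names are assumptions for illustration, not part of the ilodids API:

```java
// Hypothetical preprocessing helper, not part of ilodids:
// least-squares slope of a window of evenly spaced samples.
final class TrendUtil {

    private TrendUtil() {}

    /**
     * Returns the least-squares slope of y against the sample index 0..n-1,
     * i.e. the change per sample interval. Requires at least two samples.
     */
    static double slope(double[] y) {
        int n = y.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (double v : y) {
            meanY += v;
        }
        meanY /= n;
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (y[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }
}
```

A positive slope over, say, the last hour of door-frequency samples indicates an upward trend that can be fed to the detector as a (normalized) target variable.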
To insert data into the data stream and perform processing, the insertDataRecord methods are used. The interface is overloaded and accepts either a single data record or a set of data records. In the following example, the single data record interface is employed.
```java
double[] doorFrequencyPartialLog; // the last time window of values (e.g., last 8 hours)
Date timestamp=...; // e.g., set to the current time

// create a vector with values; ordering is important here
Vector<EuclidDouble> parVec=new Vector<EuclidDouble>();
parVec.add(new EuclidDoorFrequency(averageDoorFrequencyOverTimeWindow(doorFrequencyPartialLog)));
parVec.add(new EuclidDate(timestamp));
parVec.add(new EuclidWeek(timestamp));
```
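The helper averageDoorFrequencyOverTimeWindow used above is not part of ilodids; it is something the application supplies. A minimal sketch, assuming the partial log holds per-interval passage counts:

```java
// Hypothetical application-side helper for the example above,
// not part of the ilodids API.
final class DoorFrequencyUtil {

    private DoorFrequencyUtil() {}

    /**
     * Averages the per-interval door-passage counts in the time window.
     * Returns 0.0 for an empty or missing window.
     */
    static double averageDoorFrequencyOverTimeWindow(double[] partialLog) {
        if (partialLog == null || partialLog.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double count : partialLog) {
            sum += count;
        }
        return sum / partialLog.length;
    }
}
```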
To check for outliers, it is only necessary to query the data stream object for outliers like this:
```java
Set<MultiDimensionalPointView> outlierSet=dataStream.outlierSet();
for (MultiDimensionalPointView outlierMDPV: outlierSet) {
    final EuclidDouble dfSensorValue=outlierMDPV.getDataRecord().getParameter("Door Frequency");
    final Set<MultiDimensionalPointView> neighbours=outlierMDPV.getkNearestNeighbours();
    if (dfSensorValue>upperThreshold(neighbours)) {
        // report increase
    } else if (dfSensorValue<lowerThreshold(neighbours)) {
        // report decrease
    } else {
        // part of the context has changed?
    }
}
```
The upper and lower thresholds can be the upper and lower medians of the neighbours. An outlier's nearest neighbours may not have the outlier as a nearest neighbour, and if, and only if, it is considered an outlier by the algorithm, it does not have neighbours in a cluster. Instead, the outlier's nearest neighbours belong to other clusters or are outliers themselves. Note that if a set of data records forms a separate (possibly sparsely distributed) cluster, then they are not outliers. So, it is safe to compare the target variable with the target variables of its nearest neighbours to figure out whether the anomaly is due to an increase, a decrease, or a change in the context.
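The threshold functions in the outlier-checking example are also application-supplied, not part of ilodids. A minimal sketch that takes the medians of the upper and lower halves of the neighbours' target values; the names, and the assumption that the caller has already extracted the neighbours' target values into an array, are illustrative only:

```java
import java.util.Arrays;

// Hypothetical threshold helpers; ilodids does not ship these.
// The caller extracts the neighbours' target values into a double[].
final class NeighbourThresholds {

    private NeighbourThresholds() {}

    /** Median of the upper half of the sorted neighbour values. */
    static double upperThreshold(double[] neighbourValues) {
        double[] sorted = neighbourValues.clone();
        Arrays.sort(sorted);
        return median(Arrays.copyOfRange(sorted, sorted.length / 2, sorted.length));
    }

    /** Median of the lower half of the sorted neighbour values. */
    static double lowerThreshold(double[] neighbourValues) {
        double[] sorted = neighbourValues.clone();
        Arrays.sort(sorted);
        return median(Arrays.copyOfRange(sorted, 0, (sorted.length + 1) / 2));
    }

    // Median of an already sorted, non-empty array.
    private static double median(double[] sorted) {
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

A target value above upperThreshold then signals an increase, one below lowerThreshold a decrease, and anything in between suggests the context changed instead.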
To maintain the normal model so that it is incremental and local, it is necessary to remove data records that have expired. Currently there is no configurable mechanism for this in the package, but one is planned. Essentially, the user of the package must currently remove data records manually.
```java
Set<MultiDimensionalPoint> removalSet=new HashSet<MultiDimensionalPoint>();
removalSet.add(...); // add MultiDimensionalPoint objects that are too old
dataStream.removeDataRecord(removalSet);
```
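Since expiry is not built in yet, one way to decide which records are too old is to remember the insertion time of each record on the application side. A generic bookkeeping sketch (this class is an assumption for illustration, not part of the ilodids API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical bookkeeping class, not part of ilodids: remembers when
// each record was inserted so that expired records can be collected.
final class ExpiryTracker<T> {

    private final Map<T, Long> insertionTimes = new HashMap<>();

    /** Call this whenever a record is inserted into the data stream. */
    void recordInsertion(T record, long timestampMillis) {
        insertionTimes.put(record, timestampMillis);
    }

    /** Removes and returns all records inserted at or before the cutoff. */
    Set<T> collectExpired(long cutoffMillis) {
        Set<T> expired = new HashSet<>();
        for (Map.Entry<T, Long> entry : insertionTimes.entrySet()) {
            if (entry.getValue() <= cutoffMillis) {
                expired.add(entry.getKey());
            }
        }
        insertionTimes.keySet().removeAll(expired);
        return expired;
    }
}
```

The set returned by collectExpired can then serve as the removal set that is passed to the data stream, keeping the normal model limited to recent data.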