Space-efficient Tracking of Persistent Items in Massive Data Streams (Fall 2009-Fall 2010, with
Srikanta Tirthapura and Jaideep Chandrashekar, formerly with
Intel Labs Berkeley):
Motivated by applications in network anomaly detection, we consider
the problem of detecting persistent items in a data stream, which
are items that occur regularly in the stream. In contrast
with heavy hitters, persistent items do not necessarily contribute
significantly to the volume of a stream, and may escape
detection by traditional volume-based anomaly detectors ([1], [2], [3]).
At Intel Berkeley, Giroire et al. [4] monitored traffic from a host to detect communication across botnet channels. They observed that very persistent destinations were likely to belong to one of two classes: either they were malicious hosts associated with a botnet, or they were frequently visited benign hosts. It was also observed that the latter set of hosts could be identified easily and assembled into a whitelist of known good destinations. They found that tracking persistent items in the network stream, followed by filtering out items contained in the whitelist, resulted in reliable identification of botnet traffic.
Persistent items are often associated with specific anomalies in the context of network streams: periodic connections to a stock trading website is an indicator of ClickFraud, repeated (failed) connections observed in the stream is indicative of a failed or unreachable web service; botnets periodically "phone home" to their bot controllers. Although our motivation behind tracking persistent items was to detect anomalies in network streams, the general problem of detecting persistent items can have broad applicability in several other areas, and an interesting algorithmic problem in its own merit.
We first show that any online algorithm that tracks persistent items exactly must necessarily use a large workspace, and is infeasible to run on a traffic monitoring node. Motivated by this lower bound, we introduce an approximate formulation of the problem and present a small-space algorithm to approximately track persistent items over a large data stream. Our experiments on a real traffic dataset shows that in typical cases, the algorithm achieves a space compression factor of 5x-7x, while incurring very few false positives (< 1%) and false negatives (< 4%). To our knowledge, this is the first systematic study of the problem of detecting persistent items in a data stream, and our work can help detect anomalies that are temporal, rather than volume based. The findings of this research got accepted for publication in ACM DEBS 2011.
At Intel Berkeley, Giroire et al. [4] monitored traffic from a host to detect communication across botnet channels. They observed that very persistent destinations were likely to belong to one of two classes: either they were malicious hosts associated with a botnet, or they were frequently visited benign hosts. It was also observed that the latter set of hosts could be identified easily and assembled into a whitelist of known good destinations. They found that tracking persistent items in the network stream, followed by filtering out items contained in the whitelist, resulted in reliable identification of botnet traffic.
Persistent items are often associated with specific anomalies in the context of network streams: periodic connections to a stock trading website is an indicator of ClickFraud, repeated (failed) connections observed in the stream is indicative of a failed or unreachable web service; botnets periodically "phone home" to their bot controllers. Although our motivation behind tracking persistent items was to detect anomalies in network streams, the general problem of detecting persistent items can have broad applicability in several other areas, and an interesting algorithmic problem in its own merit.
We first show that any online algorithm that tracks persistent items exactly must necessarily use a large workspace, and is infeasible to run on a traffic monitoring node. Motivated by this lower bound, we introduce an approximate formulation of the problem and present a small-space algorithm to approximately track persistent items over a large data stream. Our experiments on a real traffic dataset shows that in typical cases, the algorithm achieves a space compression factor of 5x-7x, while incurring very few false positives (< 1%) and false negatives (< 4%). To our knowledge, this is the first systematic study of the problem of detecting persistent items in a data stream, and our work can help detect anomalies that are temporal, rather than volume based. The findings of this research got accepted for publication in ACM DEBS 2011.
References
[1] "Approximate Frequency Counts over Data Streams", Gurmeet Singh Manku and Rajeev Motwani,
Proceedings of 28th International Conference on Very Large Data Bases (VLDB), 2002
[2] "New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice", ACM Transactions on Computer Systems, volume 21, number 3, 2003
[3] "An improved data stream summary: the count-min sketch and its applications", Graham Cormode and S. Muthukrishnan, Journal of Algorithms, volume 55, number 1, 2005
[4] "Exploiting Temporal Persistence to Detect Covert Botnet Channels", Frederic Giroire and Jaideep Chandrashekar and Nina Taft and Eve M. Schooler and Dina Papagiannaki, 12th International Symposium on Recent Advances in Intrusion Detection (RAID), 2009
[2] "New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice", ACM Transactions on Computer Systems, volume 21, number 3, 2003
[3] "An improved data stream summary: the count-min sketch and its applications", Graham Cormode and S. Muthukrishnan, Journal of Algorithms, volume 55, number 1, 2005
[4] "Exploiting Temporal Persistence to Detect Covert Botnet Channels", Frederic Giroire and Jaideep Chandrashekar and Nina Taft and Eve M. Schooler and Dina Papagiannaki, 12th International Symposium on Recent Advances in Intrusion Detection (RAID), 2009