|
| ||||
|
|
| ||||
| Network Monitoring | Detecting exploit patterns from network flow streams: Network-based Intrusion Detection Systems (NIDS) try to detect malicious activities by monitoring network
traffic. Research on network traffic measurement has identified
various patterns that the typical exploits on today's Internet
exhibit. The goal of our research is to devise single-pass (online)
data stream algorithms for detecting these patterns from network
traffic flow data, using a workspace that is much smaller than the
size of the traffic flow. We aim to design algorithms with a
provable guarantee on the space and time requirements and the degree
of approximation in the estimates returned. A detailed proposal of the work can be found here.
The following projects are all parts of this work:
|
||||
| Payload-based Anomaly Detection for Network Packet Streams (Spring 2009-current, with
Yang Liu,
Yong Guan and
Srikanta Tirthapura):
We address the problem of detecting anomalous behavior from payloads of network packet streams, using small-space
approximation algorithms. An N-gram is a sequence of N adjacent
bytes in a payload unit. Since there can be 256N different values of N-grams, any
anomaly detection strategy that is based on keeping track of the
exact frequencies of the N-grams, may need exponential
space. Our goal is to develop a small-space sketch that can "learn" the
normal traffic profile for a site/network from a training dataset,
and then to use the sketch for detecting anomalous payloads from
the test dataset. We are using random projection methods for buidling the sketches and non-adaptive combinatorial group testing
techniques for comparing them.
|
|||||
| Identifying Heavy-Hitters from Data Streams with Time-Decay (Summer 2009):
We propose small-space, one-pass approximation algorithms for
identifying heavy-hitters from a data stream, where the weight of an
item reduces with time according to some decay function. We consider
the exponential, the monomial and the sliding window decay functions
here. For exponential and monomial decay, we adopt the Misra-Gries
sketch, and our algorithms take O(cmax/&epsilon) space to
produce an &epsilon-approximate solution, where cmax is an
apriori upper bound on the initial weight any element in the stream
appears with. For small (O(1)) values of cmax, this is an
improvement by a factor of &Omega(log m/&epsilon) over
the space-complexity of the algorithm by Cormode et al. We
also extend the count-min sketch to identify the
&epsilon-approximate heavy-hitters over a sliding window on a
cash-register stream.
|
|||||
| A Generic Framework for Detecting Top-k Items from a Stream (Summer 2009): We propose a generic framework for identifying the k most frequent items from a network data stream in a single pass,
using a workspace much smaller than the stream itself. We observed
that any small-space sketch that approximately maintains the
frequencies of the items appearing in a stream can be utilized to
extract the top-k items. We show theoretically how the Misra-Gries
sketch and the count-min sketch can fit into this framework, analyze the space-complexities of both, and
present experimental results on packet traces from a backbone
network link. With the realistic assumption that the frequencies of
the items follow a Zipfian distribution, we fully present an
algorithm, based on the Misra-Gries sketch, that takes
only O(kO(1)/&epsilon) space and guarantees that the solution
provided is &epsilon-approximate. Under identical input
distribution, the space-requirement of the top-k algorithm
provided by Charikar et al is logarithmic in the
size of the stream, so our algorithm achieves significant
space-reduction; and does not demand any apriori knowledge from the
user about the size of the stream. Here are the C++-based simulations of a naive and our
small-space algorithm.
|
|||||
| Finding Correlated Heavy-Hitters over Data Streams (Fall 2008-Summer 2009, with
Srikanta Tirthapura): We consider online mining of correlated heavy-hitters (CHH) from
network data streams. Given a multi-dimensional dataset, a
correlated aggregate query first filters a subset by applying a
predicate along a primary dimension, and then computes
aggregates along some secondary dimension of that subset data.
Previous work has focused on computing the correlated sum (or count)
where the predicate for the primary dimension involved a basic
aggregate that can be maintained exactly in small space (e.g.,
minimum, maximum or average). To our knowledge, our work is the
first to address queries of the following form: "On a stream of
(x, y) tuples, maintain the y-values that occur frequently
alongwith the x-values that appear frequently in the stream".
Previous techniques for correlated aggregates cannot handle queries
of this form, since even for a one-dimensional stream, heavy-hitters
cannot be maintained exactly using small space.
Our online data stream algorithm is easy to implement and uses
workspace which is orders of magnitude smaller than the stream
itself. We present provable guarantees on the maximum error
estimates, as well as experimental results, that demonstrate the
space-accuracy trade-off, on more than one billion packet headers
from a backbone network link. The theoretical and experimental results are in this paper that got accepted in IPCCC 2009. Here are the C++-based simulations of a naive and our
small-space algorithm.
|
|||||
| Distributed Data Mining | Computing Frequent Elements using Gossip (Fall 2007-Spring 2009, with
Srikanta Tirthapura): Due to the large scale of P2P and sensor networks, the values of aggregate functions over the data in the whole network are often more important than individual data at nodes. For example, in a P2P file-sharing network (like Gnutella or Napster), an administrator may be typically interested in identifying the most frequently accessed items over the whole network. Any reliance on central coordination limits the system’s scalability, therefore gossip-based protocols are emerging as an important communication paradigm. In gossip-based protocols, each node contacts one or a few nodes in each round (usually chosen at random), and exchanges information with these nodes.
This mechanism leads to high fault-tolerance and self-stabilization. We designed efficient randomized distributed algorithms for identifying the frequently occurring data elements with probabilistic guarantees on accuracy,
where the nodes exchange small-sized "sketches" or synopses of their individual data sets through gossip algorithms. The theoretical results are in this paper that we published in SIROCCO 2008. A Java-based simulation of the algorithm can be downloaded from here.
| ||||
| Sensor Networks | Development of Distributed, Lightweight, Secure Hierarchical Directories for Tracking Mobile Objects in Sensor Networks (Fall 2006-Summer 2007, with
Srikanta Tirthapura and
Bojian Xu): ![]() Most practical techniques for locating remote objects in a
distributed system suffer from problems of scalability and locality of
reference. We are implementing the Arrow distributed directory protocol, a
scalable and local mechanism for ensuring mutually exclusive access to
mobile objects. This directory has communication complexity optimal
within a factor of (1+MST-stretch(G))/2, where MST-stretch(G) is the
minimum spanning tree stretch of the underlying network. In Arrow protocol, local change
in the object’s position does not result in a global change in the network. This has been
deployed on a fixed spanning-tree-based network of MICA2 motes, where the presence of the object is detected
by measuring the amount of ambient light using MTS300 photo sensors, and the node detecting the object
transmits this information to its neighbours so that the pointers are updated appropriately.
| ||||
| Business Applications |
Trailblazer (Acuity) (January 2006-July 2006): This web-based application developed for McGraw-Hill Digital Learning enhanced the functionality of an existing application named TrailBlazer with creating test, exercise and assignments for students; scoring the students' response and generating predictive, diagnostic and summary reports. The system also monitors the structure of state curriculum and tracks the progress of a student in class. Roles played:
Roles played:
| ||||
|
|
| ||||
![]() |
Bibudh Lahiri Department of Electrical & Computer Engineering Iowa State University 3125 Coover Hall, Ames, Iowa | ||||