Abstract:

A growing number of applications that generate massive streams of data need
intelligent data processing and online analysis. Real-time surveillance
systems, telecommunication systems, sensor networks and other dynamic
environments are examples. The pressing need to turn such unprocessed data into
useful information and knowledge drives the development of systems, algorithms
and frameworks that address data streaming challenges. The storage, querying
and mining of such data sets are computationally demanding tasks. Mining data
streams is concerned with extracting knowledge structures, represented as
models and patterns, from non-stopping streams of information. Two main
challenges are designing fast mining methods for data streams and promptly
detecting changing concepts and data distributions, a need that follows from
the highly dynamic nature of data streams. The goal of this article is to
analyze and classify the application of diverse data mining techniques to the
different challenges of data stream mining. In this paper, we present the
theoretical foundations of data stream analysis and propose an analytical
framework for data stream mining techniques.

 

Keywords:

Data Stream, Data Stream Mining, Stream Preprocessing.

 

1. Introduction

Traditional data mining techniques are suited to simple, structured data sets
such as relational databases, transactional databases and data warehouses. The
fast and continuous development of advanced database systems, data collection
technologies and the World Wide Web makes data grow rapidly in varied and
complex forms, such as semi-structured and unstructured data, spatial and
temporal data, and hypertext and multimedia data. Mining such complex data has
therefore become an important task in the data mining realm. In recent years,
different approaches have been proposed to overcome the challenges of storing
and processing fast, continuous streams of data.

A data stream can be conceived as a continuous, changing sequence of data that
arrives at a system to be stored or processed. Imagine a satellite-mounted
remote sensor that is constantly generating data. The data are massive (e.g.,
terabytes in volume), temporally ordered, fast changing and potentially
infinite. These features pose challenging problems in the data stream field.
Traditional OLAP and data mining methods typically require multiple scans of
the data and are therefore infeasible for stream data applications. Because
data streams are produced in many fields, it is crucial to adapt mining
techniques to fit them.

Data stream mining has many applications and is an active research area. With
recent progress in hardware and software technologies, measurements can be
taken continuously in many fields, even for data that change at a high rate.
Common applications that require mining large amounts of data to find new
patterns include sensor networks, the storage and search of web events, and
computer network traffic. These patterns are valuable for decision making. Data
stream mining refers to the extraction of informational structures, as models
and patterns, from continuous data streams. Data streams pose challenges in
many respects, including computation, storage, querying and mining.

According to recent research, the requirements of data streams make it
necessary to design new techniques to replace the old ones. Traditional methods
require the data to be stored first and then processed off-line using complex
algorithms that make several passes over the data. A data stream, however, is
potentially infinite and is generated at a high rate, so it is impossible to
store it in full. Two main research challenges follow. The first is designing
fast and light mining methods for data streams, for example, algorithms that
require only one pass over the data and work with limited memory. The second
arises from the highly dynamic nature of data streams: stream mining algorithms
need to promptly detect changing concepts and data distributions and adapt to
them.
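As a rough illustration of the second challenge, the sketch below makes one
pass over a stream, keeps only two bounded windows in memory, and raises a flag
when the mean of the newest items moves away from the mean of the items seen
just before them. The class name MeanShiftDetector, the window size and the
threshold are all assumptions chosen for illustration; this is a deliberately
simple heuristic, not a method taken from the literature reviewed here.

```python
from collections import deque


class MeanShiftDetector:
    """Flags a possible change when the recent mean moves away from an older mean.

    Hypothetical helper for illustration; window and threshold are assumptions.
    """

    def __init__(self, window=50, threshold=1.0):
        self.window = window
        self.threshold = threshold
        self.reference = deque(maxlen=window)  # items just older than `recent`
        self.recent = deque(maxlen=window)     # newest items

    def update(self, x):
        """Consume one item; return True if a shift in the mean is suspected."""
        if len(self.recent) == self.window:
            # The oldest recent item moves into the reference window.
            self.reference.append(self.recent[0])
        self.recent.append(x)
        if len(self.reference) < self.window:
            return False  # not enough history yet
        ref_mean = sum(self.reference) / self.window
        rec_mean = sum(self.recent) / self.window
        if abs(rec_mean - ref_mean) > self.threshold:
            # Adapt: forget the old concept and rebuild the reference.
            self.reference.clear()
            return True
        return False
```

Memory use is bounded by twice the window size regardless of stream length.
Because the reference window is rebuilt from items that straddle the change,
this simple version can raise a second flag shortly after the first.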

 

2. Data stream mining

High-volume and potentially infinite data streams are generated by many
sources, such as real-time surveillance systems, communication networks,
Internet traffic, on-line transactions in the financial market or retail
industry, electric power grids, industrial production processes, scientific and
engineering experiments, remote sensors and other dynamic environments. In the
data stream model, data items can be relational tuples such as network
measurements and call records. In contrast to traditional data sets, a data
stream flows continuously through a system at a varying update rate. Data
streams are continuous, temporally ordered, fast changing, massive and
potentially infinite. Because of the huge volume and high storage cost, it is
impossible to store an entire data stream or to scan through it multiple times.
This creates many challenges for the storage, computation and communication
capabilities of computing systems. Because of the high volume and speed of the
input data, semi-automatic interactive techniques are needed to extract
embedded knowledge from the data. Data stream mining is the extraction of
knowledge structures, represented as models and patterns, from infinite streams
of information.

To extract knowledge or patterns from data streams, it is crucial to develop
methods that analyze and process streams of data in a multidimensional,
multi-level, single-pass and online manner. These methods should not be limited
to data streams, because they are also needed for large volumes of stored data.
Moreover, because of the constraints of data streams, the proposed methods are
based on statistics, computation theory and complexity theory. For example,
summarization techniques derived from statistics can be used to cope with
memory limitations. In addition, some techniques from computation theory can be
used to implement time- and space-efficient algorithms. With these techniques,
common data mining approaches can also be applied to data streams after some
changes.

The solutions proposed for data stream mining problems and challenges fall into
two groups. Data-based techniques either summarize the whole dataset or choose
a subset of the incoming stream to be analyzed. Sampling, load shedding and
sketching choose a subset of the stream; synopsis data structures and
aggregation summarize it. Task-based techniques are methods that modify
existing techniques or invent new ones in order to address the computational
challenges of data stream processing. Approximation algorithms, sliding window
and algorithm output granularity represent this category.

Sampling is the process of choosing probabilistically whether a data item is
processed or not. The problem with sampling in the context of data stream
analysis is that the dataset size is unknown, so the treatment of a data stream
needs a special analysis to find the error bounds. Another problem arises in
applications such as surveillance analysis, where it is important to check for
anomalies: a sample may miss them, so sampling may not be the right choice for
such an application. Sampling also does not address the problem of fluctuating
data rates. It would be worth investigating the relationship among three
parameters: data rate, sampling rate and error bounds.

Load shedding is the process of dropping a sequence of data items from the
stream. It has been used successfully in querying data streams, and it has the
same problems as sampling. Load shedding is difficult to use with mining
algorithms because it drops chunks of the stream that could be used in
structuring the generated models or that might represent a pattern of interest
in time series analysis.
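One standard way to sample when the dataset size is unknown is reservoir
sampling, which keeps a fixed-size uniform sample without knowing the stream
length in advance. The sketch below is a minimal version of that technique; the
function name reservoir_sample and the optional rng parameter are assumptions
for illustration, and the text above does not prescribe this particular
sampling scheme.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    If the stream holds fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item     # item i is kept with probability k / (i + 1)
    return reservoir
```

A call such as reservoir_sample(sensor_readings, 100) would keep 100 items in
memory however long the stream runs (sensor_readings is a hypothetical
iterable). Note that a uniform sample still says nothing about rare anomalies,
which is the limitation raised above.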
Sketching is the process of randomly projecting a subset of the features; in
effect, it samples the incoming stream vertically. Sketching has been applied
in comparing different data streams and in aggregate queries. Its major
drawback is accuracy, which makes it hard to use in the context of data stream
mining.

Creating a synopsis of data refers to applying summarization techniques that
are capable of summarizing the incoming stream for further analysis. Wavelet
analysis, histograms, quantiles and frequency moments have been proposed as
synopsis data structures. Since a synopsis does not represent all the
characteristics of the dataset, such data structures produce approximate
answers.

Aggregation is the process of representing the input stream in a summarized
form. The aggregate data can then be used in data mining algorithms. The main
problem of this method is that highly fluctuating data distributions reduce its
efficiency.

Approximation algorithms have their roots in algorithm design, which is
concerned with designing algorithms for computationally hard problems. These
algorithms produce an approximate solution with error bounds. The idea is that
mining data streams can be considered a computationally hard problem, given the
continuity and speed of the stream and the resource-constrained environment
that generates it.
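To make the idea of an approximate solution with error bounds concrete, the
sketch below uses the Misra-Gries frequent-items algorithm, chosen here purely
as an example. It counts items in one pass with at most k - 1 counters, and
each reported count is below the true count by at most n / k, where n is the
number of items seen. The function name misra_gries and the parameter k are
assumptions for illustration.

```python
def misra_gries(stream, k):
    """Approximate item counts using at most k - 1 counters.

    Each returned count underestimates the true count by at most n / k,
    where n is the number of items seen. Items occurring more than n / k
    times are guaranteed to appear in the result.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The parameter k trades memory for accuracy: a larger k uses more counters and
tightens the n / k error bound.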

 

 

 

Preprocessing techniques for data stream mining:

·  Data-based solutions
   1. Sampling
   2. Load shedding
   3. Sketching
   4. Synopsis data structures
   5. Aggregation

·  Task-based solutions
   1. Approximation algorithms
   2. Sliding window
   3. Algorithm output granularity

 

Approximation algorithms have attracted researchers as a direct solution to
data stream mining problems. However, approximation algorithms alone cannot
solve the problem of data rates relative to the available resources. Other
tools should be used along with these algorithms in order to adapt to the
available resources. Approximation algorithms have been used in.

The inspiration behind the sliding window is that the user is more concerned
with the analysis of the most recent data. Thus, detailed analysis is done over
the most recent data items and summarized versions of the old ones.
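The sliding window idea can be sketched as follows: the newest items are kept
in full, while each item that leaves the window is folded into a small running
summary. The class name SlidingWindow, its methods, and the choice of count and
sum as the summary of old data are assumptions made for illustration.

```python
from collections import deque


class SlidingWindow:
    """Keeps the newest `size` items in full and only a running summary of older ones."""

    def __init__(self, size):
        self.size = size
        self.recent = deque()   # detailed view of the newest items
        self.old_count = 0      # summary of expired items
        self.old_sum = 0.0

    def add(self, x):
        """Add one item; fold the oldest item into the summary if the window is full."""
        self.recent.append(x)
        if len(self.recent) > self.size:
            expired = self.recent.popleft()
            self.old_count += 1
            self.old_sum += expired

    def recent_mean(self):
        return sum(self.recent) / len(self.recent) if self.recent else None

    def old_mean(self):
        return self.old_sum / self.old_count if self.old_count else None
```

Any mining step that needs detail can read the recent items directly, while
older data contributes only through the summary, so memory stays proportional
to the window size.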

 

3. Classification of data stream challenges

Data stream mining poses several challenges that raise many research issues in
this field. Because of the requirements of data streams, developing stream
mining algorithms needs more study than traditional mining methods. We can
classify stream mining challenges into five categories: irregular and
time-varying data arrival rates; quality of mining results; bounded memory size
against a huge volume of data; limited resources, e.g., memory space and
computation power; and facilitating data analysis and quick decision making for
users. Each of them is described below.

One of the most important issues in data stream mining is optimizing the memory
space consumed by the mining algorithm. Memory management is a main challenge
in stream processing because many real data streams have an irregular arrival
rate that varies over time. In many applications, such as sensor networks,
stream mining algorithms with a high memory cost are not applicable. Therefore,
it is necessary to develop summarizing techniques for collecting valuable
information from data streams.
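As a minimal illustration of a summarizing technique with a fixed memory cost,
the sketch below maintains a running mean and variance using Welford's method,
storing three numbers however many items arrive. The class name RunningStats
and its methods are hypothetical, and the technique is our choice of example
rather than one the discussion above prescribes.

```python
class RunningStats:
    """One-pass mean and variance in constant memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new item into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance; None until at least two items have arrived."""
        return self.m2 / (self.n - 1) if self.n > 1 else None
```

A device with little memory, such as a sensor node, could call update once per
reading and report only the summary.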

Data pre-processing is an important and time-consuming phase in the knowledge
discovery process and must be taken into consideration when mining data
streams. Designing a light-weight preprocessing technique that can guarantee
the quality of the mining results is crucial. The challenge here is to automate
such a process and integrate it with the mining techniques.

Given the bounded size of memory and the huge amount of data that continuously
arrives at the system, a compact data structure is needed to store, update and
retrieve the collected information. Without such a data structure, the
efficiency of the mining algorithm decreases substantially. Even if the
information is stored on disk, the additional I/O operations increase the
processing time. Since it is impossible to rescan the entire input, incremental
maintenance of the data structure is indispensable. Furthermore, novel
indexing, storage and querying techniques are required to manage the continuous
and changing flow of data streams.

It is crucial to consider limited resources, such as memory space and
computation power, in order to reach accurate estimates in data stream mining.
If stream mining algorithms consume the available resources without any
consideration, the accuracy of their results decreases dramatically. Several
papers discuss this issue and propose solutions for resource-aware mining.

Visualization is a powerful way to facilitate data analysis. The absence of
suitable tools for visualizing mining results causes many problems for data
analysis and quick decision making by the user. This challenge is still an open
research issue; one of the proposed approaches is intelligent monitoring.

 

4. The proposed analytical framework

This research results in an analytical framework. The framework tries to show
the effectiveness of data mining applications in developing novel data stream
mining algorithms. These algorithms are classified based on data mining tasks.
We described the details of these algorithms in terms of the preprocessing
steps and the steps that follow. In addition, this framework can direct future
work in this field. Some of the most important results reached during this
research are:

(1) Mining data streams has raised a number of research challenges for the data
mining community. Due to resource and time constraints, many summarization and
approximation techniques have been adopted from the fields of statistics and
computational theory.

(2) There are many open issues that need to be addressed. The development of
systems that fully address these issues is crucial for accelerating scientific
discovery in the fields of physics and astronomy, as well as in business and
financial applications.

 

5. Conclusion

In this paper we reviewed and analyzed data mining applications for solving
data stream mining challenges. First, we presented a comprehensive
classification of data stream mining algorithms based on data mining
applications. In this classification, we separate algorithms with preprocessing
from those without preprocessing, and we classify preprocessing techniques
separately. The layered architecture of the classification represents almost
all of the challenges mentioned in the literature. We then discussed the
application of data mining techniques to the challenges of data stream mining.
The results show that it is necessary to adopt many summarization and
approximation techniques from the fields of statistics and computational
theory, in addition to crucial changes to common data mining techniques.
Despite the research done so far on the application of data mining to data
stream mining, there are still wide areas for further research.