Big Data & Information Analytics
July 2016, Volume 1, Issue 2&3
2016, 1(2&3): 139-161
doi: 10.3934/bdia.2016001
Abstract:
We are now in the era of big data. The high dimensionality of data poses a major challenge to processing them effectively and efficiently. Fortunately, in practice data are not unstructured: their samples usually lie around low-dimensional manifolds and are highly correlated. Such characteristics can be effectively captured by low rankness. As an extension of sparsity for first-order data, such as voices, low rankness is also an effective measure of sparsity for second-order data, such as images. In this paper, I review the representative theories, algorithms and applications of low-rank subspace recovery models in data processing.
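As a rough illustration of why low rankness is useful, the sketch below recovers data lying near a low-dimensional subspace with a truncated SVD. This is only the simplest low-rank tool, not one of the recovery models reviewed in the paper. The data, the function name `low_rank_approx` and the noise level are assumptions made for the example.

```python
import numpy as np

def low_rank_approx(M, r):
    """Best rank-r approximation of M in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep only the r largest singular values and their singular vectors.
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples in 50 dimensions lying in a 3-dimensional subspace.
L = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 50))
# Small dense noise pushes the samples slightly off the subspace.
noisy = L + 0.01 * rng.standard_normal((100, 50))

recovered = low_rank_approx(noisy, 3)
rel_err = np.linalg.norm(recovered - L) / np.linalg.norm(L)
print(f"relative recovery error: {rel_err:.4f}")
```

Truncated SVD handles small dense noise; it does not handle gross sparse corruptions or missing entries, which need other models.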
2016, 1(2&3): 163-169
doi: 10.3934/bdia.2016002
Abstract:
Big Data and Big Graphs have become landmarks of current cross-border research, destined to remain so for a long time. While we try to optimize our ability to assimilate both, novel methods continue to inspire new applications, and vice versa. Clearly these two big things, data and graphs, are connected, but can we ensure management of their complexities, computational efficiency, and robust inference? Critical bridging features are addressed here to identify grand challenges and bottlenecks.
2016, 1(2&3): 171-183
doi: 10.3934/bdia.2016003
Abstract:
Urban air pollution poses a great threat to human health and has been a major concern of many metropolises in developing countries. Lately, a few air quality monitoring stations have been established to inform the public of real-time air quality indices based on fine particulate matter, e.g. $PM_{2.5}$, in countries suffering from air pollution. Air quality, unfortunately, is fairly difficult to manage because of multiple complex human activities, from driving to smelting. We observe that the hidden regular patterns of human activities make prediction possible, and this motivates us to infer urban air conditions from the perspective of time series. In this paper, we focus on $PM_{2.5}$-based urban air quality and introduce two kinds of time-series methods for real-time and fine-grained air quality prediction, harnessing historical air quality data reported by existing monitoring stations. The methods are evaluated on real-life $PM_{2.5}$ concentration data for the year 2013 (January - December) in Wuhan, China.
2016, 1(2&3): 185-216
doi: 10.3934/bdia.2016004
Abstract:
With the advent of the Internet of Things (IoT) and cloud computing, the need for data stores that would be able to store and process big data in an efficient and cost-effective manner has increased dramatically. Traditional data stores seem to have numerous limitations in addressing such requirements. NoSQL data stores have been designed and implemented to address the shortcomings of relational databases by compromising on ACID and transactional properties to achieve high scalability and availability. These systems are designed to scale to thousands or millions of users performing updates, as well as reads, in contrast to traditional RDBMSs and data warehouses. Although there is a plethora of potential NoSQL implementations, there is no one-size-fits-all solution to satisfy even main requirements. In this paper, we explore popular and commonly used NoSQL technologies and elaborate on their documentation, existing literature and performance evaluation. More specifically, we will describe the background, characteristics, classification, data model and evaluation of NoSQL solutions that aim to provide the capabilities for big data analytics. This work is intended to help users, individuals or organizations, to obtain a clear view of the strengths and weaknesses of well-known NoSQL data stores and select the right technology for their applications and use cases. To do so, we first present a systematic approach to narrow down the proper NoSQL candidates and then adopt an experimental methodology that can be repeated by anyone to find the best among short listed candidates considering their specific requirements.
2016, 1(2&3): 217-225
doi: 10.3934/bdia.2016005
Abstract:
Given a data set with one categorical response variable and multiple categorical or continuous explanatory variables, some applications require the continuous explanatory variables to be discretized. A proper supervised discretization usually achieves a better result than an unsupervised one. Rather than discretizing each variable individually, as recently proposed by Huang, Pan and Wu in [12,13], we suggest a forward supervised discretization algorithm to capture a higher association from the multiple explanatory variables to the response variable. Experiments with the GK-tau and the GK-lambda are presented to support the statement.
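For orientation, the two association measures named above can be computed from a contingency table as sketched below, using the standard Goodman-Kruskal definitions (rows are the explanatory variable, columns are the response). The helper `best_cut` is a hypothetical single-variable illustration of supervised discretization and is not the forward algorithm proposed in the paper. The sketch assumes every row of the table is non-empty and the response takes at least two values.

```python
import numpy as np

def gk_lambda(table):
    """Goodman-Kruskal lambda: gain in correctly guessing the column class given the row."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    best_without = p.sum(axis=0).max()   # guess the modal response class
    best_with = p.max(axis=1).sum()      # guess the modal class within each row
    return (best_with - best_without) / (1.0 - best_without)

def gk_tau(table):
    """Goodman-Kruskal tau: proportional reduction in Gini variation of the response."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    row = p.sum(axis=1)
    col = p.sum(axis=0)
    with_x = (p ** 2 / row[:, None]).sum()
    without_x = (col ** 2).sum()
    return (with_x - without_x) / (1.0 - without_x)

def best_cut(x, y, measure=gk_tau):
    """Hypothetical helper: binary cut point of x that maximizes the association with y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    values = np.unique(x)
    best = (None, -np.inf)
    # Candidate cuts are midpoints between consecutive distinct values of x.
    for lo, hi in zip(values[:-1], values[1:]):
        cut = (lo + hi) / 2.0
        below = x <= cut
        table = [[np.sum(below & (y == c)) for c in classes],
                 [np.sum(~below & (y == c)) for c in classes]]
        score = measure(table)
        if score > best[1]:
            best = (cut, score)
    return best

cut, score = best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(cut, score)
```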
2016, 1(2&3): 227-245
doi: 10.3934/bdia.2016006
Abstract:
Coalition attacks are nowadays one of the most common types of attack in the online advertising industry. In this paper, we attempt to mitigate the fraud problem by proposing a hybrid framework that detects coalition attacks based on multiple metrics. We also articulate the theoretical basis for integrating these metrics into the hybrid framework. Furthermore, we instantiate the framework with two metrics and develop a detection system that identifies coalition attacks from two distinct perspectives.
2016, 1(2&3): 247-259
doi: 10.3934/bdia.2016007
Abstract:
Most existing clustering algorithms are slow at dividing a large dataset into a large number of clusters. In this paper, we propose a truncated FCM algorithm to address this problem. The main idea behind our proposed algorithm is to keep only a small number of cluster centers during the iterative process of the FCM algorithm. Our numerical experiments on both synthetic and real datasets show that the proposed algorithm is much faster than the original FCM algorithm, with comparable accuracy.
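One way to read the truncation idea is sketched below: a fuzzy c-means loop in which each point keeps memberships only for its `keep` nearest centers. This is an interpretation for illustration, not the authors' algorithm. The function name, the parameters `keep` and `init`, and the toy data are assumptions. For clarity the sketch still computes all point-to-center distances, so it shows the truncated memberships but not the speed-up reported in the paper.

```python
import numpy as np

def truncated_fcm(X, k, keep, m=2.0, iters=50, seed=0, init=None):
    """Fuzzy c-means in which each point is a member of only its `keep` nearest centers."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if init is not None:
        centers = np.array(init, dtype=float)
    else:
        centers = X[rng.choice(n, size=k, replace=False)].copy()
    rows = np.arange(n)[:, None]
    for _ in range(iters):
        # Squared distances from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # avoid division by zero at a center
        # Indices of the `keep` nearest centers for each point (unordered).
        nearest = np.argpartition(d2, keep - 1, axis=1)[:, :keep]
        # Standard FCM membership formula, restricted to the kept centers.
        w = d2[rows, nearest] ** (-1.0 / (m - 1.0))
        u = w / w.sum(axis=1, keepdims=True)
        um = u ** m
        # Weighted-mean center update using only the kept memberships.
        num = np.zeros_like(centers)
        den = np.zeros(k)
        for t in range(keep):
            np.add.at(num, nearest[:, t], um[:, t, None] * X)
            np.add.at(den, nearest[:, t], um[:, t])
        filled = den > 0  # centers that no point kept stay where they are
        centers[filled] = num[filled] / den[filled, None]
    return centers, nearest, u

# Toy data: three tight groups; the initial centers are rough guesses.
X = [[0, 0], [0.1, 0], [0, 0.1],
     [10, 0], [10.1, 0], [10, 0.1],
     [0, 10], [0.1, 10], [0, 10.1]]
centers, nearest, u = truncated_fcm(X, k=3, keep=2, init=[[1, 1], [9, 1], [1, 9]])
print(np.round(centers, 2))
```

With `keep` equal to `k` the loop reduces to ordinary FCM, and with `keep=1` it behaves like k-means.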
2016, 1(2&3): 261-274
doi: 10.3934/bdia.2016008
Abstract:
News recommender systems efficiently handle the overwhelming number of news articles, simplify navigation, and retrieve relevant information. Many conventional news recommender systems use collaborative filtering to make recommendations based on the behavior of users in the system. In this approach, the introduction of new users or new items can cause the cold start problem, as there will be insufficient data on these new entries for the collaborative filtering to draw any inferences for new users or items. Content-based news recommender systems emerged to address the cold start problem. However, many content-based news recommender systems treat documents as a bag of words, neglecting the hidden themes of the news articles. In this paper, we propose a news recommender system leveraging topic models and the time spent on each article. We build an automated recommender system that is able to filter news articles and make recommendations based on users' preferences. We use topic models to identify the thematic structure of the corpus. These themes are incorporated into a content-based recommender system to filter out news articles that contain themes of less interest to users and to recommend articles that are thematically similar to users' preferences. Our experimental studies show that utilizing topic modeling and the time spent on a single article can outperform state-of-the-art recommendation techniques. The resulting recommender system based on the proposed method is currently operational at The Globe and Mail (http://www.theglobeandmail.com/).
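A minimal sketch of how topic distributions and reading time might be combined in a content-based recommender is given below. It assumes each article already has a topic vector from some topic model, and it uses a time-weighted average profile with cosine similarity. These choices, along with the function names and the toy numbers, are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def user_profile(topic_vectors, seconds_spent):
    """Average the topic vectors of read articles, weighted by reading time."""
    T = np.asarray(topic_vectors, dtype=float)
    w = np.asarray(seconds_spent, dtype=float)
    return (w[:, None] * T).sum(axis=0) / w.sum()

def recommend(profile, candidates, top_n=1):
    """Rank candidate articles by cosine similarity between their topics and the profile."""
    C = np.asarray(candidates, dtype=float)
    sims = C @ profile / (np.linalg.norm(C, axis=1) * np.linalg.norm(profile))
    order = np.argsort(-sims)[:top_n]
    return order.tolist(), sims

# Hypothetical topic vectors over three topics for two articles the user read.
read = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
seconds = [300, 10]  # the user lingered on the first article only
profile = user_profile(read, seconds)

candidates = [[0.0, 0.1, 0.9], [0.8, 0.2, 0.0], [0.1, 0.8, 0.1]]
ranking, sims = recommend(profile, candidates, top_n=2)
print(ranking)
```

Because the profile depends only on article content and the user's own reading history, a new article can be scored as soon as its topic vector is available.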
2016, 1(2&3): 275-276
doi: 10.3934/bdia.2016009
Abstract:
This note introduces the research and development capacity of Manifold Data Mining Inc. (Manifold), a data mining leader in Canada, and its collaboration with the academic community.