Utility-oriented pattern mining has become an emerging topic since it can reveal high-utility patterns from different types of data. In realistic situations, the utilities or profits of different items are not equal; each item has its own utility or importance. In general, high-utility sequential pattern mining algorithms use a uniform minimum utility (minutil) threshold to identify the set of high-utility sequential patterns (HUSPs). This is an unfair measurement that fails to find interesting patterns when minutil is set extremely high or low. We first design a new utility mining framework across multi-sequences, named mining high-utility sequential patterns with individualized thresholds (USPT), for mining the useful set of HUSPs. In this framework, each item has its own specified minimum utility threshold. A baseline algorithm, built on a lexicographic-sequential (LS)-tree and the utility-linked (UL)-list structure, is presented to mine HUSPs efficiently. Based on the introduced upper bounds on utility, three pruning strategies, named Maximal Utility of Extension (MUE), Look Forward (LF), and Prune of Irrelevant Item (PII), are developed to prune unpromising candidates early in the search space. The results show that the designed approaches achieve good effectiveness and efficiency for mining HUSPs with the developed structures and strategies.
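As a rough illustration of the individualized-threshold idea, the sketch below checks whether a sequential pattern is high-utility when each item carries its own minimum threshold. All names are ours, and the convention of comparing a pattern's total utility against the smallest threshold among its items is an assumption for illustration, not the paper's exact USPT definition:

```python
# Illustrative sketch: HUSP check with per-item (individualized) thresholds.
# Names and the min-threshold convention are assumptions, not the paper's definitions.
from functools import lru_cache

def max_match_utility(pattern, seq):
    """Maximum utility of `pattern` embedded as a subsequence of `seq`,
    where `seq` is a tuple of (item, utility) pairs; None if no embedding."""
    @lru_cache(maxsize=None)
    def f(i, j):
        if i == len(pattern):
            return 0                       # whole pattern matched
        if j == len(seq):
            return None                    # ran out of sequence
        best = f(i, j + 1)                 # option 1: skip seq[j]
        if seq[j][0] == pattern[i]:        # option 2: match seq[j]
            tail = f(i + 1, j + 1)
            if tail is not None:
                cand = seq[j][1] + tail
                best = cand if best is None else max(best, cand)
        return best
    return f(0, 0)

def is_husp(pattern, database, item_threshold):
    """Pattern is high-utility if its total utility across all sequences
    meets the smallest per-item threshold among its items."""
    total = sum(u for s in database
                if (u := max_match_utility(pattern, s)) is not None)
    return total >= min(item_threshold[i] for i in pattern)
```

For example, with per-item thresholds `{'a': 10, 'c': 20}`, the pattern `('a', 'c')` only needs total utility 10 rather than a uniform global minutil.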
With the proliferation of location-based services, the quantity of location data collected by service providers is enormous. If these datasets could be published, they would be valuable assets to various sectors. However, two major concerns considerably limit the availability and usage of these trajectory datasets. The first is the threat to individual privacy. The other is the ability to analyze exabytes of location data in a timely manner. Although trajectory anonymization approaches have been proposed in the past to mitigate privacy concerns, none of these prior works addresses the scalability issue, a newly emerging problem brought about by the significantly increasing adoption of location-based services. In this paper, we propose a novel parallel trajectory anonymization algorithm that achieves scalability, strong privacy protection, and a high utility rate for the anonymized trajectory datasets. We have conducted extensive experiments using MapReduce on real and synthetic datasets, and our results demonstrate both effectiveness and efficiency when compared with centralized approaches.
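To give a flavor of how trajectory anonymization can be parallelized, the toy sketch below mimics a single MapReduce reducer: trajectories in a partition are grouped into batches of at least k and each batch is replaced by its point-wise centroid, so every output trajectory is indistinguishable from at least k-1 others. This is a generic k-anonymity-by-generalization sketch under strong simplifying assumptions (equal-length trajectories, naive batching), not the paper's actual algorithm:

```python
# Toy reducer-side k-anonymization of trajectories (illustrative only; the
# paper's partitioning, clustering, and generalization are more involved).

def anonymize_group(group, k):
    """Replace each batch of >= k equal-length trajectories with its
    point-wise centroid, so batch members become indistinguishable."""
    if len(group) < k:
        return []  # suppress groups too small to hide in
    batches = [group[i:i + k] for i in range(0, len(group), k)]
    if len(batches[-1]) < k:
        batches[-2].extend(batches.pop())  # fold the short remainder in
    out = []
    for batch in batches:
        centroid = tuple(
            (sum(t[j][0] for t in batch) / len(batch),
             sum(t[j][1] for t in batch) / len(batch))
            for j in range(len(batch[0])))
        out.extend([centroid] * len(batch))
    return out
```

In a MapReduce setting, the map phase would route trajectories to partitions (e.g., by start region) and each reducer would run a routine like this independently, which is where the scalability comes from.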
The increasing popularity of social media has attracted a huge number of people to participate in numerous activities on a daily basis, resulting in tremendous amounts of rich user-generated data. Publishing user-generated data risks exposing individuals' privacy. User privacy in social media is an emerging research area and has attracted increasing attention in recent years. These works study privacy issues in social media from two different points of view: identification of vulnerabilities, and mitigation of privacy risks. Recent research has shown the vulnerability of user-generated data to two general types of attacks, identity disclosure and attribute disclosure. These privacy issues mandate that social media data publishers protect users' privacy by sanitizing user-generated data before publishing it. There is a vast literature on user privacy in social media from many perspectives. In this survey, we review the key achievements in user privacy in social media. In particular, we review and compare the state-of-the-art algorithms in terms of privacy leakage attacks and anonymization algorithms. We overview the privacy risks from different aspects of social media and categorize the relevant works into five groups. We also discuss open problems and future research directions for user privacy issues in social media.
The detection of vague, speculative, or otherwise uncertain language has been performed in the encyclopedic, political, and scientific domains, yet has been left relatively untouched in finance. However, the latter benefits from public sources of big financial data which can be linked with extracted measures of linguistic uncertainty. Doing so helps in understanding how the linguistic uncertainty of financial disclosures induces financial uncertainty in the market. As a starting point for our experiments, we use information retrieval (IR) term weighting methods to detect linguistic uncertainty in a large dataset of financial disclosures. Apart from deploying an existing dictionary of financial uncertainty triggers, we automatically retrieve related terms from specialized word embedding models to expand the dictionary. In a set of event study regressions, we show that the enriched dictionary explains a significantly larger share of future volatility, a common financial uncertainty measure, than before. Furthermore, we show that, unlike the plain dictionary, our embedding models are well suited to explaining future analyst forecast uncertainty. Notably, we show that enriching the dictionary with industry-specific vocabulary significantly improves experimental results compared to an industry-agnostic expansion.
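The dictionary expansion step described above can be sketched as a nearest-neighbor lookup in embedding space: for each seed trigger term, the most cosine-similar vocabulary terms are added to the dictionary. The function names, the toy vectors, and the similarity cutoff below are our own illustrative assumptions:

```python
# Minimal sketch of embedding-based dictionary expansion (toy vectors;
# a real setup would use trained word embedding models as in the paper).
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expand_dictionary(seeds, embeddings, top_n=2, min_sim=0.7):
    """Add each seed term's most similar vocabulary terms to the dictionary."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in embeddings:
            continue
        ranked = sorted(((cosine(embeddings[seed], vec), term)
                         for term, vec in embeddings.items()
                         if term not in expanded), reverse=True)
        expanded.update(term for sim, term in ranked[:top_n] if sim >= min_sim)
    return expanded
```

With an industry-specific embedding model plugged in for `embeddings`, the same routine yields the industry-specific expansion discussed above.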
Each year, around 6 million car accidents occur in the U.S. on average. Road safety features (e.g., concrete barriers, metal crash barriers, rumble strips) play an important role in preventing or mitigating vehicle crashes. Accurate maps of road safety features are an important component of safety management systems for federal and state transportation agencies, helping traffic engineers identify locations where investment in safety infrastructure is needed. In current practice, mapping road safety features is largely done manually (e.g., through observations on the road or visual interpretation of streetview imagery), which is both expensive and time-consuming. In this paper, we propose a deep learning approach to automatically map road safety features from streetview imagery. Unlike existing convolutional neural networks (CNNs) that classify each image individually, we propose to further add a recurrent neural network (long short-term memory, LSTM) to capture the geographic context of images (the spatial autocorrelation effect along linear road network paths). Evaluations on real-world streetview imagery show that our proposed model outperforms several baseline methods.
Influence maximization, with applications to viral marketing, is a well-studied problem of finding a small set of the most influential individuals in a social network so as to maximize the spread of influence under certain influence cascade models. However, almost all previous studies have focused primarily on node-level mining, selecting nodes as the initial seeds to achieve the desired outcomes. In this paper, instead of targeting nodes, we study a new boosted influence maximization problem from the edge-level perspective, which aims to add an edge set that maximizes the final increase in the influence spread of a given seed set. We show that the problem is NP-hard and that the influence spread function is no longer submodular, which makes the problem more challenging. We therefore devise a restricted influence spread function that is close to the original one and is submodular, and propose a greedy algorithm to approximately solve the problem. However, owing to the poor computational efficiency of this algorithm, we further propose an improved greedy algorithm that integrates several effective optimization strategies to significantly speed up edge selection. Extensive experiments on publicly available real-world social networks of different sizes demonstrate the effectiveness and efficiency of the proposed methods.
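The greedy scheme described above can be sketched as follows. We substitute a deliberately simple "restricted" spread (seeds plus their direct out-neighbors, a coverage function and hence monotone submodular in the added edges) for the paper's model-specific restricted function; everything else (names, budget handling) is our own illustrative assumption:

```python
# Sketch of greedy edge addition for boosted influence maximization.
# The one-hop coverage below stands in for the paper's restricted spread.

def one_hop_spread(seeds, base_edges, added_edges):
    """Toy restricted spread: seed nodes plus their direct out-neighbors."""
    covered = set(seeds)
    for u, v in list(base_edges) + list(added_edges):
        if u in seeds:
            covered.add(v)
    return len(covered)

def greedy_boost(seeds, base_edges, candidates, budget):
    """Repeatedly add the candidate edge with the largest marginal gain."""
    chosen = []
    for _ in range(budget):
        current = one_hop_spread(seeds, base_edges, chosen)
        best_edge, best_gain = None, 0
        for e in candidates:
            if e in chosen:
                continue
            gain = one_hop_spread(seeds, base_edges, chosen + [e]) - current
            if gain > best_gain:
                best_edge, best_gain = e, gain
        if best_edge is None:
            break  # no candidate improves the spread
        chosen.append(best_edge)
    return chosen
```

Because the surrogate spread is monotone submodular, this greedy loop inherits the classic (1 - 1/e) approximation guarantee for the surrogate objective.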
With the development of the mobile Internet, many location-based services (LBS) such as Yelp, Gowalla, and FourSquare have accumulated large amounts of data that can be used for POI recommendation. However, developing a unified framework that incorporates the multiple factors associated with both POIs and users remains challenging because of the heterogeneity and implicitness of this information. To alleviate the problem, this paper first proposes a novel group-based method for POI recommendation that jointly considers the reviews, categories, and geographic positions of locations. We divide users into different groups and train individual recurrent neural networks for each group, which improves pertinence. Our proposed GTASR-RNN not only considers the effect of temporal and geographical contexts, but also captures users' opinions on locations by means of sentiment analysis. Experimental results show that GTASR-RNN achieves significant improvements over the compared methods on real datasets.