Hsinchun Chen
Publications
Abstract:
Analysis of affective intensities in computer-mediated communication is important in order to allow a better understanding of online users' emotions and preferences. Despite considerable research on textual affect classification, it is unclear which features and techniques are most effective. In this study, we compared several feature representations for affect analysis, including learned n-grams and various automatically and manually crafted affect lexicons. We also proposed the support vector regression correlation ensemble (SVRCE) method for enhanced classification of affect intensities. SVRCE uses an ensemble of classifiers each trained using a feature subset tailored toward classifying a single affect class. The ensemble is combined with affect correlation information to enable better prediction of emotive intensities. Experiments were conducted on four test beds encompassing web forums, blogs, and online stories. The results revealed that learned n-grams were more effective than lexicon-based affect representations. The findings also indicated that SVRCE outperformed comparison techniques, including Pace regression, semantic orientation, and WordNet models. Ablation testing showed that the improved performance of SVRCE was attributable to its use of feature ensembles as well as affect correlation information. A brief case study was conducted to illustrate the utility of the features and techniques for affect analysis of large archives of online discourse. © 2008 IEEE.
Abstract:
Online reputation systems are intended to facilitate the propagation of word of mouth as a credibility scoring mechanism for improved trust in electronic marketplaces. However, they experience two problems attributable to anonymity abuse - easy identity changes and reputation manipulation. In this study, we propose the use of stylometric analysis to help identify online traders based on the writing style traces inherent in their posted feedback comments. We incorporated a rich stylistic feature set and developed the Writeprint technique for detection of anonymous trader identities. The technique and extended feature set were evaluated on a test bed encompassing thousands of feedback comments posted by 200 eBay traders. Experiments conducted to assess the scalability (number of traders) and robustness (against intentional obfuscation) of the proposed approach found it to significantly outperform benchmark stylometric techniques. The results indicate that the proposed method in~y help militate against easy identity changes and reputation manipulation in electronic markets. © 2008 M.E. Sharpe, Inc.
Abstract:
Web forums are frequently used as platforms for the exchange of information and opinions, as well as propaganda dissemination. But online content can be misused when the information being distributed, such as radical opinions, is unsolicited or inappropriate. However, radical opinion is highly hidden and distributed in Web forums, while non-radical content is unspecific and topically more diverse. It is costly and time consuming to label a large amount of radical content (positive examples) and non-radical content (negative examples) for training classification systems. Nevertheless, it is easy to obtain large volumes of unlabeled content in Web forums. In this paper, we propose and develop a topic-sensitive partially supervised learning approach to address the difficulties in radical opinion identification in hate group Web forums. Specifically, we design a labeling heuristic to extract high quality positive examples and negative examples from unlabeled datasets. The empirical evaluation results from two large hate group Web forums suggest that our proposed approach generally outperforms the benchmark techniques and exhibits more stable performance than its counterparts. © 2012 IEEE.
Abstract:
Cross-language information retrieval (CLIR) and multilingual information retrieval (MLIR) techniques have been widely studied, but they are not often applied to and evaluated for Web applications. In this paper, we present our research in developing and evaluating a multilingual English-Chinese Web portal in the business domain. A dictionary-based approach has been adopted that combines phrasal translation, co-occurrence analysis, and pre- and post-translation query expansion. The approach was evaluated by domain experts and the results showed that co-occurrence-based phrasal translation achieved a 74.6% improvement in precision when compared with simple word-by-word translation. © Springer-Verlag Berlin Heidelberg 2003.
Abstract:
Identity management is critical to various governmental practices ranging from providing citizens services to enforcing homeland security. The task of searching for a specific identity is difficult because multiple identity representations may exist due to issues related to unintentional errors and intentional deception. We propose a probabilistic Naïve Bayes model that improves existing identity matching techniques in terms of effectiveness. Experiments show that our proposed model performs significantly better than the exact-match based technique as well as the approximate-match based record comparison algorithm. In addition, our model greatly reduces the efforts of manually labeling training instances by employing a semi-supervised learning approach. This training method outperforms both fully supervised and unsupervised learning. With a training dataset that only contains 10% labeled instances, our model achieves a performance comparable to that of a fully supervised learning.