Hsinchun Chen
Publications
Abstract:
TXTRACTOR is a tool that uses established sentence-selection heuristics to rank text segments, producing summaries that contain a user-defined number of sentences. Text segments are identified to maximize topic diversity, an adaptation of the Maximal Marginal Relevance criterion of Carbonell and Goldstein [5]. Sentence-selection heuristics are then used to rank the segments. We hypothesize that ranking text segments via traditional sentence-selection heuristics produces a more balanced summary, with more useful information, than one produced by segmentation alone. The summary is created in a three-step process: 1) sentence evaluation, 2) segment identification, and 3) segment ranking. As the required length of the summary changes, low-ranking segments can be dropped from (or higher-ranking segments added to) the summary. To validate the approach, we compared the output of TXTRACTOR to that of a segmentation tool based on the TextTiling algorithm.
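The three-step process described in the abstract can be sketched as follows. The cue-word list, scoring heuristics, and fixed segment boundaries below are simplified, invented stand-ins for illustration; they are not TXTRACTOR's actual heuristics or its segmentation algorithm.

```python
# Sketch of the three-step process: 1) score each sentence with simple
# selection heuristics, 2) take pre-identified topic segments, 3) rank
# segments and pick their best sentences up to the target length k.
# CUE_WORDS and the weighting scheme are hypothetical examples.

CUE_WORDS = {"significant", "propose", "results"}

def score_sentence(sentence: str, position: int, total: int) -> float:
    words = sentence.lower().split()
    cue = sum(w.strip(".,") in CUE_WORDS for w in words)  # cue-phrase heuristic
    lead = 1.0 if position == 0 else 0.0                  # leading-sentence bonus
    length = min(len(words), 25) / 25.0                   # length heuristic, capped
    return cue + lead + length

def summarize(sentences: list[str], segments: list[list[int]], k: int) -> list[str]:
    scores = [score_sentence(s, i, len(sentences)) for i, s in enumerate(sentences)]
    # Rank segments by their highest-scoring sentence; taking at most one
    # sentence per segment keeps the summary topically diverse, and lowering
    # or raising k drops or adds whole segments.
    ranked = sorted(segments, key=lambda seg: max(scores[i] for i in seg), reverse=True)
    chosen: list[int] = []
    for seg in ranked:
        if len(chosen) < k:
            chosen.append(max(seg, key=lambda i: scores[i]))
    return [sentences[i] for i in sorted(chosen)]
```

Because each segment contributes at most one sentence here, changing `k` adds or drops segments exactly as the abstract describes.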
Abstract:
As fake website developers become more innovative, so too must the tools used to protect Internet users. A proposed system combines a support vector machine classifier and a rich feature set derived from website text, linkage, and images to better detect fraudulent sites. © 2009 IEEE.
Abstract:
Syndromic surveillance can play an important role in protecting the public's health against infectious diseases. Infectious disease outbreaks can have a devastating effect on society as well as the economy, and global awareness is therefore critical to protecting against major outbreaks. By monitoring online news sources and developing an accurate news classification system for syndromic surveillance, public health personnel can be apprised of outbreaks and potential outbreak situations. In this study, we have developed a framework for automatic online news monitoring and classification for syndromic surveillance. The framework is unique, and none of the techniques adopted in this study have previously been used in the context of syndromic surveillance on infectious diseases. In recent classification experiments, we compared the performance of different feature subsets with different machine learning algorithms. The results showed that combined feature subsets including Bag of Words, Noun Phrases, and Named Entities features outperformed the Bag of Words subset alone. Furthermore, feature selection improved the performance of the feature subsets in online news classification. The highest classification performance was achieved using an SVM on the selected combined feature subset. © 2009 Elsevier B.V. All rights reserved.
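The combined feature representation described above can be sketched as follows. The noun phrases and named entities are assumed to be pre-extracted by upstream tools (not shown), and the document-frequency selection threshold is an illustrative stand-in for the study's feature selection; a real pipeline would feed the selected vectors to an SVM.

```python
# Sketch: combine Bag of Words tokens with (pre-extracted, hypothetical)
# noun-phrase and named-entity annotations into one feature multiset,
# then keep only features that occur in at least min_df documents.
from collections import Counter

def combined_features(text: str, noun_phrases: list[str], entities: list[str]) -> Counter:
    feats = Counter(f"bow={w}" for w in text.lower().split())
    feats.update(f"np={p.lower()}" for p in noun_phrases)   # noun-phrase features
    feats.update(f"ne={e.lower()}" for e in entities)       # named-entity features
    return feats

def select_features(docs: list[Counter], min_df: int = 2) -> set[str]:
    df = Counter()
    for d in docs:
        df.update(set(d))          # document frequency, not raw term count
    return {f for f, n in df.items() if n >= min_df}
```

Prefixing each feature with its subset name (`bow=`, `np=`, `ne=`) keeps the subsets distinguishable inside a single vector, so they can be ablated against one another as in the experiments.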
Abstract:
While the Web has become a worldwide platform for communication, terrorists share their ideology and communicate with members on the "Dark Web" - the side of the Web they use for these purposes. Currently, information overload and the difficulty of obtaining a comprehensive picture of terrorist activities hinder effective and efficient analysis of terrorist information on the Web. To improve understanding of terrorist activities, we have developed a novel methodology for collecting and analyzing Dark Web information. The methodology incorporates information collection, analysis, and visualization techniques, and exploits various Web information sources. We applied it to collecting and analyzing information from 39 Jihad Web sites and developed visualizations of their site contents, relationships, and activity levels. An expert evaluation showed that the methodology is useful and promising, with high potential to assist in the investigation and understanding of terrorist activities and to produce results that could help guide both policymaking and intelligence research.
PMID: 15759623
Abstract:
Systems that extract biological regulatory pathway relations from free-text sources are intended to help researchers leverage vast and growing collections of research literature. Several systems to extract such relations have been developed, but little work has focused on how those relations can be usefully organized (aggregated) to support visualization systems or analysis algorithms. Ontological resources that enumerate name strings for different types of biomedical objects should play a key role in the organization process. In this paper we delineate five potentially useful levels of relational granularity and propose the use of aggregatable substance identifiers to help reduce lexical ambiguity. An aggregatable substance identifier applies to a gene and its products. We merged four extensive lexicons and compared the extracted strings to the text of five million MEDLINE abstracts. We report on the ambiguity within and between name strings and common English words. Our results show an 89% reduction in ambiguity for the extracted human substance name strings when using an aggregatable substance approach.
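The aggregation idea can be illustrated with a minimal sketch: a name string is lexically ambiguous when it maps to more than one identifier, and collapsing a gene and its products onto a single aggregatable substance identifier removes that source of ambiguity. The tiny lexicon and identifier scheme below are invented purely for illustration.

```python
# Sketch: count ambiguous name strings before and after mapping a gene
# and its products to one aggregatable substance identifier.
# The lexicon entries and the GENE:/PROT:/SUBST: identifier scheme are
# hypothetical examples, not a real ontological resource.

def ambiguous_strings(lexicon: dict[str, set[str]]) -> set[str]:
    """Name strings that map to more than one identifier."""
    return {name for name, ids in lexicon.items() if len(ids) > 1}

# Object level: the same string names both a gene and its protein product.
object_level = {
    "p53":     {"GENE:TP53", "PROT:P04637"},
    "insulin": {"GENE:INS",  "PROT:P01308"},
}
# Substance level: gene + products share one aggregatable identifier.
substance_level = {
    "p53":     {"SUBST:TP53"},
    "insulin": {"SUBST:INS"},
}
```

At the object level both strings are ambiguous; at the substance level neither is, which is the mechanism behind the reduction in ambiguity the paper reports.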