Hsinchun Chen
Publications
Abstract:
As the Web continues to grow, it has become increasingly difficult to search for relevant information using traditional search engines. Topic-specific search engines provide an alternative way to support efficient information retrieval on the Web by providing more precise and customized searching in various domains. However, developers of topic-specific search engines need to address two issues: how to locate relevant documents (URLs) on the Web and how to filter out irrelevant documents from a set of documents collected from the Web. This paper reports our research in addressing the second issue. We propose a machine-learning-based approach that combines Web content analysis and Web structure analysis. We represent each Web page by a set of content-based and link-based features, which can be used as the input for various machine learning algorithms. The proposed approach was implemented using both a feedforward/backpropagation neural network and a support vector machine. Two experiments were designed and conducted to compare the proposed Web-feature approach with two existing Web page filtering methods - a keyword-based approach and a lexicon-based approach. The experimental results showed that the proposed approach in general performed better than the benchmark approaches, especially when the number of training documents was small. The proposed approaches can be applied in topic-specific search engine development and other Web applications such as Web content management. © 2007 Elsevier B.V. All rights reserved.
Abstract:
Web directories organize voluminous information into hierarchical structures, helping users to quickly locate relevant information and to support decision-making. The development of existing ontologies and Web directories either relies on expert participation that may not be available or uses automatic approaches that lack precision. As more users access the Web in their native languages, better approaches to organizing and developing non-English Web directories are needed. In this paper, we have proposed a semi-automatic framework, which consists of anchor directory boosting, meta-searching, and heuristic filtering, to construct domain-specific Web directories. Using the framework, we have built a Web directory in the Spanish business (SBiz) domain. Experimental results show that the SBiz Web directory achieved significantly better recall, F-value, efficiency, and satisfaction rating than the benchmark directory. Subjects provided favorable comments on the SBiz Web directory. This research thus contributes to developing a useful framework for organizing domain-specific information on the Web and to providing empirical findings and useful insights for end-users, system developers, and researchers of Web information seeking and knowledge management. © 2007 Elsevier Ltd. All rights reserved.
Abstract:
More and more women are participating in and exchanging opinions through community-based online social media. Questions concerning gender differences in the new media have been raised. This paper proposes a feature-based text classification framework to examine online gender differences between Web forum posters by analyzing writing styles and topics of interest. Our experiment on an Islamic women's political forum shows that feature sets containing both content-free and content-specific features perform significantly better than those consisting of only content-free features, feature selection can improve the classification results significantly, and female and male participants have significantly different topics of interest. © 2011 IEEE.
Abstract:
Information retrieval using probabilistic techniques has attracted significant attention on the part of researchers in information and computer science over the past few decades. In the 1980s, knowledge-based techniques also made an impressive contribution to "intelligent" information retrieval and indexing. More recently, information science researchers have turned to other newer inductive learning techniques including symbolic learning, genetic algorithms, and simulated annealing. These newer techniques, which are grounded in diverse paradigms, have provided great opportunities for researchers to enhance the information processing and retrieval capabilities of current information systems. In this article, we first provide an overview of these newer techniques and their use in information retrieval research. In order to familiarize readers with the techniques, we present three promising methods: The symbolic ID3 algorithm, evolution-based genetic algorithms, and simulated annealing. We discuss their knowledge representations and algorithms in the unique context of information retrieval. An experiment using a 8000-record COMPEN database was performed to examine the performances of these inductive query-by-example techniques in comparison with the performance of the conventional relevance feedback method. The machine learning techniques were shown to be able to help identify new documents which are similar to documents initially suggested by users, and documents which contain similar concepts to each other. Genetic algorithms, in particular, were found to out-perform relevance feedback in both document recall and precision. We believe these inductive machine learning techniques hold promise for the ability to analyze users' preferred documents (or records), identify users' underlying information needs, and also suggest alternatives for search for database management systems and Internet applications.