We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words associated with whole images (auto-annotation) and corresponding to particular image regions (region naming). Auto-annotation might help organize and access large collections of images. Region naming is a model of object recognition as a process of translating image regions to words, much as one might translate from one language to another. Learning the relationships between image regions and semantic correlates (words) is an interesting example of multi-modal data mining, particularly because it is typically hard to apply data mining techniques to collections of images. We develop a number of models for the joint distribution of image regions and words, including several which explicitly learn the correspondence between regions and words. We study multi-modal and correspondence extensions to Hofmann's hierarchical clustering/aspect model, a translation model adapted from statistical machine translation (Brown et al.), and a multi-modal extension to mixture of latent Dirichlet allocation (MoM-LDA). All models are assessed using a large collection of annotated images of real scenes. We study in depth the difficult problem of measuring performance. For the annotation task, we look at prediction performance on held-out data. We present three alternative measures, oriented toward different types of task. Measuring the performance of correspondence methods is harder, because one must determine whether a word has been placed on the right region of an image. We can use annotation performance as a proxy measure, but accurate measurement requires hand-labeled data, and thus must occur on a smaller scale. We show results using both an annotation proxy and manually labeled data.
Fluorescent surfaces are common in the modern world, but they present problems for machine color constancy because fluorescent reflection typically violates the assumptions needed by most algorithms. The complexity of fluorescent reflection is likely one of the reasons why fluorescent surfaces have escaped the attention of computational color constancy researchers. In this paper we take some initial steps to rectify this omission. We begin by introducing a simple method for characterizing fluorescent surfaces. It is based on direct measurements, and thus has low error and avoids the need to develop a comprehensive and accurate physical model. We then modify and extend several modern color constancy algorithms to address fluorescence. The algorithms considered are CRULE and derivatives, Color by Correlation, and neural net methods. Adding fluorescence to Color by Correlation and neural net methods is relatively straightforward, but CRULE requires modification so that its complete reliance on diagonal models can be relaxed. We present results for both synthetic and real image data for fluorescence-capable versions of CRULE and Color by Correlation, and we compare the results with the standard versions of these and other algorithms.
We introduce a context for testing computational color constancy, specify our approach to the implementation of a number of the leading algorithms, and report the results of three experiments using synthesized data. Experiments using synthesized data are important because the ground truth is known, possible confounds due to camera characterization and pre-processing are absent, and various factors affecting color constancy can be efficiently investigated because they can be manipulated individually and precisely. The algorithms chosen for close study include two gray world methods, a limiting case of a version of the Retinex method, a number of variants of Forsyth's gamut-mapping method, Cardei et al.'s neural net method, and Finlayson et al.'s Color by Correlation method. We investigate the ability of these algorithms to make estimates of three different color constancy quantities: the chromaticity of the scene illuminant, the overall magnitude of that illuminant, and a corrected, illumination-invariant image. We consider algorithm performance as a function of the number of surfaces in scenes generated from reflectance spectra, the relative effect on the algorithms of added specularities, and the effect of subsequent clipping of the data. All data is available online at http://www.cs.sfu.ca/~color/data, and implementations for most of the algorithms are also available (http://www.cs.sfu.ca/~color/code).
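As a rough illustration of the quantities named in this abstract, the following minimal sketch shows the simplest form of a gray world estimate: the scene average is taken as the illuminant, its chromaticity is computed, and a diagonal (per-channel) scaling produces a corrected image. This is not the paper's implementation; the function names, the toy two-surface scene, and the choice of an equal-RGB canonical illuminant are all assumptions made for the example.

```python
def gray_world_illuminant(pixels):
    """Estimate the illuminant RGB as the mean pixel value (gray world assumption)."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def chromaticity(rgb):
    """Return (r, g) chromaticity, where r = R/(R+G+B) and g = G/(R+G+B)."""
    total = sum(rgb)
    return (rgb[0] / total, rgb[1] / total)


def diagonal_correct(pixels, illuminant, canonical=(1.0, 1.0, 1.0)):
    """Scale each channel independently so the estimated illuminant maps to the
    canonical one. The equal-RGB canonical illuminant is an assumption."""
    scale = [canonical[c] / illuminant[c] for c in range(3)]
    return [tuple(p[c] * scale[c] for c in range(3)) for p in pixels]


# Hypothetical synthetic scene: two surfaces whose average reflectance is gray,
# lit by a reddish light. Pixel value = reflectance * light, per channel.
surfaces = [(0.2, 0.4, 0.6), (0.8, 0.6, 0.4)]
light = (1.0, 0.8, 0.5)
pixels = [tuple(s[c] * light[c] for c in range(3)) for s in surfaces]

estimate = gray_world_illuminant(pixels)
corrected = diagonal_correct(pixels, estimate)

print("estimated chromaticity:", chromaticity(estimate))
print("true chromaticity:     ", chromaticity(light))
print("corrected pixels:      ", corrected)
```

Because the toy scene averages to gray by construction, the estimated chromaticity matches the true one here; with few surfaces or a biased reflectance set, the gray world assumption fails, which is the kind of dependence on scene content that synthetic experiments can probe.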
This venue is a peer-reviewed, competitive conference (acceptance rate: 26%) and the full paper is published as part of the conference proceedings [CSRanking endorsed, A*].
We develop and demonstrate an object recognition system capable of accurately detecting, localizing, and recovering the kinematic configuration of textured animals in real images. We build a deformation model of shape automatically from videos of animals and an appearance model of texture from a labeled collection of animal images, and combine the two models automatically. We develop a simple texture descriptor that outperforms the state of the art. We test our animal models on two datasets: images taken by professional photographers from the Corel collection, and assorted images from the web returned by Google. We demonstrate quite good performance on both datasets. Comparing our results with simple baselines, we show that for the Google set, we can recognize objects from a collection demonstrably hard for object recognition. © 2005 IEEE.