Versatile document image content extraction
16 January 2006
Henry S. Baird, Michael A. Moll, Jean Nonnemaker, Matthew R. Casey, Don L. Delorenzo
Proceedings Volume 6067, Document Recognition and Retrieval XIII; 60670R (2006) https://doi.org/10.1117/12.650359
Event: Electronic Imaging 2006, San Jose, California, United States
Abstract
We offer a preliminary report on a research program to investigate versatile algorithms for document image content extraction, that is, locating regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc. Solving this problem in its full generality requires coping with a vast diversity of document and image types. Automatically trainable methods are highly desirable, as is extremely high speed, in order to process large collections. Significant obstacles include the expense of preparing correctly labeled ("ground-truthed") samples, unresolved methodological questions in specifying the domain (e.g., what is a representative collection of document images?), and a lack of consensus among researchers on how to evaluate content-extraction performance. Our research strategy emphasizes versatility first: that is, we concentrate at the outset on designing methods that promise to work across the broadest possible range of cases. This strategy has several important implications: the classifiers must be trainable in reasonable time on vast data sets, and expensive ground-truthed data sets must be complemented by amplification using generative models. These and other design and architectural issues are discussed. We propose a trainable classification methodology that marries k-d trees and hash-driven table lookup, and we describe preliminary experiments.
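The abstract does not spell out how k-d trees and hash-driven table lookup are combined, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that feature space is cut k-d-tree style (median splits, cycling through the dimensions), that each resulting cell is named by its string of left/right decisions, and that both the cuts and the per-cell class counts live in hash tables rather than in a linked tree. The names `train`, `classify`, `cuts`, `bins`, and `depth` are hypothetical.

```python
from collections import Counter
from statistics import median


def train(samples, labels, depth):
    """Partition feature space with k-d-style median cuts, stored in hash tables.

    samples: list of equal-length tuples of floats (feature vectors).
    labels:  list of class names, one per sample.
    depth:   maximum number of cuts along any path (an assumed tuning knob).
    """
    cuts = {}  # bit-prefix string -> (dimension, threshold)
    bins = {}  # bit-address string of a cell -> Counter of class labels
    n_dims = len(samples[0])

    def split(indices, prefix):
        # Stop cutting when the path is long enough or the cell is nearly empty.
        if len(prefix) == depth or len(indices) <= 1:
            bins[prefix] = Counter(labels[i] for i in indices)
            return
        dim = len(prefix) % n_dims  # cycle through dimensions, as in a k-d tree
        thr = median(samples[i][dim] for i in indices)
        cuts[prefix] = (dim, thr)
        below = [i for i in indices if samples[i][dim] < thr]
        above = [i for i in indices if samples[i][dim] >= thr]
        split(below, prefix + "0")
        split(above, prefix + "1")

    split(list(range(len(samples))), "")
    return cuts, bins


def classify(x, cuts, bins, default=None):
    """Compute the cell address of x by table lookups, then vote within that cell."""
    prefix = ""
    while prefix in cuts:
        dim, thr = cuts[prefix]
        prefix += "1" if x[dim] >= thr else "0"
    votes = bins.get(prefix)
    if not votes:  # the cell received no training samples
        return default
    return votes.most_common(1)[0][0]


# Toy usage with made-up two-dimensional features and two content classes.
samples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
           (10.0, 10.1), (10.2, 10.0), (10.1, 10.3), (10.3, 10.2)]
labels = ["text"] * 4 + ["photo"] * 4
cuts, bins = train(samples, labels, depth=2)
print(classify((0.15, 0.15), cuts, bins))
print(classify((9.9, 10.0), cuts, bins))
```

In this sketch, classifying a sample costs at most `depth` dictionary lookups regardless of training-set size, which is the kind of speed property the abstract asks for. The trade-off is that a majority vote within a cell only approximates a true nearest-neighbor decision near cell boundaries.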
© (2006) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Henry S. Baird, Michael A. Moll, Jean Nonnemaker, Matthew R. Casey, and Don L. Delorenzo "Versatile document image content extraction", Proc. SPIE 6067, Document Recognition and Retrieval XIII, 60670R (16 January 2006); https://doi.org/10.1117/12.650359
CITATIONS
Cited by 18 scholarly publications and 2 patents.
KEYWORDS
Feature extraction
Data modeling
Associative arrays
Image classification
Pattern recognition
Visualization
Advanced distributed simulations