16 January 2006 Complex document information processing: prototype, test collection, and evaluation
G. Agam, S. Argamon, O. Frieder, D. Grossman, D. Lewis
Proceedings Volume 6067, Document Recognition and Retrieval XIII; 60670N (2006) https://doi.org/10.1117/12.662918
Event: Electronic Imaging 2006, 2006, San Jose, California, United States
Abstract
Analysis of large collections of complex documents is an increasingly important need for numerous applications. Complex documents typically start out on paper and are then electronically scanned. They have rich internal structure and might be available only in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting) and may include diagrams, graphics, tables, and other non-textual elements. The state of the art today for a large document collection is essentially text search of OCR'd documents, with no meaningful use of data found in images, signatures, logos, etc. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are also developing a complex document test collection of roughly 42,000,000 pages. The collection will include relevance judgments for queries at a variety of levels of detail and for queries that depend on a variety of content and structural characteristics of documents, as well as "known item" queries that look for particular documents.
© (2006) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis "Complex document information processing: prototype, test collection, and evaluation", Proc. SPIE 6067, Document Recognition and Retrieval XIII, 60670N (16 January 2006); https://doi.org/10.1117/12.662918
CITATIONS
Cited by 19 scholarly publications.
KEYWORDS
Prototyping
Optical character recognition
Image processing
Data mining
Document image analysis
Data processing
Databases
