With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new
and important problem in the field of document analysis. In this paper, we present a method for identifying embedded
mathematical formulas in PDF documents, based on a Support Vector Machine (SVM). The method first segments text
lines into words, and then classifies each word into one of two classes: formula or ordinary text. Various features of
embedded formulas, including geometric layout, character content, and context, are used to build a robust and
adaptable SVM classifier. Embedded formulas are then extracted by merging the words labeled as formulas.
Experimental results show good performance of the proposed method. Furthermore, the method has been successfully
incorporated into a commercial software package for large-scale e-Book production.
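
The final merging step can be illustrated with a minimal sketch. It assumes that each word in a text line already carries a label from the classifier, and that words are merged purely by adjacency within the line; the abstract does not specify the merging rule, so this is an assumption, and the function name `merge_formula_words` and the label constants are hypothetical. The labels below are hard-coded for illustration and stand in for classifier output.

```python
from typing import List, Tuple

# Hypothetical label names; the abstract only says "formula" vs. "ordinary text".
FORMULA = "formula"
TEXT = "text"


def merge_formula_words(words: List[Tuple[str, str]]) -> List[str]:
    """Merge runs of consecutive formula-labeled words into formula strings.

    `words` is one text line as (token, label) pairs in reading order.
    Adjacency-based merging is an assumption, not the paper's stated rule.
    """
    formulas: List[str] = []
    current: List[str] = []
    for token, label in words:
        if label == FORMULA:
            current.append(token)
        elif current:
            # An ordinary-text word closes the formula run collected so far.
            formulas.append(" ".join(current))
            current = []
    if current:
        # Flush a formula run that extends to the end of the line.
        formulas.append(" ".join(current))
    return formulas


# Hand-labeled example line (labels are illustrative, not classifier output).
line = [
    ("where", TEXT),
    ("x", FORMULA),
    ("=", FORMULA),
    ("y+1", FORMULA),
    ("and", TEXT),
    ("z", FORMULA),
]
print(merge_formula_words(line))  # ['x = y+1', 'z']
```

In this sketch, a single ordinary-text word between two formula words splits them into separate embedded formulas; a real system might also use geometric gaps between words when deciding where a formula ends.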