PDF Clown 0.0.8 — Text extraction

LATEST NEWS — PDF Clown 0.0.8 functionalities are part of the latest release (PDF Clown 0.1.0). As the 0.0 version series is being decommissioned, you’re warmly invited to adopt the current 0.1 version series. Thank you!

I know, it’s been about a year since the last version (0.0.7) was released… please forgive me! 😉

In the meantime, PDF Clown has grown considerably to provide rich text extraction functionality in its upcoming 0.0.8 version:

  • the font model has been deeply revised and expanded to handle character encoding issues smoothly;
  • the content stream model has been further harmonized to simplify access to text content;
  • the content scanner has a simpler iteration mechanism and a new level of abstraction that makes it easy to detect object placement (the coordinates of images and text characters);
  • a text extraction tool allows sub-page region selection to extract text only from specific page areas.
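
To give a rough idea of what sub-page region selection means in practice, here is a minimal sketch in plain Java. It is not PDF Clown's API: the `RegionTextFilter` class, the `PositionedChar` type and the `extract` method are hypothetical names invented for this illustration, and the sketch simply assumes that a scanner has already produced each character together with its bounding box.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: none of these names belong to PDF Clown.
final class RegionTextFilter {

    /** Stand-in for a scanned text character: its value plus its bounding box on the page. */
    static final class PositionedChar {
        final char value;
        final Rectangle2D box;

        PositionedChar(char value, Rectangle2D box) {
            this.value = value;
            this.box = box;
        }
    }

    /**
     * Keeps only the characters whose box lies entirely inside the region,
     * preserving the order in which they were supplied.
     */
    static String extract(List<PositionedChar> chars, Rectangle2D region) {
        StringBuilder text = new StringBuilder();
        for (PositionedChar c : chars) {
            if (region.contains(c.box)) {
                text.append(c.value);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: two characters near the page origin, one far away.
        List<PositionedChar> chars = Arrays.asList(
            new PositionedChar('H', new Rectangle2D.Double(10, 10, 5, 8)),
            new PositionedChar('i', new Rectangle2D.Double(16, 10, 3, 8)),
            new PositionedChar('!', new Rectangle2D.Double(200, 300, 3, 8)));

        // Select a 100 x 100 area anchored at the page origin.
        Rectangle2D area = new Rectangle2D.Double(0, 0, 100, 100);
        System.out.println(extract(chars, area));
    }
}
```

The sketch uses full containment as the selection rule; a real tool might prefer intersection instead, so that characters straddling the region border are not lost.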
