I know, it’s been just about one year since the last version (0.0.7) was released… please forgive me! 😉
In the meantime, PDF Clown has grown considerably to provide rich text extraction functionality in its upcoming 0.0.8 version:
- the font model has been thoroughly revised and expanded to handle character encoding issues smoothly;
- the content stream model has been further harmonized to simplify access to text content;
- the content scanner now has a simpler iteration mechanism and a new level of abstraction that makes it easy to detect object placement (the coordinates of images and text characters);
- a text extraction tool supports sub-page region selection, so that text can be extracted only from specific page areas.
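
To give a rough idea of what sub-page region selection means in practice, here is a minimal, library-agnostic sketch in Python. It is not PDF Clown's API: the `Box`, `PlacedChar` and `extract_region` names are hypothetical, and it simply assumes that a scanner has already produced each character together with its bounding box on the page.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float

    def intersects(self, other: "Box") -> bool:
        # Two boxes overlap only if they overlap on both axes.
        return (
            self.x < other.x + other.width
            and other.x < self.x + self.width
            and self.y < other.y + other.height
            and other.y < self.y + self.height
        )


@dataclass(frozen=True)
class PlacedChar:
    """A character plus the box it occupies on the page (hypothetical type)."""
    value: str
    box: Box


def extract_region(chars: Iterable[PlacedChar], region: Box) -> str:
    """Join, in input order, the characters whose boxes overlap the region."""
    return "".join(c.value for c in chars if c.box.intersects(region))


# Made-up sample data: two characters near the page origin, one far away.
sample_chars = [
    PlacedChar("H", Box(10, 10, 5, 8)),
    PlacedChar("i", Box(15, 10, 3, 8)),
    PlacedChar("!", Box(200, 300, 3, 8)),
]
print(extract_region(sample_chars, Box(0, 0, 50, 50)))  # prints "Hi"
```

The sketch keeps characters in the order they arrive; a real extractor would also have to sort them into reading order and decide how to treat characters that only partially overlap the selected area.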