PDF Clown 0.0.8 released

LATEST NEWS — PDF Clown 0.0.8 functionality is part of the latest release (PDF Clown 0.1.0). As the 0.0 version series is being decommissioned, you’re warmly invited to adopt the current 0.1 version series. Thank you!

This release focuses on text extraction support: a specialized tool provides, along with plain-text extraction, advanced functionality such as the full graphic state of the extracted text (font, font size, text color, text rendering mode, text position…), text filtering by area, and text grouping and sorting. Many minor improvements have been applied too.

The Java version has migrated to the Java 6 platform, while the C#/.NET version has migrated to .NET 3.5.

LGPL 3 is the new license applied to the project.

Last but not least: the distribution’s directory structure has been revised to simplify its navigation and ease its integration with common IDEs (Eclipse- and Visual Studio-compatible).

This release may be downloaded from:
https://sourceforge.net/projects/clown/files/PDFClown-devel/0.0.8%20Alpha/

Enjoy!

Patches

See PDF Clown 0.0.8 patches.

PDF Clown 0.0.8 — Text extraction

I know, it’s been just about one year since the latest version (0.0.7) was released… please, forgive me! 😉

In the meantime, PDF Clown has grown considerably to provide rich text extraction functionality in its upcoming 0.0.8 version:

  • the font model has been deeply revised and expanded to handle character encoding issues smoothly;
  • the content stream model has been further harmonized to simplify access to text contents;
  • the content scanner has a simpler iteration mechanism and a new level of abstraction that allows easy detection of object placement (image and text character coordinates);
  • a text extraction tool allows sub-page region selection, so that text can be extracted only from specific page areas.
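To give a rough idea of what region-based extraction involves, here is a minimal Java sketch. It is not PDF Clown's API: `TextFragment`, `AreaTextFilter` and their methods are hypothetical names made up for illustration. The sketch assumes each extracted piece of text comes with a bounding box and that the y coordinate grows downward from the top of the page.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a piece of extracted text and its bounding box
// (illustration only, not a PDF Clown type).
final class TextFragment {
    final String text;
    final Rectangle2D box;

    TextFragment(String text, Rectangle2D box) {
        this.text = text;
        this.box = box;
    }
}

// Hypothetical helper: filters fragments by page area and sorts them.
final class AreaTextFilter {
    // Keeps only the fragments whose box lies entirely inside the area.
    static List<TextFragment> filterByArea(List<TextFragment> fragments, Rectangle2D area) {
        List<TextFragment> result = new ArrayList<TextFragment>();
        for (TextFragment fragment : fragments) {
            if (area.contains(fragment.box)) {
                result.add(fragment);
            }
        }
        return result;
    }

    // Sorts fragments top to bottom, then left to right.
    // Assumption: y grows downward (top-left page origin).
    static void sortReadingOrder(List<TextFragment> fragments) {
        Collections.sort(fragments, new Comparator<TextFragment>() {
            public int compare(TextFragment a, TextFragment b) {
                int byY = Double.compare(a.box.getY(), b.box.getY());
                return byY != 0 ? byY : Double.compare(a.box.getX(), b.box.getX());
            }
        });
    }

    // Filters by area, sorts, and joins the surviving fragments with spaces.
    static String extract(List<TextFragment> fragments, Rectangle2D area) {
        List<TextFragment> selected = filterByArea(fragments, area);
        sortReadingOrder(selected);
        StringBuilder out = new StringBuilder();
        for (TextFragment fragment : selected) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(fragment.text);
        }
        return out.toString();
    }
}
```

The sketch requires a fragment to be fully contained in the area; a real tool might instead accept partial overlaps (`intersects`), or group fragments into lines with a tolerance on the vertical position before sorting.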
