PDF Clown 0.0.8 — Text extraction

LATEST NEWS: PDF Clown 0.0.8 functionalities are part of the latest release (PDF Clown 0.1.0). As the 0.0 version series is being decommissioned, you’re warmly invited to adopt the current 0.1 version series. Thank you!

I know, it’s been just about one year since the latest version (0.0.7) was released… please, forgive me! 😉

In the meantime PDF Clown has been growing considerably to provide a rich text extraction functionality for its next 0.0.8 version:

  • the font model has been deeply revised and expanded to handle character encoding issues smoothly;
  • the content stream model has been further harmonized to simplify access to text contents;
  • the content scanner has been simplified in its iterative mechanism and enriched with a new level of abstraction that allows easy detection of object placement (image and text character coordinates);
  • a text extraction tool allows sub-page region selection to extract text only from specific page areas.

While the current development iteration is being completed, let’s look at some of the new stuff!

NOTE: the following code samples are written as extensions of the Sample class common to all the CLI samples shipped with the PDF Clown 0.0.8 downloadable distribution.

1. Basic text extraction

This code sample demonstrates the most basic way to extract text content according to PDF Clown 0.0.8.

package it.stefanochizzolini.clown.samples;

import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.documents.contents.ContentScanner;
import it.stefanochizzolini.clown.documents.contents.fonts.Font;
import it.stefanochizzolini.clown.documents.contents.objects.ContainerObject;
import it.stefanochizzolini.clown.documents.contents.objects.ContentObject;
import it.stefanochizzolini.clown.documents.contents.objects.ShowText;
import it.stefanochizzolini.clown.documents.contents.objects.Text;
import it.stefanochizzolini.clown.files.File;

import java.util.HashMap;
import java.util.Map;

public class BasicTextExtractionSample
  extends Sample
{
  @Override
  public boolean run(
    )
  {
    String filePath = promptPdfFileChoice("Please select a PDF file");

    // 1. Open the PDF file!
    File file;
    try
    {file = new File(filePath);}
    catch(Exception e)
    {throw new RuntimeException(filePath + " file access error.",e);}

    // 2. Get the PDF document!
    Document document = file.getDocument();

    // 3. Extracting text from the document pages...
    for(Page page : document.getPages())
    {
      if(!prompt(page))
        return false;

      extract(
        new ContentScanner(page) // Wraps the page contents into a scanner.
        );
    }

    return true;
  }

  /**
    Scans a content level looking for text.
  */
  /*
    NOTE: Page contents are represented by a sequence of content objects,
    possibly nested into multiple levels.
  */
  private void extract(
    ContentScanner level
    )
  {
    if(level == null)
      return;

    while(level.moveNext())
    {
      ContentObject content = level.getCurrent();
      if(content instanceof ShowText)
      {
        Font font = level.getState().font;
        // Extract the current text chunk, decoding it!
        System.out.println(font.decode(((ShowText)content).getText()));
      }
      else if(content instanceof Text
        || content instanceof ContainerObject)
      {
        // Scan the inner level!
        extract(level.getChildLevel());
      }
    }
  }

  private boolean prompt(
    Page page
    )
  {
    int pageIndex = page.getIndex();
    if(pageIndex > 0)
    {
      Map<String,String> options = new HashMap<String,String>();
      options.put("", "Scan next page");
      options.put("Q", "End scanning");
      if(!promptChoice(options).equals(""))
        return false;
    }

    System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
    return true;
  }
}

In order to understand this sample, you have to know that the PDF Specification prescribes text content to be shown through so-called ShowText operations; so, we look for that kind of object…

Here is how it works:

  1. iterate the document pages (step 3 of the run() method), wrapping the current page into a ContentScanner;
  2. iterate the current page contents (through ContentScanner) looking for ShowText operations (the extract(…) method), recursing into ContainerObject-s and Text objects;
  3. extract the text content from ShowText operations (the font.decode(…) call).

Incipit of the Japanese translation of the UN Universal Declaration of Human Rights

Applying this code sample to a document such as the Japanese translation of the UN Universal Declaration of Human Rights (see above), the result is pretty accurate, although the extracted text contains extra line breaks (see second row in the figure below): this discrepancy is due to the way the PDF Specification defines text data representation. In particular, the contents of ShowText operations may have been (legally) split at arbitrary points by the document generator, as at the time of its inception the PDF format was primarily aimed at typographic rendition rather than content accessibility. To address this, the above-mentioned TextExtractor tool provides appropriate heuristics to organize the extracted text in a more intelligible manner (see the following paragraphs).

Text extracted by PDF Clown from the incipit of the Japanese translation of the UN Universal Declaration of Human Rights.
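
To give an idea of what "organizing the extracted text" can mean, here is a minimal sketch of one possible heuristic: chunks whose vertical positions are close enough are treated as one line, then joined left to right, with a space inserted only where a visible gap separates them. This is not TextExtractor's actual algorithm; the ChunkLineMerger class, its Chunk type and the two threshold parameters are hypothetical names made up for this illustration, and y is assumed to grow downwards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of PDF Clown. */
final class ChunkLineMerger {
  /** A decoded text chunk plus the left edge, vertical position and width of its box. */
  static final class Chunk {
    final double x;
    final double y;
    final double width;
    final String text;

    Chunk(double x, double y, double width, String text) {
      this.x = x;
      this.y = y;
      this.width = width;
      this.text = text;
    }
  }

  /**
   * Groups chunks into lines (vertical distance from the first chunk of the line
   * within yTolerance) and joins each line left to right, adding a space where the
   * horizontal gap between two chunks exceeds gapThreshold.
   */
  static List<String> mergeIntoLines(List<Chunk> chunks, double yTolerance, double gapThreshold) {
    // Order chunks top to bottom (assumption: y grows downwards).
    List<Chunk> sorted = new ArrayList<>(chunks);
    sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y));

    // Collect chunks lying at (almost) the same height into the same line.
    List<List<Chunk>> lines = new ArrayList<>();
    for (Chunk chunk : sorted) {
      List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
      if (last != null && Math.abs(chunk.y - last.get(0).y) <= yTolerance) {
        last.add(chunk);
      } else {
        List<Chunk> line = new ArrayList<>();
        line.add(chunk);
        lines.add(line);
      }
    }

    // Within each line, read chunks left to right and glue adjacent ones together.
    List<String> result = new ArrayList<>();
    for (List<Chunk> line : lines) {
      line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
      StringBuilder builder = new StringBuilder();
      Chunk previous = null;
      for (Chunk chunk : line) {
        if (previous != null && chunk.x - (previous.x + previous.width) > gapThreshold) {
          builder.append(' ');
        }
        builder.append(chunk.text);
        previous = chunk;
      }
      result.add(builder.toString());
    }
    return result;
  }
}
```

Both thresholds are document-dependent: a tolerance that suits body text may wrongly merge superscripts or split lines set in large type, which is one reason real extractors need more elaborate heuristics than this.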

2. Extended text extraction

This code sample shows how to exploit the new abstraction level provided by the content scanner of PDF Clown 0.0.8, which allows you to get a rich set of information describing the graphic state of extracted text (font, font size, text color, text rendering mode, text bounding box…).

In order to demonstrate its precision in detecting text position, the following code also draws the bounding box of each single character appearing on the pages.

package it.stefanochizzolini.clown.samples;

import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.documents.contents.ContentScanner;
import it.stefanochizzolini.clown.documents.contents.TextChar;
import it.stefanochizzolini.clown.documents.contents.colorSpaces.DeviceRGBColor;
import it.stefanochizzolini.clown.documents.contents.composition.PrimitiveFilter;
import it.stefanochizzolini.clown.documents.contents.objects.ContainerObject;
import it.stefanochizzolini.clown.documents.contents.objects.ContentObject;
import it.stefanochizzolini.clown.documents.contents.objects.Text;
import it.stefanochizzolini.clown.files.File;
import it.stefanochizzolini.clown.tools.PageStamper;

import java.awt.geom.Rectangle2D;

public class TextInfoExtractionSample
  extends Sample
{
  private DeviceRGBColor[] textCharBoxColors = new DeviceRGBColor[]
    {
      new DeviceRGBColor(200f/255,100f/255,100f/255),
      new DeviceRGBColor(100f/255,200f/255,100f/255),
      new DeviceRGBColor(100f/255,100f/255,200f/255)
    };
  private DeviceRGBColor textStringBoxColor = DeviceRGBColor.Black;

  @Override
  public boolean run(
    )
  {
    String filePath = promptPdfFileChoice("Please select a PDF file");

    // 1. Open the PDF file!
    File file;
    try
    {file = new File(filePath);}
    catch(Exception e)
    {throw new RuntimeException(filePath + " file access error.",e);}

    // 2. Get the PDF document!
    Document document = file.getDocument();

    PageStamper stamper = new PageStamper(); // NOTE: Page stamper is used to draw contents on existing pages.

    // 3. Iterating through the document pages...
    for(Page page : document.getPages())
    {
      System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");

      stamper.setPage(page);

      extract(
        new ContentScanner(page), // Wraps the page contents into a scanner.
        stamper.getForeground()
        );

      stamper.flush();
    }

    serialize(file,false);

    return true;
  }

  /**
    Scans a content level looking for text.
  */
  private void extract(
    ContentScanner level,
    PrimitiveFilter builder
    )
  {
    if(level == null)
      return;

    while(level.moveNext())
    {
      ContentObject content = level.getCurrent();
      if(content instanceof Text)
      {
        ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)level.getCurrentWrapper();
        int colorIndex = 0;
        for(ContentScanner.TextStringWrapper textString : text.getTextStrings())
        {
          Rectangle2D stringBox = textString.getBox();
          System.out.println(
            "Text ["
              + "x:" + Math.round(stringBox.getX()) + ","
              + "y:" + Math.round(stringBox.getY()) + ","
              + "w:" + Math.round(stringBox.getWidth()) + ","
              + "h:" + Math.round(stringBox.getHeight())
              + "]: " + textString.getText()
              );

          // Drawing text character bounding boxes...
          colorIndex = (colorIndex + 1) % textCharBoxColors.length;
          builder.setStrokeColor(textCharBoxColors[colorIndex]);
          for(TextChar textChar : textString.getTextChars())
          {
            /*
              NOTE: You can get further text information
              (font, font size, text color, text rendering mode)
              through textChar.style.
             */
            builder.drawRectangle(textChar.box);
            builder.stroke();
          }

          // Drawing text string bounding box...
          builder.beginLocalState();
          builder.setLineDash(0, 5);
          builder.setStrokeColor(textStringBoxColor);
          builder.drawRectangle(textString.getBox());
          builder.stroke();
          builder.end();
        }
      }
      else if(content instanceof ContainerObject)
      {
        // Scan the inner level!
        extract(level.getChildLevel(),builder);
      }
    }
  }
}

This sample works the same way as the previous “1. Basic text extraction” sample, but it greatly enriches the extraction functionality by providing decoded text along with its graphic attributes, such as font, font size, bounding box, text color, and so on:

  1. ContentScanner.TextWrapper represents a text object extracted from the ContentScanner (see the level.getCurrentWrapper() call);
  2. each ContentScanner.TextWrapper contains a list of text chunks (ContentScanner.TextStringWrapper), returned by text.getTextStrings();
  3. each ContentScanner.TextStringWrapper contains a list of text characters (TextChar), returned by textString.getTextChars();
  4. each TextChar provides information about the character state (position and style).

The figure below shows the result of running this code over the Greek translation of the UN Universal Declaration of Human Rights.

Text characters framed within their respective bounding boxes.
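
The relation between character boxes and the box of the string they belong to can be illustrated with plain java.awt.geom classes. The sketch below assumes that a string's bounding box is simply the union of its character boxes; whether PDF Clown computes textString.getBox() exactly this way is an assumption, and the BoxUnion class is a hypothetical name used only here.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of PDF Clown. */
final class BoxUnion {
  /**
   * Returns the smallest rectangle enclosing all the given character boxes,
   * or null if the list is empty.
   */
  static Rectangle2D union(List<? extends Rectangle2D> charBoxes) {
    Rectangle2D result = null;
    for (Rectangle2D box : charBoxes) {
      if (result == null) {
        // Copy the first box so the caller's rectangle is never modified.
        result = (Rectangle2D) box.clone();
      } else {
        // Rectangle2D.add(Rectangle2D) grows the receiver to the union of both.
        result.add(box);
      }
    }
    return result;
  }
}
```

For instance, three character boxes at (10,20) sized 5×12, (15,20) sized 6×12 and (21,18) sized 4×14 yield a string box at (10,18) sized 15×14.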

3. Advanced text extraction

PDF Clown supports a third level of text extraction functionality built upon the others (basic and extended, as seen above): the TextExtractor tool.

Its purpose is to leverage the extended text extraction features for sorting, aggregating and integrating the retrieved text chunks. With TextExtractor you can:

  • extract full text information (text content along with graphic attributes for each single character (font, font size, text color, text rendering mode, text bounding box…)) or just plain text;
  • extract all the text content in a page (or any other IContentContext, such as FormXObject) or filter just partial page areas.

3.1. Plain text extraction

This sample demonstrates how simple it is to extract plain text from a page: after you have instantiated the TextExtractor (new TextExtractor()), it’s just a matter of passing your page to extractPlain(…): nothing but one line of code!

package it.stefanochizzolini.clown.samples;

import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.files.File;
import it.stefanochizzolini.clown.tools.TextExtractor;

import java.util.HashMap;
import java.util.Map;

public class AdvancedPlainTextExtractionSample
  extends Sample
{
  @Override
  public boolean run(
    )
  {
    String filePath = promptPdfFileChoice("Please select a PDF file");

    // 1. Open the PDF file!
    File file;
    try
    {file = new File(filePath);}
    catch(Exception e)
    {throw new RuntimeException(filePath + " file access error.",e);}

    // 2. Get the PDF document!
    Document document = file.getDocument();

    // 3. Extracting plain text from the document pages...
    TextExtractor extractor = new TextExtractor();
    for(Page page : document.getPages())
    {
      if(!prompt(page))
        return false;

      // Extract plain text from the current page!
      System.out.println(extractor.extractPlain(page));
    }

    return true;
  }

  private boolean prompt(
    Page page
    )
  {
    int pageIndex = page.getIndex();
    if(pageIndex > 0)
    {
      Map<String,String> options = new HashMap<String,String>();
      options.put("", "Scan next page");
      options.put("Q", "End scanning");
      if(!promptChoice(options).equals(""))
        return false;
    }

    System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
    return true;
  }
}

3.2. Full text extraction

In this case text content is extracted along with its graphic attributes (font, font size, text color, text rendering mode, text bounding box…).
Note that, as we didn’t specify any particular page area, text strings are all gathered within the default area (the page itself), identified by the null key (see the get(null) call on the extraction result).

package it.stefanochizzolini.clown.samples;

import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.documents.contents.ITextString;
import it.stefanochizzolini.clown.files.File;
import it.stefanochizzolini.clown.tools.TextExtractor;

import java.awt.geom.Rectangle2D;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdvancedTextExtractionSample
  extends Sample
{
  @Override
  public boolean run(
    )
  {
    String filePath = promptPdfFileChoice("Please select a PDF file");

    // 1. Open the PDF file!
    File file;
    try
    {file = new File(filePath);}
    catch(Exception e)
    {throw new RuntimeException(filePath + " file access error.",e);}

    // 2. Get the PDF document!
    Document document = file.getDocument();

    // 3. Extracting text from the document pages...
    TextExtractor extractor = new TextExtractor();
    for(Page page : document.getPages())
    {
      if(!prompt(page))
        return false;

      List<ITextString> textStrings = extractor.extract(page).get(null);
      for(ITextString textString : textStrings)
      {
        Rectangle2D textStringBox = textString.getBox();
        System.out.println(
          "Text ["
            + "x:" + Math.round(textStringBox.getX()) + ","
            + "y:" + Math.round(textStringBox.getY()) + ","
            + "w:" + Math.round(textStringBox.getWidth()) + ","
            + "h:" + Math.round(textStringBox.getHeight())
            + "]: " + textString.getText()
            );
      }
    }

    return true;
  }

  private boolean prompt(
    Page page
    )
  {
    int pageIndex = page.getIndex();
    if(pageIndex > 0)
    {
      Map<String,String> options = new HashMap<String,String>();
      options.put("", "Scan next page");
      options.put("Q", "End scanning");
      if(!promptChoice(options).equals(""))
        return false;
    }

    System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
    return true;
  }
}
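
The null-key convention used above may look odd at first, so here is a small self-contained sketch of the idea built on a plain java.util.HashMap (which accepts null as a key). It only mimics the shape of the extraction result; the AreaKeyedText class and its group(…) method are hypothetical and are not the way TextExtractor is implemented. The assumption is that, when no areas are requested, everything is collected under the null key, otherwise each text is stored under the first area that contains its box.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hypothetical helper, not part of PDF Clown. */
final class AreaKeyedText {
  /**
   * Groups texts by area. boxes.get(i) is the bounding box of texts.get(i).
   * With no areas, all texts go under the null key (the "whole page" area).
   */
  static Map<Rectangle2D, List<String>> group(
      List<? extends Rectangle2D> boxes,
      List<String> texts,
      List<? extends Rectangle2D> areas) {
    Map<Rectangle2D, List<String>> result = new HashMap<>();
    for (int i = 0; i < texts.size(); i++) {
      Rectangle2D key = null; // Default area: the page itself.
      if (!areas.isEmpty()) {
        for (Rectangle2D area : areas) {
          if (area.contains(boxes.get(i))) {
            key = area;
            break;
          }
        }
        if (key == null) {
          continue; // The text lies outside every requested area.
        }
      }
      result.computeIfAbsent(key, k -> new ArrayList<>()).add(texts.get(i));
    }
    return result;
  }
}
```

With an empty area list, group(…).get(null) returns every text, which mirrors the extractor.extract(page).get(null) idiom of the sample above.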

3.3. Page area filtering

Text filtering by page area can be done both before and after extracting the page text:

  • pre-filtering: TextExtractor.getAreas()/setAreas(…) methods allow the user to define the relevant page areas before extracting the text;
  • post-filtering: TextExtractor.filter(…) methods allow the user to select text by area from previously-extracted text (useful in case of multi-stage processing).

The following sample applies text filtering to a common task: retrieving the text associated with the link annotations on a page (see the extractor.filter(textStrings,linkBox) loop). You may not know that text links on PDF pages are just superimposed on the “associated” text, so the location has to be inferred in order to match the position of a link annotation with the respective text (such tough work :-D).

package it.stefanochizzolini.clown.samples;

import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.documents.PageAnnotations;
import it.stefanochizzolini.clown.documents.contents.ITextString;
import it.stefanochizzolini.clown.documents.fileSpecs.FileSpec;
import it.stefanochizzolini.clown.documents.interaction.actions.Action;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToDestination;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToEmbedded;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToNonLocal;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToURI;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToEmbedded.TargetObject;
import it.stefanochizzolini.clown.documents.interaction.annotations.Annotation;
import it.stefanochizzolini.clown.documents.interaction.annotations.Link;
import it.stefanochizzolini.clown.documents.interaction.navigation.document.Destination;
import it.stefanochizzolini.clown.files.File;
import it.stefanochizzolini.clown.objects.PdfObjectWrapper;
import it.stefanochizzolini.clown.tools.TextExtractor;

import java.awt.geom.Rectangle2D;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LinkTextExtractionSample
  extends Sample
{
  @Override
  public boolean run(
    )
  {
    String filePath = promptPdfFileChoice("Please select a PDF file");

    // 1. Open the PDF file!
    File file;
    try
    {file = new File(filePath);}
    catch(Exception e)
    {throw new RuntimeException(filePath + " file access error.",e);}

    // 2. Get the PDF document!
    Document document = file.getDocument();

    // 3. Extracting links text from the document pages...
    TextExtractor extractor = new TextExtractor();
    extractor.setAreaTolerance(2); // 2 pt tolerance on area boundary detection.
    for(Page page : document.getPages())
    {
      if(!prompt(page))
        return false;

      Map<Rectangle2D,List<ITextString>> textStrings = null;

      // Get the page annotations!
      PageAnnotations annotations = page.getAnnotations();
      if(annotations == null)
      {
        System.out.println("No annotations here.");
        continue;
      }

      boolean linkFound = false;
      for(Annotation annotation : annotations)
      {
        if(annotation instanceof Link)
        {
          linkFound = true;

          if(textStrings == null)
          {textStrings = extractor.extract(page);}

          Link link = (Link)annotation;
          Rectangle2D linkBox = link.getBox();
          /*
            Extracting text superimposed by the link...
            NOTE: As links have no strong relation to page text but a weak location correspondence,
            we have to filter extracted text by link area.
          */
          StringBuilder linkTextBuilder = new StringBuilder();
          for(ITextString linkTextString : extractor.filter(textStrings,linkBox))
          {linkTextBuilder.append(linkTextString.getText());}
          System.out.println("Link '" + linkTextBuilder + "' ");
          System.out.println(
            "    Position: "
              + "x:" + Math.round(linkBox.getX()) + ","
              + "y:" + Math.round(linkBox.getY()) + ","
              + "w:" + Math.round(linkBox.getWidth()) + ","
              + "h:" + Math.round(linkBox.getHeight())
              );
          System.out.print("    Target: ");
          PdfObjectWrapper<?> target = link.getTarget();
          if(target instanceof Destination)
          {printDestination((Destination)target);}
          else if(target instanceof Action)
          {printAction((Action)target);}
          else if(target == null)
          {System.out.println("[not available]");}
          else
          {System.out.println("[unknown type: " + target.getClass().getSimpleName() + "]");}
        }
      }
      if(!linkFound)
      {
        System.out.println("No links here.");
        continue;
      }
    }

    return true;
  }

  private void printAction(
    Action action
    )
  {
    System.out.println("Action [" + action.getClass().getSimpleName() + "] " + action.getBaseObject());
    if(action instanceof GoToDestination<?>)
    {
      if(action instanceof GoToNonLocal<?>)
      {
        FileSpec fileSpec = ((GoToNonLocal<?>)action).getFileSpec();
        if(fileSpec != null)
        {System.out.println("    Filename: " + fileSpec.getFilename());}

        if(action instanceof GoToEmbedded)
        {
          TargetObject target = ((GoToEmbedded)action).getTarget();
          System.out.println("    EmbeddedFilename: " + target.getEmbeddedFileName() + " Relation: " + target.getRelation());
        }
      }
      System.out.print("    ");
      printDestination(((GoToDestination<?>)action).getDestination());
    }
    else if(action instanceof GoToURI)
    {System.out.println("    URI: " + ((GoToURI)action).getURI());}
  }

  private void printDestination(
    Destination destination
    )
  {
    System.out.println(destination.getClass().getSimpleName() + " " + destination.getBaseObject());
    System.out.print("    Page ");
    Object pageRef = destination.getPageRef();
    if(pageRef instanceof Page)
    {
      Page refPage = (Page)pageRef;
      System.out.println((refPage.getIndex()+1) + " [ID: " + refPage.getBaseObject() + "]");
    }
    else
    {System.out.println(((Integer)pageRef+1));}
  }

  private boolean prompt(
    Page page
    )
  {
    int pageIndex = page.getIndex();
    if(pageIndex > 0)
    {
      Map<String,String> options = new HashMap<String,String>();
      options.put("", "Scan next page");
      options.put("Q", "End scanning");
      if(!promptChoice(options).equals(""))
        return false;
    }

    System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
    return true;
  }
}
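
The area matching at the heart of the sample above can be sketched independently of the library. The following code assumes that a text string belongs to an area when its box is fully contained in the area once the area has been enlarged by a tolerance on every side; whether TextExtractor.filter(…) and setAreaTolerance(…) follow exactly this rule is an assumption, and the AreaFilter and PlacedText names are hypothetical.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of PDF Clown. */
final class AreaFilter {
  /** A piece of text together with its bounding box on the page. */
  static final class PlacedText {
    final Rectangle2D box;
    final String text;

    PlacedText(Rectangle2D box, String text) {
      this.box = box;
      this.text = text;
    }
  }

  /**
   * Keeps only the texts whose box lies inside the area, after growing the
   * area by the given tolerance (in points) on each side.
   */
  static List<PlacedText> filter(List<PlacedText> texts, Rectangle2D area, double tolerance) {
    Rectangle2D grown = new Rectangle2D.Double(
        area.getX() - tolerance,
        area.getY() - tolerance,
        area.getWidth() + 2 * tolerance,
        area.getHeight() + 2 * tolerance);

    List<PlacedText> matches = new ArrayList<>();
    for (PlacedText candidate : texts) {
      // Rectangle2D.contains(Rectangle2D) is true only for full containment.
      if (grown.contains(candidate.box)) {
        matches.add(candidate);
      }
    }
    return matches;
  }
}
```

The tolerance matters because link rectangles rarely coincide exactly with the glyph boxes they cover: with a tolerance of 0, a text box that overshoots the link area by a single point would be missed.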

20 thoughts on “PDF Clown 0.0.8 — Text extraction

    1. Hi Tina,

      LinkTextExtractionSample has been consolidated into LinkParsingSample: looking in the latter sample you can find exactly the same code for extracting text associated to links.

      You’re welcome! 😉
      Stefano

      1. Thanks for your reply, Stefano!

        There is an error: “Header checksum illegal” when I try to open a PDF file.
        Could you please help how to handle it?

        Thanks in advance,
        Tina

  1. thanks for the reply, I have already extracted texts and images by using your PDF Clown library. Can i get the detailed steps about how to develop such tool. thank you

    1. sorry, I have no time to reason about implementation details outside the current project evolution; I can just suggest you to see the way TextExtractor class accesses the content stream (see also ContentScanner class). If you have adequate skills, you can get it by yourself — good luck! 😉

    1. Hi,
      there’s no “table” concept in PDF file format, as its vectorial grammar is made just of simple primitives dealing with paths (i.e. lines, curves, font outlines…) and sampled content (i.e. bitmap images).

      So: no strong table entity, but… good heuristics can detect the weak presence of a so-called “table” representation (i.e., typically, crossing lines intermingled with contents).

      PDF Clown gives you all the power to develop such an extraction tool (see the documents.contents.* subnamespaces within the library [1]), but table extraction hasn’t been implemented yet.

      [1] http://clown.sourceforge.net/API/

  2. Awesome work!
    I’m new to PDF Clown and excited to start using it.

    Here it is my use case:
    I would like to use your new scanner [TextExtractor] to find some particular text on a PDF page and change that text background to – say – yellow, as to have a highlighter effect. Is that possible?

    Thank you!

    1. Hi Paulo,

      you may either
      * pass to your client (typically Adobe Acrobat) a pseudo-xml file conforming to the Adobe’s Highlight spec
      or
      * impress the highlighting directly into your PDF file.

      The former approach seems “lighter” as it doesn’t imply modifying your PDF file, but requires your clients to support such a protocol (typically through a dedicated plug-in… not so portable, I think); furthermore, its definition is somewhat fuzzy as it describes highlighted contents by character/word offsets (this is not so obvious in an unstructured format such as PDF is).

      The latter approach requires modifying your PDF file, but offers more reliable results, as your clients aren’t forced to run extra components; you also have full control over the location and appearance of your highlighting. This can be obtained in at least two ways: text markup annotations (‘Highlight’ markup type) and layered stamping (by the way, there’s no concept of “background” in PDF text objects: you just have to draw bare colored boxes behind your text glyphs). Describing all the implementation details here would be overwhelming; nonetheless it’s a much-requested topic, so I’m considering publishing a working sample about it.

      Keep experimenting!

      [April 16, 2011 update] PDF Clown 0.1.1 is going to feature a working text highlighting sample!

      1. Fantastic !
        Thanks a lot for your explanation, learning so much here!
        Looking forward to your sample !

  3. Is there any complete documentation to know how to work with PDF Clown?
    I am also waiting to see its new version.
    thanks

    1. The User Guide (included in the downloadable distribution) provides an architectural overview of the library (which is really valuable in order to understand its fundamentals); the downloadable distribution is also packed with lots of reusable, in-line commented, working code samples.

      I completed the 0.0.8 Java version at the end of May; now I’m involved in the final debugging stage of the 0.0.8 C#/.NET version.
      My previous release estimation was, unfortunately, too optimistic (real life was much more demanding than expected); just consider that I’m currently applying as much effort as I can…

      I can anticipate that 0.0.8 version is going to migrate to Java 6 (instead of outdated Java 5) and .NET 3.5 (instead of outdated .NET 2.0); licensing conditions will be upgraded to LGPL 3.

  4. Very excited of the text extraction features. I hope the features on the Greek translation will also be available in the 0.0.8 version… keep it up! Good job.

    1. Sure, StringMan: extended text extraction features (character position and style) will be ready for 0.0.8; maybe I’ll postpone some advanced TextExtractor features such as column detection…

    1. Hi, I’m entering the final revision stage (minor font management and TextExtractor refinements) — it’s a matter of some weeks.

      I’m considering moving some enhancement activities to the following release (0.0.9) in order to speed up the currently scheduled timeline.
