PDF Clown 0.1.1 — Text highlighting and lots of good stuff

LATEST NEWS — On November 14, 2011 PDF Clown 0.1.1 has been released!

Next release is going to introduce new exciting features (text highlighting, optional/layered contents, Type1/CFF font support, etc.) along with improvements and consolidations of existing ones (enhanced text extraction, enhanced content rendering, enhanced acroform creation and filling, etc.). This post will be kept updated according to development progress, so please stay tuned! 😉
These are some of the things I have been working on till now:

Primitive object model enhancements
Text highlighting
Metadata streams (XMP)
Optional/layered contents
AcroForm fields filling

1. Primitive object model enhancements

PDF primitive object model (see org.pdfclown.objects namespace) has undergone a substantial revision in order to simplify its use (transparent update), extend its functionality (bidirectional traversal), enforce its consistency (simple object immutability) and consolidate its code base (parser classes refactoring).

Bidirectional traversal has been accomplished by the introduction of explicit references to ascendants: composite objects (PdfDictionary, PdfArray, PdfStream) are now aware of their parent container, so walking through the ascending path to the root PdfIndirectObject (and File) is absolutely trivial! This functionality has loads of engaging potential applications, such as fine-grained object cloning based on structure context (as in case of Acroform annotations residing on a given page).

Ascendant-aware objects are intelligent enough to automatically detect and notify changes to their parent container, making incremental updates transparent to the user.

Simple objects have been made immutable to avoid risks of unintended changes and promote their efficient reuse.

As expected (you may have noticed some TODO task comments about this within the project’s code base), object parsing of PostScript-related formats (PDF file, PDF content stream and CMaps) has been organized under the same class hierarchy to improve its consistency and maintainability.

2. Text highlighting

Text highlighting was a much-requested feature. It took me less than one hour of enjoyable coding to write a prototype which could populate a PDF file with highlight annotations matching an arbitrary text pattern, as you can see in the following figure representing a page of Alice in Wonderland resulting from the search of “rabbit” occurrences:

This text highlighting sample leverages both text extraction [line 55] and annotation [line 106] functionalities of PDF Clown, as you can see in its source code:

package org.pdfclown.samples.cli;

import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;
import org.pdfclown.files.File;
import org.pdfclown.tools.TextExtractor;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;

/**
  This sample demonstrates how to highlight text matching arbitrary patterns.
  Highlighting is defined through text markup annotations.

  @author Stefano Chizzolini (http://www.stefanochizzolini.it)
  @since 0.1.1
  @version 0.1.1
*/
public class TextHighlightSample
  extends Sample
{
  @Override
  public boolean run(
    )
  {
    String filePath = promptPdfFileChoice("Please select a PDF file");

    // 1. Open the PDF file!
    File file;
    try
    {file = new File(filePath);}
    catch(Exception e)
    {throw new RuntimeException(filePath + " file access error.",e);}

    // Define the text pattern to look for!
    String textRegEx = promptChoice("Please enter the pattern to look for: ");
    Pattern pattern = Pattern.compile(textRegEx, Pattern.CASE_INSENSITIVE);

    // 2. Iterating through the document pages...
    TextExtractor textExtractor = new TextExtractor(true, true);
    for(final Page page : file.getDocument().getPages())
    {
      System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");

      // 2.1. Extract the page text!
      Map textStrings = textExtractor.extract(page);

      // 2.2. Find the text pattern matches!
      final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));

      // 2.3. Highlight the text pattern matches!
      textExtractor.filter(
        textStrings,
        new TextExtractor.IIntervalFilter()
        {
          @Override
          public boolean hasNext()
          {return matcher.find();}

          @Override
          public Interval next()
          {return new Interval(matcher.start(), matcher.end());}

          @Override
          public void process(
            Interval interval,
            ITextString match
            )
          {
            // Defining the highlight box of the text pattern match...
            List highlightQuads = new ArrayList();
            {
              /*
                NOTE: A text pattern match may be split across multiple contiguous lines,
                so we have to define a distinct highlight box for each text chunk.
              */
              Rectangle2D textBox = null;
              for(TextChar textChar : match.getTextChars())
              {
                Rectangle2D textCharBox = textChar.getBox();
                if(textBox == null)
                {textBox = (Rectangle2D)textCharBox.clone();}
                else
                {
                  if(textCharBox.getY() > textBox.getMaxY())
                  {
                    highlightQuads.add(Quad.get(textBox));
                    textBox = (Rectangle2D)textCharBox.clone();
                  }
                  else
                  {textBox.add(textCharBox);}
                }
              }
              highlightQuads.add(Quad.get(textBox));
            }
            // Highlight the text pattern match!
            new TextMarkup(page, MarkupTypeEnum.Highlight, highlightQuads);
          }

          @Override
          public void remove()
          {throw new UnsupportedOperationException();}
        }
        );
    }

    // 3. Highlighted file serialization.
    serialize(file, false);

    return true;
  }
}

This is another example matching words which contain “co” (regular expression “\w*co\w*”):

Here you can appreciate the dehyphenation functionality applied to another search (words beginning with “devel” — regular expression “\bdevel\w*”):

3. Metadata streams (XMP)

XMP metadata streams are now available for reading and writing on any dictionary or stream entity within a PDF document (see PdfObjectWrapper.get/setMetadata()).

4. Optional/Layered contents

Smoothing out some PDF spec awkwardness while implementing the content layer (aka optional content) functionality proved to be an interesting challenge. The result was nothing but satisfaction: a clean, intuitive and rich programming interface which automates lots of annoying housekeeping tasks and lets you access even the whole raw structures in case of special needs! 😎

The figure above represents a document generated by the following code sample; for the sake of comparison, I took an iText example and translated it to PDF Clown, adding some niceties like the cooperation between the PrimitiveComposer (whose lower-level role is graphics composition through primitive operations like showing text lines and drawing shapes) and the BlockComposer (whose higher-level role is to arrange text within page areas managing alignments, paragraph spacing and indentation, hyphenation, and so on).

package org.pdfclown.samples.cli;

import java.awt.Dimension;
import java.awt.Point;
import java.awt.Rectangle;

import org.pdfclown.documents.Document;
import org.pdfclown.documents.Document.PageModeEnum;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.composition.AlignmentXEnum;
import org.pdfclown.documents.contents.composition.AlignmentYEnum;
import org.pdfclown.documents.contents.composition.BlockComposer;
import org.pdfclown.documents.contents.composition.PrimitiveComposer;
import org.pdfclown.documents.contents.fonts.StandardType1Font;
import org.pdfclown.documents.contents.layers.Layer;
import org.pdfclown.documents.contents.layers.Layer.ViewStateEnum;
import org.pdfclown.documents.contents.layers.LayerDefinition;
import org.pdfclown.documents.contents.layers.LayerGroup;
import org.pdfclown.documents.contents.layers.Layers;
import org.pdfclown.files.File;

/**
  This sample demonstrates how to define layers to control content visibility.

  @author Stefano Chizzolini (http://www.stefanochizzolini.it)
  @since 0.1.1
  @version 0.1.1
*/
public class LayerCreationSample
  extends Sample
{
  @Override
  public boolean run(
    )
  {
    // 1. PDF file instantiation.
    File file = new File();
    Document document = file.getDocument();

    // 2. Content creation.
    populate(document);

    // 3. Serialize the PDF file!
    serialize(file, false, "Layer", "inserting layers");

    return true;
  }

  private void populate(
    Document document
    )
  {
    // Initialize a new page!
    Page page = new Page(document);
    document.getPages().add(page);

    // Initialize the primitive composer (within the new page context)!
    PrimitiveComposer composer = new PrimitiveComposer(page);
    composer.setFont(new StandardType1Font(document, StandardType1Font.FamilyEnum.Helvetica, true, false), 12);

    // Initialize the block composer (wrapping the primitive one)!
    BlockComposer blockComposer = new BlockComposer(composer);

    // Initialize the document layer configuration!
    LayerDefinition layerDefinition = new LayerDefinition(document); // Creates the document layer configuration.
    document.setLayer(layerDefinition); // Activates the document layer configuration.
    document.setPageMode(PageModeEnum.Layers); // Shows the layers tab on document opening.

    // Get the root layers collection!
    Layers rootLayers = layerDefinition.getLayers();

    // 1. Nested layers.
    {
      Layer nestedLayer = new Layer(document, "Nested layer");
      rootLayers.add(nestedLayer);
      Layers nestedSubLayers = nestedLayer.getLayers();

      Layer nestedLayer1 = new Layer(document, "Nested layer 1");
      nestedSubLayers.add(nestedLayer1);

      Layer nestedLayer2 = new Layer(document, "Nested layer 2");
      nestedSubLayers.add(nestedLayer2);
      nestedLayer2.setLocked(true);

      // NOTE: Text in this section is shown using PrimitiveComposer.
      composer.beginLayer(nestedLayer);
      composer.showText(nestedLayer.getTitle(), new Point(50, 50));
      composer.end();

      composer.beginLayer(nestedLayer1);
      composer.showText(nestedLayer1.getTitle(), new Point(50, 75));
      composer.end();

      composer.beginLayer(nestedLayer2);
      composer.showText(nestedLayer2.getTitle(), new Point(50, 100));
      composer.end();
    }

    // 2. Simple group (labeled group of non-nested, inclusive-state layers).
    {
      Layers simpleGroup = new Layers(document, "Simple group");
      rootLayers.add(simpleGroup);

      Layer layer1 = new Layer(document, "Grouped layer 1");
      simpleGroup.add(layer1);

      Layer layer2 = new Layer(document, "Grouped layer 2");
      simpleGroup.add(layer2);

      // NOTE: Text in this section is shown using BlockComposer along with PrimitiveComposer
      // to demonstrate their flexible cooperation.
      blockComposer.begin(new Rectangle(50, 125, 200, 50), AlignmentXEnum.Left, AlignmentYEnum.Middle);

      composer.beginLayer(layer1);
      blockComposer.showText(layer1.getTitle());
      composer.end();

      blockComposer.showBreak(new Dimension(0, 15));

      composer.beginLayer(layer2);
      blockComposer.showText(layer2.getTitle());
      composer.end();

      blockComposer.end();
    }

    // 3. Radio group (labeled group of non-nested, exclusive-state layers).
    {
      Layers radioGroup = new Layers(document, "Radio group");
      rootLayers.add(radioGroup);

      Layer radio1 = new Layer(document, "Radiogrouped layer 1");
      radioGroup.add(radio1);
      radio1.setViewState(ViewStateEnum.On);

      Layer radio2 = new Layer(document, "Radiogrouped layer 2");
      radioGroup.add(radio2);
      radio2.setViewState(ViewStateEnum.Off);

      Layer radio3 = new Layer(document, "Radiogrouped layer 3");
      radioGroup.add(radio3);
      radio3.setViewState(ViewStateEnum.Off);

      // Register this option group in the layer configuration!
      LayerGroup options = new LayerGroup(document);
      options.add(radio1);
      options.add(radio2);
      options.add(radio3);
      layerDefinition.getOptionGroups().add(options);

      // NOTE: Text in this section is shown using BlockComposer along with PrimitiveComposer
      // to demonstrate their flexible cooperation.
      blockComposer.begin(new Rectangle(50, 185, 200, 75), AlignmentXEnum.Left, AlignmentYEnum.Middle);

      composer.beginLayer(radio1);
      blockComposer.showText(radio1.getTitle());
      composer.end();

      blockComposer.showBreak(new Dimension(0, 15));

      composer.beginLayer(radio2);
      blockComposer.showText(radio2.getTitle());
      composer.end();

      blockComposer.showBreak(new Dimension(0, 15));

      composer.beginLayer(radio3);
      blockComposer.showText(radio3.getTitle());
      composer.end();

      blockComposer.end();
    }
    composer.flush();
  }
}

Some comments on the code:

document layer configuration initialization [lines 68-69]: this is the first operation to do;
layer creation [line 77] and insertion [line 78] into the hierarchical structure;
sublayer insertion [line 82];
content layering [lines 89, 91]: content is enclosed within a layer section, making its visibility dependent on the layer state. There’s a subtle discrepancy in the PDF spec when it comes to nested layers: one may assume they imply a hierarchical dependency of the sublayer states, but that’s NOT the case — if you hide a layer its descendants are still visible! To work around this counterintuitive behaviour, many software toolkits wrap contents within multiple nested layer blocks; for example, if you want to wrap the text “nested layer 1” into a layer (resource name /Pr2) which is a sublayer of another one (resource name /Pr1), the content stream will contain this cumbersome syntax:
4 0 obj << /Length 205 >> stream [...] /OC /Pr1 BDC /OC /Pr2 BDC q BT 1 0 0 1 100 800 Tm /F1 12 Tf (nested layer 1)Tj ET Q EMC EMC [...] endstream endobj
This beast is repeated as many times as there are distinct content chunks to include within the same layer; it goes even worse as the number of nesting levels increases — just awful! 😯 Instead of this, PDF Clown defines a default hierarchical membership for each layer which can be used as a single, terse wrapping block (resource name /Pr2):
4 0 obj << /Length 185 >> stream [...] /OC /Pr2 BDC q BT 1 0 0 1 100 800 Tm /F1 12 Tf (nested layer 1)Tj ET Q EMC [...] endstream endobj 6 0 obj << /Type /Pages /Count 1 /Resources << /Font 7 0 R /Properties 15 0 R >> /Kids [5 0 R ] >> endobj 15 0 obj << /Pr2 16 0 R >> endobj
16 0 obj << /Type /OCMD /OCGs [12 0 R 11 0 R ] /P /AllOn >> % Membership containing the references to the layers belonging to the hierarchical path of nested layer 1. endobj
This way code is concise and more maintainable (if you want to rearrange the hierarchical structure of the layers you don’t have to walk through the content stream hunting layer block occurrences for correction — just go to the membership associated to the layer and update its hierarchical path!). 🙂
simple layer group creation and insertion [lines 104-105]
option group definition [lines 148-152]

5. AcroForm fields filling

Text fields have been enhanced to support automatic appearance update on value change.

4 thoughts on “PDF Clown 0.1.1 — Text highlighting and lots of good stuff”

Adi says:

April 9, 2015 at 9:52 am

Hi,
how can I highlight words by start and end index instead of regex matching (I may do not want to highlight at some places)?

1. stechio says:
  
  April 9, 2015 at 12:11 pm
  
  Hi Adi,
  if you carefully examined the code sample above, it should be quite easy to identify the right spot where to place your logic: hasNext() and next() methods of the extraction iterator (TextExtractor.IIntervalFilter) define the matching intervals. As shown in the sample, in order to search through the extracted text you have to consolidate the chunks in a big string (TextExtractor.toString(textStrings)), then you can find your matchings by regex or plain indexOf() or whatsoever you want.
  
  In my opinion, getting rid of the regex (apparently) doesn’t make sense, as in any case (even if you are looking for simple, literal matchings without wildcards) you are coping with unstructured textual data.
  
Jan says:

August 9, 2011 at 7:53 am
Looks like a great library for me.

However when I try to convert a pdf page to an image (using the dotNet assembly) it gives the following error:

at System.Drawing.Pen.set_DashPattern(Single[] value) at org.pdfclown.documents.contents.objects.PaintPath.GetStroke(GraphicsState state) at org.pdfclown.documents.contents.objects.PaintPath.Scan(GraphicsState state) at org.pdfclown.documents.contents.ContentScanner.MoveNext() at org.pdfclown.documents.contents.ContentScanner.Render(Graphics renderContext, SizeF renderSize, GraphicsPath renderObject)

How can that be solved?
1. stechio says:
  
  August 9, 2011 at 8:33 pm
  
  hi,
  PDF Clown rasterization capabilities are currently at alpha stage — as thoroughly documented, no assumption can be done about its effectiveness at the moment (you can solve it contributing code for its implementation ;-)).
  
  PS: please report any bug to the project’s tracker.