Features Overview

Creating PDF Documents

Using PDF Clown, you can create PDF documents through an object-oriented model that gives you full control over the definition of both contents and metadata from any data source.

Adding contents to a document can be accomplished through multiple abstraction levels:

  • low-level content-stream model (imperative): direct insertion of native PDF graphics instructions. For example:
        // 1. Instantiate a new PDF document!
        Document document = new File().getDocument();
    
        // 2. Add a page to the document!
        Page page = new Page(document); // Instantiates the page inside the document context.
        document.getPages().add(page); // Puts the page in the pages collection (you may choose an arbitrary position).
    
        // 3. Create a content composer for the page!
        PrimitiveComposer composer = new PrimitiveComposer(page);
    
        // 4. Add contents through the composer!
        composer.setFont(new StandardType1Font(document, StandardType1Font.FamilyEnum.Courier, true, false), 32);
        composer.showText("Hello World!", new Point2D.Double(32,48));
    
        // 5. Flush the contents into the page!
        composer.flush();
    
        // 6. Save the document!
        document.getFile().save(myFilePath, SerializationModeEnum.Standard);
    

    In case of advanced needs, you can directly access the content stream object tree through the ContentScanner backing the PrimitiveComposer, and add/remove/modify the content objects (instructions and containers) which define that content stream. For example:

        // Get the scanner which the composer is based on!
        ContentScanner scanner = composer.getScanner();
    
        // Get the current transformation matrix (which is part of the content stream's graphics state)!
        AffineTransform ctm = scanner.getState().getCtm();
    
        // Get the content stream this scanner is working on!
        Contents contents = scanner.getContents();
    
        // Get the context of this content stream (typically a page)!
        IContentContext contentContext = contents.getContentContext();
        if(contentContext instanceof Page) {
          . . .
        }
    
        // Get the resources (fonts, color spaces, ...) associated to this content stream!
        Resources contentResources = contentContext.getResources();
    
  • high-level composition model (declarative): definition of richly-styled high-level elements (based on a subset of HTML+CSS3) like paragraphs, titles, tables, lists, images, etc. (suitable for creating any kind of document like manuals, books, reports, etc.). For example:
        // 1. Instantiate a new PDF document!
        Document document = new File().getDocument();
    
        // 2. Create a content composer for the document!
        DocumentComposer composer = new DocumentComposer(document);
    
        // 3. Add contents through the composer!
        Paragraph paragraph = new Paragraph("Hello World!");
        composer.show(paragraph);
    
        // 4. End the composition!
        composer.close();
    
        // 5. Save the document!
        document.getFile().save(myFilePath, SerializationModeEnum.Standard);
    

Behind the scenes, both the content-stream and composition models are built on implementations of a common Visitor pattern that feeds the actual page content stream; this design leaves ample room to address specific custom needs.
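
To make the Visitor idea more concrete, here is a minimal, library-independent sketch. It assumes nothing about PDF Clown's internals: every type name in it (ContentVisitor, ContentElement, TextElement, LineElement, StreamWriterVisitor) is hypothetical, and the emitted operators are a simplified subset of PDF content stream syntax.

```java
import java.util.Arrays;
import java.util.List;

// All names below are hypothetical and only illustrate the pattern;
// they are not PDF Clown types.
interface ContentVisitor {
  void visitText(TextElement text);
  void visitLine(LineElement line);
}

interface ContentElement {
  void accept(ContentVisitor visitor);
}

final class TextElement implements ContentElement {
  final String value;
  final int x, y;

  TextElement(String value, int x, int y) {
    this.value = value;
    this.x = x;
    this.y = y;
  }

  @Override
  public void accept(ContentVisitor visitor) { visitor.visitText(this); }
}

final class LineElement implements ContentElement {
  final int x1, y1, x2, y2;

  LineElement(int x1, int y1, int x2, int y2) {
    this.x1 = x1;
    this.y1 = y1;
    this.x2 = x2;
    this.y2 = y2;
  }

  @Override
  public void accept(ContentVisitor visitor) { visitor.visitLine(this); }
}

// Visitor that turns each element into simplified PDF content operators
// (font selection and string escaping are deliberately left out).
final class StreamWriterVisitor implements ContentVisitor {
  private final StringBuilder stream = new StringBuilder();

  @Override
  public void visitText(TextElement text) {
    stream.append("BT ").append(text.x).append(' ').append(text.y)
          .append(" Td (").append(text.value).append(") Tj ET\n");
  }

  @Override
  public void visitLine(LineElement line) {
    stream.append(line.x1).append(' ').append(line.y1).append(" m ")
          .append(line.x2).append(' ').append(line.y2).append(" l S\n");
  }

  String result() { return stream.toString(); }
}

class VisitorSketch {
  // Feeds every element to one visitor and returns the accumulated stream.
  static String render(List<ContentElement> elements) {
    StreamWriterVisitor visitor = new StreamWriterVisitor();
    for (ContentElement element : elements) {
      element.accept(visitor);
    }
    return visitor.result();
  }

  public static void main(String[] args) {
    System.out.print(render(Arrays.asList(
        new TextElement("Hello World!", 32, 48),
        new LineElement(0, 0, 100, 0))));
  }
}
```

The point of the design is that new output targets (a different serializer, a measuring pass, a validator) can be added as new visitors without touching the element classes.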

Editing PDF Documents

PDF Clown exposes a fully-object-oriented multi-layered representation of PDF documents at file, document and content stream levels.

For example, you can work on a Page object this way:

    // Open an existing PDF document!
    Document document = new File(myFilePath).getDocument();

    // Get the first page!
    Page page = document.getPages().get(0);

    // Get the data structure backing the page!
    PdfDictionary pageDictionary = page.getBaseDataObject();

    // Get the content stream of the page!
    Contents pageContents = page.getContents();

PDF Clown’s fine-grained access to PDF content streams allows you to edit each and every content object (graphics instructions and containers) along with its graphics state through a cursor-based scanner (ContentScanner).
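
The following is a rough, library-independent sketch of what a cursor-based scanner does: it steps through operations one at a time while keeping the graphics state (here only the current transformation matrix) in sync. The Operation and CursorScanner types are hypothetical and much simpler than PDF Clown's ContentScanner; only the PDF operators q, Q and cm are interpreted, and the reported state is the one in effect after the current operation has been applied.

```java
import java.awt.geom.AffineTransform;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical types; this is not PDF Clown's ContentScanner.
final class Operation {
  final String operator;
  final double[] operands;

  Operation(String operator, double... operands) {
    this.operator = operator;
    this.operands = operands;
  }
}

final class CursorScanner {
  private final List<Operation> operations;
  private final Deque<AffineTransform> savedStates = new ArrayDeque<>();
  private AffineTransform ctm = new AffineTransform(); // identity
  private int index = -1;

  CursorScanner(List<Operation> operations) {
    this.operations = operations;
  }

  // Advances the cursor by one operation and applies its effect on the
  // graphics state; returns false once the end of the stream is reached.
  boolean moveNext() {
    if (index + 1 >= operations.size()) {
      return false;
    }
    Operation op = operations.get(++index);
    switch (op.operator) {
      case "q": // save graphics state
        savedStates.push(new AffineTransform(ctm));
        break;
      case "Q": // restore graphics state
        if (!savedStates.isEmpty()) {
          ctm = savedStates.pop();
        }
        break;
      case "cm": // concatenate matrix [a b c d e f] to the CTM
        ctm.concatenate(new AffineTransform(op.operands));
        break;
      default: // other operators leave the CTM untouched
        break;
    }
    return true;
  }

  Operation current() { return operations.get(index); }

  AffineTransform getCtm() { return new AffineTransform(ctm); }
}

class ScannerSketch {
  public static void main(String[] args) {
    CursorScanner scanner = new CursorScanner(Arrays.asList(
        new Operation("q"),
        new Operation("cm", 1, 0, 0, 1, 10, 20), // translate by (10, 20)
        new Operation("cm", 2, 0, 0, 2, 0, 0),   // scale by 2
        new Operation("Tj"),
        new Operation("Q")));
    while (scanner.moveNext()) {
      System.out.println(scanner.current().operator + " -> " + scanner.getCtm());
    }
  }
}
```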

Besides the native PDF dynamic content reuse mechanism (indirect references), PDF Clown can import (static content reuse) any content from one PDF file into another at any abstraction level (primitive PDF objects, content chunks, whole pages) through the so-called contextual cloning mechanism. Object tree traversal can be customized to selectively filter the cloned objects. For example:

    // 1. Open the PDF files!
    Document targetDocument = new File(myTargetFilePath).getDocument();
    Document sourceDocument = new File(mySourceFilePath).getDocument();

    // 2. Append the first page of the source document into the target document!
    targetDocument.getPages().add(sourceDocument.getPages().get(0).clone(targetDocument));

    // 3. Save the target file!
    targetDocument.getFile().save();
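
As a conceptual illustration of cloning with a customizable traversal filter, the sketch below deep-copies a small object tree into another owner while skipping rejected subtrees. It is not PDF Clown's cloning API: the Node type, its owner field and the cloneInto method are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model: a tree of named objects, each owned by one file.
final class Node {
  final String name;
  final String owner;
  final List<Node> children = new ArrayList<>();

  Node(String name, String owner) {
    this.name = name;
    this.owner = owner;
  }

  Node add(Node child) {
    children.add(child);
    return this;
  }

  // Deep-copies this subtree into targetOwner; children rejected by the
  // filter are skipped together with everything below them.
  Node cloneInto(String targetOwner, Predicate<Node> filter) {
    Node copy = new Node(name, targetOwner);
    for (Node child : children) {
      if (filter.test(child)) {
        copy.children.add(child.cloneInto(targetOwner, filter));
      }
    }
    return copy;
  }
}

class CloneSketch {
  public static void main(String[] args) {
    Node page = new Node("Page", "source.pdf")
        .add(new Node("Contents", "source.pdf"))
        .add(new Node("Annots", "source.pdf"));

    // Import the page into the target file, leaving annotations behind.
    Node imported = page.cloneInto("target.pdf", n -> !n.name.equals("Annots"));
    System.out.println(imported.owner + " has " + imported.children.size() + " child(ren)");
  }
}
```

The source tree is left untouched, which is what distinguishes this kind of static reuse from sharing objects by reference.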

Reading PDF Documents

PDF Clown exposes the entire PDF document structure as a traversable model, allowing powerful data mining such as advanced text extraction (full positional and style information), image extraction, metadata extraction and form data extraction. No limits to the ways you can analyze the contents of your documents!

Rendering PDF Documents

PDF Clown has introduced an experimental prototype for content rasterization and printing that will be progressively expanded in upcoming releases. For example:

    // 1. Open the PDF file!
    File file = new File(myFilePath);
    Document document = file.getDocument();

    // 2. Rasterize the first page!
    Renderer renderer = new Renderer();
    BufferedImage image = renderer.render(document.getPages().get(0), new Dimension(1400, 850));

    // 3. Save the rasterized image!
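    // (assuming myOutputPath is a String path; java.io.File is fully qualified to avoid clashing with PDF Clown's File)
    ImageIO.write(image, "jpg", new java.io.File(myOutputPath));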
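
The saving step relies only on the standard Java imaging API, so it can be tried in isolation. The sketch below is an assumption-laden stand-in: RasterSaveSketch, blankPage and save are hypothetical helpers, and a plain white image replaces the page rendered by PDF Clown. It shows that ImageIO.write expects a java.io.File (or a stream) rather than a path string, and that its boolean result is worth checking, since it returns false when no suitable writer is found.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

class RasterSaveSketch {
  // Stand-in for a rendered page: an opaque RGB image filled with white.
  // (JPEG has no alpha channel, so an RGB image type is the safe choice.)
  static BufferedImage blankPage(int width, int height) {
    BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
    for (int y = 0; y < height; y++) {
      for (int x = 0; x < width; x++) {
        image.setRGB(x, y, 0xFFFFFF);
      }
    }
    return image;
  }

  // Writes the image as JPEG to the given path and returns the target file.
  static File save(BufferedImage image, String outputPath) throws IOException {
    File target = new File(outputPath);
    if (!ImageIO.write(image, "jpg", target)) {
      throw new IOException("No JPEG writer available for this image");
    }
    return target;
  }

  public static void main(String[] args) throws IOException {
    File temp = File.createTempFile("page", ".jpg");
    File saved = save(blankPage(1400, 850), temp.getPath());
    System.out.println("Saved " + saved.length() + " bytes to " + saved);
  }
}
```

When this is combined with PDF Clown code in the same source file, java.io.File has to be written fully qualified wherever PDF Clown's own File class is imported.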

Feature Summary

This is a high-level representation of the functionalities supported (✓), partially supported (*) or planned to be supported (✗) by PDF Clown; see the Development Status page if you are interested in an analytical, spec-based description of the current implementation.

Content Creation & Editing

  • Document Assembly (document split and merge, page combination and removal)
  • Page Content Creation
  • Page Content Modification and Removal
  • Page Content Reuse
  • Actions (JavaScript, page transitions, go-to actions, etc.)
  • Annotations (file attachments, links, notes, text markup, multimedia, etc.)
  • Barcodes (EAN-13)
  • Color Spaces (device color spaces: RGB, CMYK, Gray)
  • Font Embedding (OpenType/TrueType)
  • Form Creation, Modification and Removal
  • Form Filling
  • Form Flattening
  • Image Embedding (JPG)
  • Layers
  • Page Formats (ISO (A, B, C) series, ANSI series, Architectural series, traditional North American sizes (Letter, Legal, Executive, Statement, Tabloid))
  • Watermarking and Content Stamping
  • Font Subsetting
  • Form Data Import/Export

Content Reading

  • Annotations Extraction
  • Attachments Extraction
  • Form Data Extraction
  • Images Extraction
  • Metadata Extraction
  • Text Extraction (full positional and style information)

Content Rendering

  • * Print
  • * Rasterization (i.e. page-to-image rendering)

File Structure and I/O

  • File Compression
  • Memory Buffer I/O (in-memory file read/write without accessing secondary storage (disk))
  • File Linearization

Security

  • Digital Signatures
  • File Encryption and Permissions

Optimization

  • Compact File Serialization (removal of older object revisions from incrementally-updated files)
  • Unused Objects Removal (removal of orphaned objects, i.e. objects without live references)

4 thoughts on “Features”

  1. I am searching for a tool capable of extracting specific content from PDF files (e.g. some information between pages 5 and 10): can this tool do that?

    1. To ensure the sustainable development of advanced features like Digital Signatures, I’m considering binding them to crowdfunding goals: their availability as LGPLed features will depend on the success of those campaigns (users should take the responsibility to support the project; there’s no point in demanding all the bells & whistles without any serious commitment).

      The features of the current dev cycle (0.2.0) were defined last year and are about to be completed; the next dev cycle (due to begin next autumn) will be influenced by the preferences expressed in the Features Poll currently appearing on this site.
