Strictly adhering to the PDF standard, PDF Clown features a layered model that progressively hides syntactic details (xref tables, streams, dictionaries, arrays, etc.) and reveals semantic entities (pages, bookmarks, fonts, etc.), providing robust and convenient access to the intricacies of the format:
- Byte layer: raw PDF file
- Token layer: lexical interpretation (parsing) of the PDF file
- Object layer: data structures emerging from token aggregation (COS Model)
- File layer: syntactic representation of the PDF file (macro-structure constituted by a sequence of indirect objects)
- Document layer: semantic representation of the PDF file (catalog object tree)
As shown by the diagram above, both the File Layer (through the PdfIndirectObject class) and the Document Layer (through the PdfObjectWrapper class) are built on the Object Layer. Despite the complexity of the PDF specification, PDF Clown rests on just two Object Layer classes (see the diagram below):
- PdfObject (root of the COS Model, which encompasses the low-level building blocks of PDF such as PdfIndirectObject, PdfDictionary, PdfArray, PdfStream, PdfName and so on)
- PdfObjectWrapper (root of the Document Model, which encompasses the high-level entities such as Document, Page, Contents, Annotation, Bookmark, Font and so on).
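The split between the two root classes can be illustrated with a minimal sketch. It is a toy model written in Python, not PDF Clown's actual API: the class names mirror the ones listed above, while the constructors, the attribute names (entries, data, base_object) and the type_name property are assumptions made for illustration only.

```python
class PdfObject:
    """Root of the toy COS model (Object Layer)."""


class PdfName(PdfObject):
    def __init__(self, value):
        self.value = value


class PdfDictionary(PdfObject):
    def __init__(self, entries):
        # Hypothetical storage: maps key strings to PdfObject values.
        self.entries = entries


class PdfIndirectObject(PdfObject):
    """Numbered container; the bridge towards the File Layer."""

    def __init__(self, number, generation, data):
        self.number = number
        self.generation = generation
        self.data = data


class PdfObjectWrapper:
    """Root of the toy Document Model: wraps exactly one PdfObject."""

    def __init__(self, base_object):
        self.base_object = base_object


class Document(PdfObjectWrapper):
    @property
    def type_name(self):
        # Semantic accessor that hides the dictionary lookup.
        return self.base_object.entries["Type"].value


catalog = PdfDictionary({"Type": PdfName("Catalog")})
indirect = PdfIndirectObject(1, 0, catalog)  # syntactic view of the catalog
document = Document(catalog)                 # semantic view of the same data
```

In this sketch the indirect object and the Document share the same dictionary instance: the first exposes it as a syntactic unit, the second as a semantic entity, which is the relationship the two root classes are meant to convey.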
Here is a class diagram representing the library’s main entities and their relations (note that this is just a more detailed view of the same layers described above).
The diagram below represents a Document object (the semantic root of a PDF file as modelled by PDF Clown) as viewed across the layers:
- Byte layer: the dotted box inside the PDF file icon contains a sample data fragment that represents a Catalog Dictionary (root object);
- Token layer: the bytes of the Catalog Dictionary are aggregated into atomic items (lexemes);
- Object layer: an indirect object pattern is recognized, so a PdfIndirectObject is instantiated to encapsulate the Catalog Dictionary data;
- File layer: the PdfIndirectObject (bridge between the Object Layer and the File Layer) containing the Catalog Dictionary is placed alongside the other indirect objects to represent the PDF file structure;
- Document layer: the Catalog Dictionary is encapsulated inside a Document object, which inherits from PdfObjectWrapper (bridge between the Object Layer and the Document Layer).
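The walk across the layers described above can be sketched end to end. This is a simplified illustration in Python under stated assumptions, not PDF Clown code: the raw fragment, the tokenizer pattern, the function names and the Document wrapper are all hypothetical, and the parser only handles a flat dictionary whose values are names or indirect references.

```python
import re

# Byte layer: a hypothetical raw fragment holding a catalog dictionary.
RAW = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

# Lexemes handled by this sketch: dictionary delimiters, names, integers, keywords.
TOKEN_PATTERN = re.compile(r"<<|>>|/[A-Za-z0-9]+|\d+|[A-Za-z]+")


def tokenize(data):
    """Token layer: split the bytes into atomic items (lexemes)."""
    return TOKEN_PATTERN.findall(data.decode("ascii"))


def parse_indirect_object(tokens):
    """Object layer: recognise the pattern `N G obj << ... >> endobj`."""
    number, generation = int(tokens[0]), int(tokens[1])
    if tokens[2:4] != ["obj", "<<"] or tokens[-2:] != [">>", "endobj"]:
        raise ValueError("not an indirect object holding a dictionary")
    body = tokens[4:-2]
    entries = {}
    i = 0
    while i < len(body):
        key = body[i][1:]  # drop the leading slash of the name
        if body[i + 1].startswith("/"):
            entries[key] = body[i + 1][1:]  # value is a name
            i += 2
        else:
            # value is an indirect reference: number, generation, "R"
            entries[key] = ("ref", int(body[i + 1]), int(body[i + 2]))
            i += 4
    return {"number": number, "generation": generation, "data": entries}


class Document:
    """Document layer: semantic wrapper around the catalog dictionary."""

    def __init__(self, base_object):
        self.base_object = base_object

    @property
    def pages_reference(self):
        return self.base_object["Pages"]


tokens = tokenize(RAW)
indirect = parse_indirect_object(tokens)

# File layer: indirect objects indexed by (object number, generation).
file_objects = {(indirect["number"], indirect["generation"]): indirect}

document = Document(file_objects[(1, 0)]["data"])
```

Each step consumes only the output of the step before it, so the Document at the end never touches bytes or tokens directly; that progressive hiding of syntax is the point of the layering.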