LATEST NEWS — PDF Clown 0.1.0 has been superseded by PDF Clown 0.1.1
This release introduces support for cross-reference-stream-based PDF files (defined since the PDF 1.5 specification), along with page rendering and printing: a specialized tool provides a convenient way to convert PDF pages into images (rasterization). Many minor improvements have been applied too.
Last but not least: the project’s base namespace has changed to org.pdfclown.
This release may be downloaded from:
20 thoughts on “PDF Clown 0.1.0 released”
I want to print a PDF, but cannot find an example. Is there already some java code for this?
There is a sample in the latest version (0.1.0), but your printed output will not match the PDF document, as PDF Clown does not render text or images, only lines and shapes.
you’re a wee bit lazy, Gamba 😉 didn’t you read the documentation that comes along with the downloadable distribution?
The User Guide (see userGuide.pdf file) features an Appendix (§ A. Samples) which is a complete directory keyed by topic to find the sample relevant to your use.
Furthermore, if you walk through the pdfclown.samples.cli project, you can immediately spot a sample file called “PrintingSample.java”… so damn easy! 😉
Anyway, please note that the printing functionality is currently pre-alpha (as stated in the above-mentioned documentation, see ISSUES), so it’s not expected to produce complete output (at the moment, for example, there’s no support for text rendering). It will be expanded in the next releases.
Does PDF Clown offer any facility for extracting tags from PDF files?
Do you mean tags as described by Tagged PDF spec [PDF:1.7:10.7]?
Those structures currently aren’t managed at high-level by PDF Clown (you can access them at low-level as primitive data structures though); anyway, marked contents within content streams are available for parsing through ContentScanner — see MarkedContent and MarkedContentPoint classes.
The project’s Status page describes in detail the level of implementation reached by PDF Clown.
I have one more issue regarding PDF extraction: how can we identify a table (its border lines)? Is there any control to access that?
As I stated in my previous reply, it’s a matter of heuristics — there’s no golden rule, just well-balanced analyses. Establishing such strategy is up to you, as it’s a non-trivial judgement which I haven’t done till now.
I am still trying to render images to the page but am having difficulty.
I have implemented the Scan method of the PaintXObject class and have successfully retrieved the image xobject but when I try to render it to the RenderContext the page is blank.
I think this is because of the clipping on the RenderContext and I’m guessing I need to do something with the matrices. Any idea what I’m doing wrong?
I have tried manually setting the Clip property of the RenderContext to one that would not clip the image and I am drawing the image to the RenderContext using a Graphics.DrawImage method. I still can’t work out why the image does not appear on the output.
I’m fairly confident that the image is OK, as I’ve tried saving it to a file and it is the image I’m trying to render.
I have also tried using a DrawString to draw some text to the RenderContext in the Scan method of the PaintXObject, like I did in ShowText, and this isn’t appearing on the output either.
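One possible cause worth checking (a guess, not a diagnosis of the code above): the PDF specification paints an image XObject into the unit square of user space, and it is the current transformation matrix (CTM) that stretches that square to the image's size and position on the page. Drawing the bitmap at its pixel size under the same transform can therefore land it far outside the visible area. The sketch below is in Java/AWT rather than GDI+, and every name in it (`ImagePlacementSketch`, `imageToDevice`, the sample numbers) is hypothetical, not part of PDF Clown.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class ImagePlacementSketch {
    /**
     * Builds the transform that takes image pixel coordinates (origin top-left, y down)
     * to device coordinates, given a page-to-device transform and the CTM in force
     * when the image is painted. Both inputs are assumptions supplied by the caller.
     */
    static AffineTransform imageToDevice(AffineTransform pageToDevice,
                                         AffineTransform ctm,
                                         int pixelWidth, int pixelHeight) {
        AffineTransform t = new AffineTransform(pageToDevice);
        t.concatenate(ctm); // unit square -> user space
        // Pixel space -> unit square: x' = x / w, y' = 1 - y / h
        t.concatenate(new AffineTransform(1.0 / pixelWidth, 0, 0, -1.0 / pixelHeight, 0, 1));
        return t;
    }

    public static void main(String[] args) {
        // Page 792 units tall; device y axis points down.
        AffineTransform page = new AffineTransform(1, 0, 0, -1, 0, 792);
        // PDF matrix [200 0 0 100 50 600]: image shown 200 x 100 units at (50, 600).
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 50, 600);
        AffineTransform t = imageToDevice(page, ctm, 400, 200);
        // The top-left pixel lands at device (50, 92), the bottom-right at (250, 192).
        System.out.println(t.transform(new Point2D.Double(0, 0), null));
        System.out.println(t.transform(new Point2D.Double(400, 200), null));
    }
}
```

The six numbers of a PDF matrix go into the `AffineTransform` constructor in the same order, which keeps the mapping easy to check by hand. With such a transform, `Graphics2D.drawImage(Image, AffineTransform, ImageObserver)` would place the bitmap without a separate rectangle.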
I have now got the image onto the page, but it doesn’t look much like the image, apart from the main colour. Maybe I need to DCT decode…?
At least I am now making some progress.
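On the DCT question: a PDF image stream that uses the DCTDecode filter holds ordinary JPEG data, so treating its bytes as raw pixels would give exactly that kind of "right colour, wrong picture" result. A minimal Java sketch of the decoding step follows, assuming the stream bytes have already been extracted unmodified; the class and method names are hypothetical, and the sample JPEG is generated in memory only to keep the example self-contained.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

class DctDecodeSketch {
    /** Decodes JPEG bytes (as stored in a DCTDecode stream) into a bitmap. */
    static BufferedImage decodeJpeg(byte[] streamBytes) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(streamBytes));
            if (img == null) {
                // ImageIO.read returns null when no registered reader recognises the data.
                throw new IOException("Data not recognised as an image");
            }
            return img;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in for bytes pulled from a PDF stream: encodes a blank image as JPEG. */
    static byte[] sampleJpeg(int width, int height) {
        try {
            ImageIO.setUseCache(false); // keep everything in memory
            BufferedImage src = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(src, "jpg", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedImage img = decodeJpeg(sampleJpeg(8, 6));
        System.out.println(img.getWidth() + " x " + img.getHeight());
    }
}
```

JPEG data always starts with the bytes FF D8, which is a quick way to check that the right bytes were extracted. In .NET the equivalent step would be loading the bytes through a stream into a bitmap before calling DrawImage.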
I’m also looking at adding images to the rendered pages.
Looking at the code I’m guessing this would be added to the Scan method of the ContentObject class.
Does PDF Clown only support jpg images?
In order to draw images on canvas you have to implement the Scan method of the PaintXObject class, as (external) images are referenced this way. There are also inline (internal) images (see InlineImage class), but they are quite rarely used.
Images are implemented within the org.pdfclown.documents.contents.entities namespace; yes, currently only JPEG images are supported, as I preferred to focus my efforts on extending horizontally the available library functionalities (that is, instead of concentrating on a narrow part of the PDF specification I chose to move across disparate parts to offer a richer potential). I’m confident that brave & skilled users may contribute some vertical part, such as adding image formats, compression filters and so on (see for example ASCII85Filter, which was contributed by J. James Jack, an English engineer, as stated in the API class credits).
Implementing additional parts is not that difficult 🙂 — I’m confident you may be the next candidate for contribution.
Have a nice day
Yes, what I did is very rough. Currently I just need a small thumbnail of PDF documents and it looks OK. I am still working on doing it properly, though.
It would be really nice if you could drop here your code lines for text rasterization to share with other users — after all, the spirit of this project is about cooperation. 😉
All I did was replace
in the ShowText class with a DrawString method applied to the ContentScanner’s RenderContext.
I was really lazy with it and just picked one font and one colour to use for all text. The characters were rendered upside-down too, so needed to be rotated.
It was something like this:
state.Scanner.RenderContext.DrawString(textChar.ToString(), new System.Drawing.Font("Arial", 4), System.Drawing.Brushes.Black, charBox);
I’m planning on making this a lot better as it’s pretty useless as it is.
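A likely reason for the upside-down characters: PDF user space has its y axis pointing up, while GDI+ and AWT device space has it pointing down, so once the page transform flips the axis, any native text call made under that transform comes out mirrored. Rather than rotating each character, the flip can be undone locally around the text origin. Here is a minimal sketch in Java/AWT (the same idea applies to System.Drawing matrices); the class name, method names and sample numbers are hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class TextFlipSketch {
    /** Maps PDF user space (origin bottom-left, y up) to device space (origin top-left, y down). */
    static AffineTransform pageToDevice(double pageHeight) {
        // x' = x, y' = pageHeight - y
        return new AffineTransform(1, 0, 0, -1, 0, pageHeight);
    }

    /** Transform for drawing a glyph upright with its baseline origin at (x, y) in user space. */
    static AffineTransform uprightTextAt(double pageHeight, double x, double y) {
        AffineTransform t = pageToDevice(pageHeight);
        t.translate(x, y); // move to the baseline origin
        t.scale(1, -1);    // undo the flip locally so the glyph is not mirrored
        return t;
    }

    public static void main(String[] args) {
        AffineTransform t = uprightTextAt(792, 100, 700);
        // The baseline origin lands at device (100, 92); a point 10 units up the
        // glyph (negative y in AWT font coordinates) lands above it at (100, 82).
        System.out.println(t.transform(new Point2D.Double(0, 0), null));
        System.out.println(t.transform(new Point2D.Double(0, -10), null));
    }
}
```

Setting such a transform on the graphics context before the draw call, and restoring the previous one afterwards, keeps the correction confined to text.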
I was looking at rendering the images today but didn’t get very far with it…
Good work so far. I would like to see rendering of pages with text. I am a developer and would be willing to provide some input on this. I’ve not had a thorough look through the code yet as I’m new to PDF Clown, and I’m not familiar with rasterizing fonts either. Would this be a great deal of work? Roughly how long do you think it would be before this feature would be in PDF Clown?
Text rendering of modern fonts is primarily a matter of outline drawing and filling through Bézier curves; there are also some cases where bitmap glyphs are still in use, such as some CJK fonts. Text rasterization will be part of my next developments; I can estimate some weeks of work (that is, about 3-4 months within my release cycle) to reach a decent representation.
I’ve had a good look through the source today and managed to rasterize the text to the page using just one extra line of code. Obviously this wasn’t anywhere near the overall standard of the library and I’ll be refining the code next week to get it closer to the PDF specification. I really like the coding you’ve done so far. It’s a great project. Good work.
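To make "outline drawing and filling" a bit more concrete, here is a rough Java2D sketch that fills one closed contour made of a straight segment, a quadratic Bézier curve (the kind TrueType outlines use) and a cubic one (the kind Type 1 outlines use). The contour is invented for illustration and is not a real glyph; a real renderer would read the outlines from the font program, and nothing here is PDF Clown code.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.geom.Path2D;
import java.awt.image.BufferedImage;

class OutlineFillSketch {
    /** Builds a made-up closed contour (a rough "D" shape). */
    static Path2D buildContour() {
        Path2D p = new Path2D.Double(Path2D.WIND_NON_ZERO);
        p.moveTo(10, 10);
        p.lineTo(10, 90);                  // straight stem
        p.quadTo(80, 90, 80, 50);          // quadratic Bézier segment
        p.curveTo(80, 30, 60, 10, 10, 10); // cubic Bézier segment back to the start
        p.closePath();
        return p;
    }

    /** Fills the contour in black on a white offscreen image. */
    static BufferedImage rasterize(Path2D contour, int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            g.setRenderingHint(RenderingHints.KEY_ANTIALIASING,
                               RenderingHints.VALUE_ANTIALIAS_ON);
            g.setColor(Color.BLACK);
            g.fill(contour); // scan-converts the outline using the winding rule
        } finally {
            g.dispose();
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage img = rasterize(buildContour(), 100, 100);
        System.out.printf("inside: %06X, outside: %06X%n",
                img.getRGB(20, 50) & 0xFFFFFF, img.getRGB(95, 95) & 0xFFFFFF);
    }
}
```

A glyph is typically several such contours filled together, with the winding rule deciding which enclosed regions (the hole in an "O", say) stay empty.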
Well done, crafty fellow! 🙂
You called the native GDI/AWT text rendering methods, didn’t you? That’s good for an approximate rendition, but you know…