PDF File analysis
PDF is an extremely complicated document file format, with plenty of tricks and hiding places. This also makes it popular for CTF forensics challenges. The NSA wrote a guide to these hiding places in 2008 titled "Hidden Data and Metadata in Adobe PDF Files: Publication Risks and Countermeasures." It is no longer available at its original URL, but archived copies can still be found. Ange Albertini also keeps a wiki on GitHub of PDF file format tricks.
The PDF format is partially plain-text, like HTML, but with many binary "objects" in the contents. Didier Stevens has written about the format. The binary objects can be compressed or even encrypted data, and include content in scripting languages like JavaScript or Flash. To display the structure of a PDF, you can either browse it with a text editor, or open it with a PDF-aware file-format editor like Origami.
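Because the structural keywords sit in plain text even when stream contents are binary, a first look can be done by scanning the raw bytes. The following is a minimal sketch using only the Python standard library; the function name `triage_pdf` and the selection of names to count are illustrative assumptions, not part of any existing tool. It will miss names that are obfuscated with hex escapes or hidden inside compressed object streams.

```python
import re

# PDF names that often deserve a closer look (an illustrative selection,
# not an exhaustive list).
NAMES = [b"/JavaScript", b"/JS", b"/OpenAction",
         b"/EmbeddedFile", b"/Encrypt", b"/ObjStm"]


def triage_pdf(data: bytes) -> dict:
    """Count indirect objects, streams, and selected names in raw PDF bytes."""
    counts = {
        # "12 0 obj" opens an indirect object; "endobj" does not match
        # because the pattern requires two numbers before "obj".
        "objects": len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data)),
        "streams": len(re.findall(rb"\bendstream\b", data)),
    }
    for name in NAMES:
        # The lookahead keeps /JS from also matching a longer name.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

To use it on a file, read the bytes with `open(path, "rb").read()` and pass them in; a non-zero count for `/JavaScript` or `/EmbeddedFile` is a hint about where to dig next, not proof of hidden data.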
qpdf is one tool that can be useful for exploring a PDF and transforming or extracting information from it. Another is Origami, a framework written in Ruby.
When exploring PDF content for hidden data, some of the hiding places to check include:
non-visible layers
Adobe's metadata format "XMP"
the "incremental generation" (incremental update) feature of PDF, wherein a previous version is retained but not visible to the user
white text on a white background
text behind images
an image behind an overlapping image
non-displayed comments
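As an illustration of the retained-previous-version case in the list above, each incremental save normally appends a new section ending in its own `%%EOF` marker, so truncating the file at an earlier marker can recover an older revision. The sketch below assumes that simple layout; the function name `split_revisions` is hypothetical. Note that a linearized PDF can contain two `%%EOF` markers without any edit history, and the marker may also occur by chance inside binary stream data, so treat the result as a lead to verify.

```python
from typing import List


def split_revisions(data: bytes) -> List[bytes]:
    """Return the file as it stood at each %%EOF marker, oldest first."""
    marker = b"%%EOF"
    revisions = []
    pos = 0
    while True:
        idx = data.find(marker, pos)
        if idx == -1:
            break
        end = idx + len(marker)
        # Everything up to and including this marker is one candidate revision.
        revisions.append(data[:end])
        pos = end
    return revisions
```

If more than one revision comes back, writing each one to its own file and opening it in a viewer shows what the document looked like before later saves.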
There are also several Python packages for working with the PDF file format that enable you to write your own parsing scripts.
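Even without a dedicated package, a small parsing script can go a long way. The sketch below uses only the standard library and assumes the streams of interest use the common FlateDecode (zlib) filter and are not encrypted; the helper name `inflate_streams` is hypothetical. It locates streams with a regular expression rather than following the cross-reference table, so it is a rough tool for CTF-style inspection and not a conforming parser.

```python
import re
import zlib
from typing import List

# Capture bytes between the "stream" and "endstream" keywords. The lookbehind
# stops the tail of "endstream" from being read as a "stream" keyword.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def inflate_streams(data: bytes) -> List[bytes]:
    """Return the decompressed contents of every zlib-compressed stream found."""
    results = []
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing bytes after the zlib data.
            out = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data (uncompressed, another filter, or encrypted).
            continue
        if out:
            results.append(out)
    return results
```

Searching the returned byte strings for text operators or a flag pattern is a quick way to check for content that never appears on the rendered page.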