Previous parts:

In the previous part, we found new information from the published documents through PDF version history analysis. In this part, we report on additional metadata leaks and data we discovered through PDF forensics, and explain the tools we used.

Metadata leaks

We went through all the published documents using various tools to analyze metadata from the PDF files. The most significant finding was the redaction failure in two documents about NSA spy hubs in the U.S. The PDF versioning metadata revealed whole sections in the documents that were meant to be removed. That finding was covered in the previous part. While we think that this was a valuable discovery of previously unknown information, we had hoped to find more significant revelations.

The fact is that overall the journalists have done a great job of redacting the documents and ensuring that the files' metadata do not leak anything significant.

(Note on disclosure: Throughout this analysis, we identified various NSA agent usernames and target identifiers (e.g. IP addresses and email addresses) that should have been redacted but remain visible. We've chosen not to republish them. Our focus here is on documenting the pattern of redaction failures and metadata leaks.)

Here are the other metadata leaks we found.

NSA spy spacecraft labels

In the document titled "Menwith collection assets," the spacecraft image on the second page originally had yellow text labels explaining each component. The spacecraft's name is redacted (probably ORION or NEMESIS?).

Screenshot from the published version.

Screenshot from the earlier version, revealing the yellow labels.

The removed labels:

2.7 m Collection Reflector
A2100 Commercial Bus
2.0 m Downlink Antenna
4x40 MHz Intercept Channels
C, Ku, Ka Horns

Target's email address and Machine ID visible

In the document titled "Operational Engineering November 2010," the screenshot on page 23 shows the target's email address and machine ID in version 1, later redacted in the final published version.

Target's HTTP header values and NSA agent username visible

The document titled "TDI Introduction" has four PDF versions. In versions 1 and 2, page 13 shows targets' HTTP header values (cookies, URI, referrer), redacted in versions 3 and 4. In version 3, the domains "youporn" and "reuters" are highlighted on the same page, but these highlights (likely made by journalists) are removed in the final published version 4.

On version 1, an NSA agent username is visible on pages 15 and 16 and redacted in later versions.

Decoration images removed, screenshot blurred, surveillance info redacted

The document titled "Elegant Chaos: collect it all, exploit it all" has five PDF versions.

Removed text

On the document titled "Sensitive Targeting Authorization" the date and "Tasking" texts are removed. The original document might have had all the fields filled, but the journalists removed every filled piece of information and published the non-filled version of the document.

The published version.

Version 1 showing the texts "3 October 2008" and "Tasking".

Bad redactions

On the document titled "Mobile Theme Briefing May 28 2010," we found that redacted text is selectable on the last page. The text reads:

• Standalone TERRAIN for Initial Mobile Exploit - SSE/SMO/COMSAT access (Long term MVR for SSE).

We found only one prior mention of this bad redaction, in the Polish cybersecurity blog zaufanatrzeciastrona.pl. The Polish blog reports that, "The hidden fragment sounds quite enigmatic," and that "perhaps someone will be able to decipher it someday." We did not find any other bad redactions in the published documents (other than those already reported).

Failed redactions

The published documents contain numerous failed redactions (information that should have been removed but remains visible in the final versions). The most common examples are NSA agent usernames and target identifiers such as IP addresses and email addresses.

Here's one method to demonstrate the extent of these failures. We OCRed all published documents and extracted 7-character strings delimited by common characters (space, comma, dot, etc.) as potential NSA usernames. After filtering common English words we were left with fewer than 1,000 candidates. We then manually verified potential usernames by searching them in context within the documents.

We ended up finding more than 20 NSA agent usernames in less than one hour. The NSA usernames (also officially "SIDs" or Security Identifiers) appear to follow a consistent 7-character format: two initials followed by up to four characters from the surname, sometimes with trailing digits. For example, Snowden's username was "ejsnowd" (Edward Joseph Snowden).

The consistent format makes usernames guessable if you know an agent's name, or conversely, makes names partially reconstructable from usernames. There's a pattern of inconsistent redaction where the same username often appears redacted in one screenshot but visible in another within the same document. With two initials and four surname characters, adversaries can cross-reference against known NSA personnel lists, narrow down candidates based on first-name initials, or verify suspected agents by matching the pattern. This represents an operational security failure on two fronts: the NSA's predictable username schema, and journalists' inability to consistently identify and redact them.

Analysis, forensics

We used multiple tools and scripts to find metadata leaks, bad redactions, and generally, information "that shouldn't be there."

Worth noting is that, even though we think we've managed to get almost all the published documents, we might still have missed some of them which could have some metadata leaks, bad redactions, etc. It could also be that we have some other version of the document (since many different organizations and people have re-published documents, with many editing the filenames and metadata) which is not the original published one.

Scope of analysis

We checked for hidden layers/optional content groups (OCGs), annotations, form fields, attachments/embedded files, version history, and bad redactions (e.g., black boxes over selectable text). Some analysis techniques proved too difficult or generated too many false positives, so we did not systematically check: content placed outside visible page boundaries, invisible text (white on white or invisible rendering modes), or compression artifacts. Tools capable of reliably analyzing these aspects would be valuable for future research.

Tools and scripts

We wrote a script that uses pdfresurrect and qpdf to go through all the PDF files in the current working directory and logs the PDF object differences between all the PDF versions.

For example, on the document "MHS Collection Assets," the script logs this entry:

--- DIFF START: ./MHS-collection-assets.pdf | Object 16 (V1 vs V2) ---
@@ -1 +1 @@
-<< /Annots 40 0 R /Contents 340 0 R /MediaBox [ 0 0 595.3200 841.9200 ] /Parent 1 0 R /Resources 335 0 R /Type /Page >>
+<< /Annots 40 0 R /Contents 386 0 R /MediaBox [ 0 0 595.3200 841.9200 ] /Parent 1 0 R /Resources 383 0 R /Type /Page >>
--- DIFF END: Object 16 ---

From that, we can quickly see that between version 1 and 2, there are some content changes on object 16. Indeed, that's how we spotted the section removals discussed in the previous part.

For most of the documents that have version differences, the changes are in the removal of producer/author metadata (e.g., the programs used by journalists to modify the PDF files). For example, here is the ImageMagick Producer and Title metadata removed between version 1 and 2:

--- DIFF START: ./tracking-targets-on-online-social-networks.pdf | Object 297 (V1 vs V2) ---
@@ -1 +1 @@
-<< /CreationDate (D:20150410150118) /ModDate (D:20150410150118) /Producer (ImageMagick 6.8.6-3 2014-04-08 Q16 http://www.imagemagick.org) /Title (Tracking Targets on Online Social Networks-final) >>
+<< /CreationDate (D:20150410150118) /ModDate (D:20150410150118) >>
--- DIFF END: Object 297 ---

Many of the documents also have page rotation changes between versions. For example, here, the page is rotated 90 degrees clockwise:

--- DIFF START: ./operational-engineering-nov-2010.pdf | Object 87 (V1 vs V2) ---
@@ -1 +1 @@
-<< /Contents 88 0 R /MediaBox [ 0 0 595.320 841.92 ] /Parent 1 0 R /Type /Page >>
+<< /Contents 88 0 R /MediaBox [ 0 0 595.3200 841.9200 ] /Parent 1 0 R /Rotate 90 /Type /Page >>
--- DIFF END: Object 87 ---

Especially interesting are the /Contents and /Resources changes, from which we can quickly spot if some actual content is changed between the versions.

The tool pdfxplr nicely outputs basic document metadata, such as authors, creation and modification dates, keywords, title, and the number of versions, but it also tries to lists all the emails, links, and IP addresses found in the document. This could be useful for general PDF forensics purposes. However, we did not find anything significant using pdfxplr, except for a bunch of NSA agent names and usernames.

The tool x-ray finds bad redactions. It did its job by finding all the bad redactions. Here's an example output of the "Mobile Theme Briefing May 28 2010" document, showing that there is a box over actual text, so the text is not actually removed from the PDF data:

FILE PATH: ./20140127-nyt-mobile_theme_briefing.pdf
REDACTION DATA:
{
  "4": [
    {
      "bbox": [
        83.9999771118164,
        252.00003051757812,
        695.9998779296875,
        336.0
      ],
      "text": "\u2022\u202fStandalone TERRAIN for Initial Mobile Exploit -  SSE/SMO/COMSAT access (Long term MVR for SSE). "
    }
  ]
}

Annotations, forms, attachments

We also checked for hidden data in annotations (using PyMuPDF, our script here), form fields (using pypdf, our script here), and attachments (using pdfdetach from poppler utils, our script here). The only annotations we found were made by journalists themselves. No form field data or embedded files were discovered in any of the published documents.

Our forensic analysis reveals that most published documents were handled carefully, with only two showing significant metadata leaks in their version history out of hundreds examined. The major finding (deleted content about NSA spy hubs exposed through PDF versioning) and the selectable text under redaction boxes represent rare lapses in an otherwise professional publication process.

The more pervasive issue is the pattern of inconsistent redactions across the document set. The same information (agent usernames, target identifiers) often appears redacted in one location but visible elsewhere, sometimes even within the same document. This inconsistency shows the fundamental challenge of manual review at scale - when dealing with hundreds of complex technical documents, even careful editors will miss details.

The bottom line is that the journalists did solid work under the circumstances. The traces remain, but they're minor ones.

If analyzing metadata leaks in published documents is this interesting, imagine what we could find in the 95-99% that remains unpublished. No rush, we've only been waiting over a decade.

We don't yet know what the next part will cover, though it will probably be completely unrelated to PDF forensics like this and the previous part.

Notes

Edit 18th January 2026: Our scripts were pointing to 404, repository was set to private. Now updated.