Are Wikileaks emails doctored?

I’ve heard claims that emails released by Wikileaks have been doctored. I decided to try to try to test this.

Email has hidden data

From most people’s perspective Email messages have a date, to, from, subject, and body. There is potentially a lot more information that isn’t shown to users. They can also have an arbitrary number of attachments, alternative views like HTML vs plain text, an arbitrary amount of metadata, and a log of how the message got from sender to receiver. Messages often have more invisible than visible information, and some of that can be used to check the authenticity of a message. Much of the data is checked routinely in an effort to block SPAM. The default view on Wikileaks doesn’t have any of this extra data, but with a little digging I found the raw originals were available for download.

Tools for SPAM prevention can serve another purpose

Years ago I was running several mail servers as part of my day job as a sysadmin for web hosting service. It was a common occurrence for other organizations mail servers to stop accepting our outgoing mail. Mail servers do a surprising amount of work to check that everything is properly set up. Multiple Black Hole List (BHL) servers may be contacted to check of the origin of message has a good good reputation. The receiving server could try connecting back to the server of origin to see if it allows incoming connections, and if it answers with the right hello message. It could ask the sending server if it has a mailbox for the from address of the message it is in the process of trying to send. It will reverse lookup the IP address to see if the ISP knows about the domain. It will statistically analyze the message to see if it “looks legit”. It seemed like there was a new test to pass every other week. One particularly strong feature is called DomainKeys Identified Mail (DKIM) Signatures. It cryptographically signs all outgoing messages, and publishes the public key via DNS.

It’s not a perfect system

Not all servers use DKIM Signatures, but they have become very common as it’s a lot easier to get your outgoing messages accepted. There are some limitations of course. They are signed to the domain not the user. The keys tend to be smaller because they are sent via DNS, and that system switches from UDP to slower TCP with large packets. The servers could replace the key such that old email can no longer be verified. The DNS servers might not use encryption themselves. Not all headers are signed. Minor data mangling from things like converting invisible line endings from DOS to Unix can break the authentication mechanism without changing the meaning in any way.

It has a good chance of being accurate anyways

Cryptographic signatures are very difficult to spoof. I checked the signatures gmail uses, and they are of similar strength as what Bitcoin uses. Anyone who can break a key that strong could rob the world of all Bitcoins. It is more plausible that the servers signing the messages get hacked than that someone breaks the key. Their reliability is related to how well the servers with the signing keys are protected.

Checking manually

Install tools

apt-get install wget opendkim-tools

Downloading a raw email from the command line.

wget 'https://wikileaks.org/podesta-emails//get/4984'

Checking the message signature.

cat 4984 | dkimverify

With this message I get “signature ok”

Preliminary results

I only got signature validation on some of the emails I tested initially but this doesn’t necessarily invalidate them as invisible changes to make them display correctly on different machines done automatically by browsers could be enough to break the signatures. Not all messages are signed. Etc. Many of the messages that failed were stuff like advertising where nobody would have incentive to break the signatures, so I think I can safely assume my test isn’t perfect. I decided at this point to try to validate as many messages as I could so that people researching these emails have any reference point to start from. Rather than download messages from wikileaks one at a time I found someone had already done that for the Podesta emails, and uploaded zip files to Archive.org.

Emails 1-4160
Emails 4161-5360
Emails 5361-7241
Emails 7242-9077
Emails 9078-11107

It only took me about 5 minutes to download all of them. Writing a script to test all of them was pretty straightforward. The program dkimverify just calls a python function to test a message. The tricky part is providing context, and making the results easy to search.

Automated testing of thousands of messages

It’s up on Github

It’s main output is a spreadsheet with test results, and some metadata from the message being tested. Results Spreadsheet 1.5 Megs

It has some significant bugs at the moment. For example Unicode isn’t properly converted, and spreadsheet programs think the Unicode bits are formulas. I also had to trap a bunch of exceptions to keep the program from crashing.