Wikileaks Research Project–Final Report: Background and Methods

The Motivation

https://twitter.com/#!/corbett_inc/status/11049922849673216

https://twitter.com/#!/corbett_inc/status/11050551278051330

https://twitter.com/#!/ario/status/11053875767279616

Check out my Wikileaks Research Project midpoint report for more details, but here is a bite-sized background piece quoted from that blog post:

A week ago I hatched a plan to educate myself in a more structured manner on what exactly Wikileaks stands for, its history, and its possible futures, with the idea of developing a more concrete and informed opinion (and, since I have a habit of taking action with respect to my opinions, thereby guiding my future actions). After seeking some advice via Twitter about what to read, and starting with this Julian Assange essay as a basis–a very good one for the philosophical background of the larger aims–I spent a while digesting the Wikipedia page on Wikileaks before deciding I wanted to fact-check its assertions, and that its references might serve as a springboard to the next step in my studies.

When the fantastic FBZ suggested I give a talk at Security BSides Berlin, I submitted this project as a talk; its acceptance gave me a tight deadline to meet:

Date of BerlinSides Talk, i.e. Deadline for Wikileaks Research Project Completion

I am happy to say I made it. The talk went decently but I’m making up for the places in which it was lacking by making a thorough blog post detailing the methodology and findings.

The Methodology

Everything I speak about here is available in the project's GitHub repository: https://github.com/corbett/WikileaksResearchProject.

1. Obtain articles

This is covered in more depth in my midpoint report blog post, but the CliffsNotes version is to run the following:

./AddUrlsToInstapaper.py -u username -p password page_url

on the relevant Wikipedia page, which in this case is http://en.wikipedia.org/wiki/WikiLeaks. AddUrlsToInstapaper.py is, of course, available on GitHub.
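The script itself lives in the repository, but the core idea is simple: pull the external reference links out of the Wikipedia article's HTML, then hand each one to Instapaper's add endpoint (which is what the -u/-p credentials are for). Here is an illustrative sketch of the link-extraction half, using only the standard library; the function and class names are mine, not necessarily the actual script's:

```python
import re
from html.parser import HTMLParser

class RefLinkExtractor(HTMLParser):
    """Collect external http(s) links from a Wikipedia page's HTML."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # keep absolute external links; skip Wikipedia-internal ones
            if re.match(r"https?://", href) and "wikipedia.org" not in href:
                self.urls.append(href)

def extract_reference_urls(html_text):
    """Return the page's external links, de-duplicated, in order of appearance."""
    parser = RefLinkExtractor()
    parser.feed(html_text)
    seen = set()
    return [u for u in parser.urls if not (u in seen or seen.add(u))]
```

Each extracted URL would then be POSTed to Instapaper's simple add API along with the username and password, one request per article.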

2. Read articles, all 265+ of them

The articles are available in .mobi e-book format and original HTML (for offline reading) in the GitHub repository. Since the inception of the project, I have read a bit every day, making notes along the way. The worthwhile reads are also marked in the citation spreadsheet (see step 3), indicated by belonging to the folder “WL Reading List” under the “Instapaper Folder” column.

Quotations from Worthwhile Reads–Click through for full size

3. Classify articles

During the course of reading, I quickly noticed that most articles didn’t cite their sources properly, and I was often led in circles before finding an original source. I decided to quantify this, and the results are available in the following spreadsheet: http://tinyurl.com/wikipedia-wikileaks-citations.

A few key results:

Coarse Citation Classification

The coarse citation classification looks pretty good: 47% of the articles are original documents (note this does not necessarily mean they were informative or useful, only that they were original) and another 25% cite original sources.

Fine-Grained Citation Classification

However, classifying in greater detail (the above lumps every article which cites at least one original source into the “yes” pile) and reserving “yes” for articles which cite original sources wherever applicable, the numbers get significantly more depressing. Here only 5% of articles (~10% of non-original-source documents) handle citations properly, and most articles which do cite an original source do so in a haphazard and inadequate manner.

4. Analyze relationships

In addition to the citation problem, I quickly noticed a repetition problem. Many articles were rehashes or light rephrasings of other articles, making this algorithmic “what to read” approach inefficient: I often read the same article, rehashed, over and over. To quantify this, I wrote a script to identify articles which had at least one full sentence in common, representing each such connection as an edge between two nodes (the two articles) in a graph. The articles the numbered nodes correspond to are listed in this table. The graphs are tangled media webs, but here are a few interesting excerpts:
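The sentence-overlap idea can be sketched in a few lines of Python. This is an illustrative reconstruction of the approach, not the actual script from the repository; the minimum-length filter and splitting heuristic are my assumptions:

```python
import re
from itertools import combinations

def sentences(text):
    """Naive sentence split; keep sentences of 5+ words to avoid trivial matches."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return {s.strip().lower() for s in parts if len(s.split()) >= 5}

def shared_sentence_edges(articles):
    """Given {node_id: text}, return (i, j, n_shared) edges for every pair of
    articles sharing at least one full sentence."""
    sent = {i: sentences(t) for i, t in articles.items()}
    edges = []
    for i, j in combinations(sorted(articles), 2):
        common = sent[i] & sent[j]
        if common:
            edges.append((i, j, len(common)))
    return edges
```

Comparing all pairs is quadratic in the number of articles, but for a corpus of ~265 documents that is only ~35,000 set intersections, which is entirely manageable.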

Wikileaks About Page (node 311)

A Few Articles in Swedish

New Yorker Profile of Julian Assange (node 6)

I also did a quick and dirty scan for the number of documents mentioning key terms to the saga:

Terms of Interest

This analysis was coarse, but what pops out clearly, as if it wasn’t obvious already, is Assange’s domination of the news.
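A scan like this amounts to little more than a case-insensitive, whole-word grep across the documents. A minimal version might look like the following (illustrative, not the repository's script; the example terms are mine):

```python
import re

def documents_mentioning(docs, terms):
    """Count, for each term, how many documents mention it at least once
    (case-insensitive, whole-word match)."""
    counts = {t: 0 for t in terms}
    for text in docs:
        low = text.lower()
        for t in terms:
            if re.search(r"\b" + re.escape(t.lower()) + r"\b", low):
                counts[t] += 1
    return counts
```

Counting documents rather than raw occurrences keeps one obsessive article from skewing the picture; it measures how widely a term spread through the coverage.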

Conclusions

The conclusions and opinions I formed and actions I plan as a result–both with respect to WikiLeaks and the media as a whole–deserve their own post. Stay tuned for Wikileaks Research Project Final Report II: Conclusions, Opinions and Actions. In the meantime the slides from my BerlinSides Presentation are available for perusal. Happy New Year!

Thoughts on “Wikileaks Research Project–Final Report: Background and Methods”

  1. Applying a quantitative method to media’s reporting of Wikileaks is a great idea. Do you think there can be a control case? Then we might be able to think about whether the media is “behaving differently” on this topic than on others. We will also be able to put your wonderful graph (from section 4) into better perspective. I think it would also be interesting to feature a few exemplary articles/sentences (if there are any) that got repeated the most or were most polemic, etc to see what indeed are the statements that were copied over most frequently/with most persuasion. Were there cases, however, when a statement was repeated (in quotes) so that it could be refuted in the new article?
    A few typos–“course” should be “coarse”; “we rehashes” should be “were rehashes.”
    Cool work,
    Seohyung

    1. Thanks for the comment. Typos fixed. Email with more as you find them.

Some great ideas–I have a few examples of those in the talk (or at least statements I was never quite able to verify in this sample of documents). I’ll do the additional analysis you suggest for the conclusions section; this was just going to be a report on the methodology, although of course what I decided to look at is in itself a statement.

I’d like to generalize this to apply to any topic of interest, but before I apply the method to a control I’d like a more valuable way of determining what to read, probably working from the citation classification and the originality of the document: original source > doc which cites original sources > doc which cites no original sources, with things ordered by how often they are themselves copied/referenced (edges coming out of the graph weighted by number of sentences in common?).

  2. This was awesome. Why is it so rare to see someone dig in and actually read what they’re talking about before they talk about it? Impressed and look forward to part 2.

    1. Thanks for the comment. It is probably so rare because it takes a lot of work; the whole project probably took me 30–40 hours. One of the things I’d like to build is an engine which makes such research easier–identifying the articles and original sources which make a lot of impact and are important to read in the course of research.
