The Motivation
https://twitter.com/#!/corbett_inc/status/11049922849673216
https://twitter.com/#!/corbett_inc/status/11050551278051330
https://twitter.com/#!/ario/status/11053875767279616
Check out my Wikileaks Research Project midpoint report for more details, but here is a bite-sized piece of background quoted from that post:
A week ago I hatched a plan to educate myself in a more structured manner on what exactly Wikileaks stands for, its history, and its possible futures, with the idea of developing a more concrete and informed opinion (and, since I have a habit of taking action with respect to my opinions, thereby guiding my future actions). After seeking some advice via Twitter about what to read, and starting with this Julian Assange essay as a basis (a very good one for the philosophical background of the larger aims), I spent a while digesting the Wikipedia page on Wikileaks before deciding I wanted to fact-check its assertions and that its references might serve as a springboard for the next step in my studies.
When the fantastic FBZ suggested I speak at SecurityBSides Berlin, I submitted this project as a talk; its acceptance gave me a tight deadline to meet:
I am happy to say I made it. The talk went decently, but I’m making up for the places in which it was lacking by writing a thorough blog post detailing the methodology and findings.
The Methodology
Everything I speak about here is available in the GitHub repository for the project: https://github.com/corbett/WikileaksResearchProject.
1. Obtain articles
This is covered in more depth in my midpoint report blog post, but the CliffsNotes version is to run the following:
./AddUrlsToInstapaper.py -u username -p password page_url
on the relevant Wikipedia page, which in this case is http://en.wikipedia.org/wiki/WikiLeaks. AddUrlsToInstapaper.py is, of course, available on GitHub.
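For the curious, here is a minimal sketch of what such a script has to do: scrape the external reference links out of the Wikipedia article and push each one to Instapaper’s simple add API. This is my own reconstruction, not the script from the repo; the regex, the User-Agent string, and the positional arguments are all assumptions.

#!/usr/bin/env python3
# Hypothetical sketch of AddUrlsToInstapaper.py; the real script (with its
# -u/-p flags) lives in the GitHub repo.
import re
import sys
import urllib.parse
import urllib.request

INSTAPAPER_ADD = "https://www.instapaper.com/api/add"  # Instapaper's "simple" API

def fetch(url):
    # Wikipedia is picky about the default urllib User-Agent, so set our own.
    req = urllib.request.Request(url, headers={"User-Agent": "WLResearch/0.1"})
    return urllib.request.urlopen(req).read().decode("utf-8", "replace")

def reference_urls(page_url):
    # Crude scrape: Wikipedia marks external citation links with class="external ...".
    return re.findall(r'class="external[^"]*" href="(https?://[^"]+)"', fetch(page_url))

def add_to_instapaper(username, password, url):
    # The simple API returns HTTP 201 when the URL is queued successfully.
    data = urllib.parse.urlencode(
        {"username": username, "password": password, "url": url}).encode("ascii")
    return urllib.request.urlopen(INSTAPAPER_ADD, data).getcode()

if __name__ == "__main__":
    user, password, page = sys.argv[1], sys.argv[2], sys.argv[3]
    for link in reference_urls(page):
        print(link, add_to_instapaper(user, password, link))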
2. Read articles, all 265+ of them
The articles are available in .mobi e-book format and in the original HTML (for offline reading) on GitHub. Since the inception of the project I have read a bit every day, making notes along the way. The worthwhile reads are also marked in the citation spreadsheet (see step 3): they are the entries in the “WL Reading List” folder under the “Instapaper Folder” column.
3. Classify articles
During the course of reading, I quickly noticed most articles didn’t cite their sources properly, and I was often led in circles before finding an original source. I decided to quantify this, and the results are available in the following spreadsheet: http://tinyurl.com/wikipedia-wikileaks-citations.
A few key results:
The coarse citation classification looks pretty good: 47% original documents (note this does not necessarily mean they were informative or useful, only that they were original) and 25% citing original sources.
However, classifying in greater detail (the above lumps every article which cites at least one original source into the “yes” pile) and reserving “yes” for articles which cite original sources wherever applicable, the numbers get significantly more depressing. Here only 5% of articles (~10% of the non-original-source documents) handle citations properly, and most articles which do cite an original source do so in a haphazard and inadequate manner.
4. Analyze relationships
In addition to the citation problem, I quickly noticed a repetition problem. Many articles were rehashes or light rephrasings of other articles, making this algorithmic “what to read” approach inefficient: I often read the same article rehashed over and over. To quantify this to some degree, I wrote a script (sketched just below) to identify articles which had at least one full sentence in common and to represent each such connection as an edge between two nodes, one per article, in a graph. The articles the numbered nodes correspond to are indicated in this table. The graphs are tangled media webs, but here are a few interesting excerpts:
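The core of that script is small. The sketch below is my reconstruction rather than the repo’s version, and it assumes the articles live as plain-text files under articles/: split each article into sentences naively, then draw a Graphviz edge between any two articles sharing a sufficiently long sentence.

# Reconstruction of the shared-sentence graph idea; file layout is assumed.
import glob
import itertools
import os
import re

def sentences(text):
    # Naive split; the length filter avoids spurious matches on short fragments.
    return {s.strip().lower() for s in re.split(r"[.!?]\s+", text)
            if len(s.split()) > 5}

def shared_sentence_edges(articles):
    # articles: {node_id: plain_text}; yields (a, b, n_shared) for each overlap.
    sents = {aid: sentences(text) for aid, text in articles.items()}
    for a, b in itertools.combinations(sents, 2):
        common = sents[a] & sents[b]
        if common:  # at least one full sentence in common -> an edge
            yield a, b, len(common)

if __name__ == "__main__":
    articles = {os.path.basename(p): open(p, encoding="utf-8", errors="replace").read()
                for p in glob.glob("articles/*.txt")}
    print("graph media {")  # Graphviz DOT, edges weighted by shared sentences
    for a, b, w in shared_sentence_edges(articles):
        print('  "%s" -- "%s" [weight=%d];' % (a, b, w))
    print("}")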
I also did a quick and dirty scan for the number of documents mentioning key terms in the saga:
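That scan amounts to something like the following (a sketch under the same assumed articles/ layout; the term list here is illustrative, not the one I actually used):

# Count how many documents mention each key term (documents, not occurrences).
import glob

TERMS = ["Assange", "Manning", "cables", "Domscheit-Berg"]  # illustrative list

counts = dict.fromkeys(TERMS, 0)
for path in glob.glob("articles/*.txt"):
    text = open(path, encoding="utf-8", errors="replace").read().lower()
    for term in TERMS:
        if term.lower() in text:
            counts[term] += 1

for term, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print("%-16s %d" % (term, n))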
This analysis was coarse, but what pops out clearly, as if it weren’t obvious already, is Assange’s domination of the news.
Conclusions
The conclusions and opinions I formed, and the actions I plan as a result (both with respect to WikiLeaks and the media as a whole), deserve their own post. Stay tuned for Wikileaks Research Project Final Report II: Conclusions, Opinions and Actions. In the meantime, the slides from my BerlinSides presentation are available for perusal. Happy New Year!
Applying a quantitative method to the media’s reporting of Wikileaks is a great idea. Do you think there could be a control case? Then we might be able to think about whether the media is “behaving differently” on this topic than on others, and we could also put your wonderful graph (from section 4) into better perspective. I think it would also be interesting to feature a few exemplary articles/sentences (if there are any) that got repeated the most or were most polemic, etc., to see what statements were copied over most frequently/with the most persuasion. Were there cases, however, when a statement was repeated (in quotes) so that it could be refuted in the new article?
A few typos: “course” should be “coarse”; “we rehashes” should be “were rehashes.”
Cool work,
Seohyung
Thanks for the comment. Typos fixed. Email me with more as you find them.
Some great ideas! I have a few examples of those in the talk (or at least statements I was never quite able to verify in this sample of documents). I’ll do the additional analysis you suggest for the conclusions section; this was just meant to be a report on the methodology, although of course what I decided to look at is in itself a statement.
I’d like to generalize this to apply to any topic of interest, but before I apply the method to a control I’d like a more valuable way of determining what to read, probably working with the citation classification and the originality of the document, i.e., original source > doc which cites original sources > doc which cites no original sources, with things ordered by how often they are themselves copied/referenced (edges coming out of the graph weighted by the number of sentences in common?). A toy version of that ordering is sketched below.
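To make that concrete (every name here is hypothetical):

# Rank originals first, then citers of originals, then the rest; break ties
# by how often a document is itself copied (shared-sentence out-edges).
TIER = {"original": 0, "cites_original": 1, "no_original": 2}

def reading_order(docs, times_copied):
    # docs: {doc_id: tier_label}; times_copied: {doc_id: shared-sentence count}
    return sorted(docs, key=lambda d: (TIER[docs[d]], -times_copied.get(d, 0)))

print(reading_order(
    {"A": "original", "B": "no_original", "C": "cites_original"},
    {"A": 12, "C": 3}))  # -> ['A', 'C', 'B']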
This was awesome. Why is it so rare to see someone dig in and actually read what they’re talking about before they talk about it? Impressed and look forward to part 2.
Thanks for the comment. It is probably so rare because it takes a lot of work; the whole project took me 30-40 hours. One of the things I’d like to build is an engine that makes such research easier by identifying the articles and original sources which have a lot of impact and are important to read in the course of research.