Shooter Citation Networks Project: PDFs, OCR, OMG!

In the last post we got as far as obtaining a list of files (web addresses and descriptions) to download and extract data from. Let's take a look at one of these.

Here we have an image of Pekka-Eric Auvinen’s former YouTube profile. Recall that we are seeking mentions of other shooters and their crimes.

While no other shooter is named overtly, someone familiar with school shootings will recognize "Natural Selection" as a catchphrase of Eric Harris, the Columbine shooter. However, the exact phrase "Natural Selection" does not appear here; variant forms do, so in cases like this it is preferable to search on a more general keyword pattern such as "natural" + [possible space] + "select" (which catches "selection", "selector", etc.). This generalization will be useful in several instances; a sketch of the kind of pattern I mean follows below. We may also notice the term "godlike", another Harris/Columbine meme. Fortunately for us, only a limited number of shooters have such a substantial "fan following" among other shooters that their speech patterns are reproduced in endless combinations and permutations as in-jokes or references, and most of these come from Columbine. In short, we should include such non-name references in the name search list as though they were nicknames for the shooters.
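To make that concrete, here is a minimal sketch of the kind of generalized pattern I mean, using Python's re module (the sample strings are made up for illustration, not quotes from any document):

import re

# optional whitespace between "natural" and "select" catches
# "natural selection", "natural selector", "naturalselection", etc.
pattern = re.compile(r"natural\s?select", re.IGNORECASE)

for sample in ("NATURAL SELECTION", "a natural selector", "naturally selective"):
    print(sample, "->", bool(pattern.search(sample)))

Note that the last sample correctly fails to match: we want variants of the catchphrase, not every word starting with "natural".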

We may also notice some socio-political buzzwords that may be of interest later, when we are looking into ideological similarities between shooters other than direct influence. I see “sturm”/”storm” (can be a white nationalist metaphor), “atheist”, “humanist”, etc., as well as loaded descriptions of the shooting action, such as “eliminate” and “my cause” (shooters often reference their “justice day”, “wrath”, “final action”, or whatever). We should start a list of interesting terms to consider later.

We may also note that this is an image PDF, and as such we cannot yet ask our computer to search it as text. We will need to obtain searchable text from our PDFs. I didn't know about this before this project, but apparently there are a few basic types of PDFs:

1) "Digitally created PDFs" / "true PDFs": PDFs created with software that gives them internal meta-information designating the locations of text, images, and so on. In a sense, the computer can "see" what the naked eye sees thanks to this internal structure, and there is preexisting, user-friendly GUI software for navigating these documents manually, one at a time (or of course you could automate their navigation yourself however you desire; I'm just saying they're already cracked open and laid out like a multicourse meal for a layperson to consume, provided they have the appropriate tools on hand).

2) PDFs that are just images, such as a raw scan of an original document of any kind, or of a picture, or of handwriting, etc. Just any old thing: a picture of it. There is no internal structure to help your computer navigate this, it’s “dead” in its current state. This is the worst case (aka the fun case for us; you’ll see!).

3) PDFs that have been made searchable through application of (e.g.) OCR to an image PDF, yielding text that is then incorporated into the original document as an underlayer. This text can be selected, copied, etc. This is a good situation. In our case, when we OCR image files, we’ll just go ahead and save the text in a file by itself (one per source image PDF) rather than creating searchable PDFs– because that’s all we need!

This is a case of #2– just a plain ol’ screenshot that someone took of this YouTube profile.

Now, this is a relatively small document in terms of the text contained within it– if I had a transcription of this text, it wouldn’t kill me to just read through it and see for myself if anything notable is contained therein. However, besides the sheer number of documents, a lot of the documents we’re going to be dealing with are these really interminable, deadly-dry court documents or FBI compendium files that are just hundreds and in some cases (cough, Sandy Hook shooting FBI report) thousands of pages long– fine reading for a rainy day or when I’m sick in bed or something, but not something I want to suffer through on my first attempt to get a high-level glance at who’s talking about whom.

(Seriously, some of the content of these things– there’s stuff like the cops doing the interrogations squabbling about when they’re going on their lunch break and “Wilson, stop fiddling with the tape while I’m interviewing!” and people going back and forth about whether they just properly translated the question to the witness or whether they just asked her about a pig’s shoes by mistake, etc.– and Valery Fabrikant representing himself in trial– merciful God! I’m going to have to do a whole separate post on the comic relief I encountered while going through these, both in terms of actually funny content and in terms of stuff that my computer parsed in a comically bogus way, such as when someone’s love of Count *Cho*cula gave me a false positive for a reference to Seung-Hui *Cho*.) Point being, I’m not gonna do this, I’m gonna make my computer do it. So that’s gonna be half the battle, namely, the second half.

First half of the battle is going to be getting the text out of the PDF. Enter optical character recognition (OCR). OCR is, in short, your computer reading. So let's back it up: when you're looking at text in some program on your computer, you're looking at what's really a manifestation of an underlying numerical representation of that character, i.e. a character encoding (meaning your computer knows two different "A"s are both capital As in the sense that they both "mean" such-and-such number). It's not trying to figure that out from the arrangement of pixels of the character in that font every single time. (Honestly, I don't feel I have enough experience in this area to judge what the "main" topics are, so I'm just going to link out to someone else and you can read more if you like.)

But when the computer is looking at a picture of someone’s handwriting, or a picture of printed-out text from another computer, it’s only seeing the geometric arrangement of the pixels; it doesn’t yet know where the letters stop and start, or which number to associate a written character with once it is isolated. So what would you do if someone in a part of the world that used a totally unfamiliar alphabet slid you their number scrawled on a napkin (which they’d set their drink on, leaving a big wet inky ring)? First you’d try to mentally eliminate everything that’s not even part of the number– any food, dirt, wet spots, etc. Then you’d try to separate out groupings of marks that constitute separate numbers (like how the word “in” has the line part and the dot part that together make up the letter i– that’s one group– and then the little hump part that constitutes the n– that’s the second group). Then you’d zoom in on the grouping that you had decided made up one numerical digit, and you’d look back and forth between that and a list of all the one-digit numbers in that language in some nice formal print from a book that you are satisfied is standard and “official enough”. So you’d start with the first number in the list, and compare it to your number. Now you could go about this comparison a couple of ways.

1) Draw a little pound sign / graph over your number and over the printed number (imagine a # over the number 5). Compare the bottom left box of your number to the bottom left box of the printed number. Plausibly the same or not at all? Then compare the bottom right box. Etc. Apply some standard of how many boxes have to be similar to decide it’s a match, and when you find a match in the list of printed numbers, stop (or, do the comparison algorithm for each entry in the printed list, and whichever printed number has the most similar boxes to your written number is picked as the answer).

There are some problems with this, though: things might be tilted, differently sized, or written with little flourishes on the ends of the glyphs, such that on a micro level the similarities are disguised. Think of the "a" character as typed in this blog vs. the written "a" taught in US primary school (basically an "o" with the line part of an "i" stuck on the right-hand side). The written, elementary-school "a" would likely be judged an "o" under the # system. Not good. This is called matrix matching. (A toy sketch of the grid idea appears after this list.)

2) Attempt to identify the major parts, or “features”, of the character. (For example, we will consider the line and the dot in an “i” to be separate features because they’re separated in space, or the lines in an “x” as individual features as they can be drawn in one stroke, or whatever.) For the “a” we have a round part, and on its left a vertical line of some sort. Okay, now we’re talking generally enough that the two “a”s described above sound pretty similar. This is called feature recognition. (As you can imagine, it gets pretty complicated to get a computer to decide how to look for features and what’s a feature.)
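As promised, here is a toy sketch of the matrix-matching grid idea in Python. This is purely illustrative; real engines are far more sophisticated. I'm assuming the glyph has already been isolated and binarized into a 2D array of 0s and 1s:

import numpy as np

def cell_signature(glyph, grid=(3, 3)):
    # overlay a grid on the glyph and record the fraction of "ink"
    # pixels falling in each cell
    rows = np.array_split(glyph, grid[0], axis=0)
    return np.array([[cell.mean() for cell in np.array_split(row, grid[1], axis=1)]
                     for row in rows])

def match_distance(glyph, template, grid=(3, 3)):
    # smaller distance = more plausible match; you would compare the glyph
    # against every printed template and pick the closest one
    return np.abs(cell_signature(glyph, grid) - cell_signature(template, grid)).sum()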

So, that’s the game. There are several “engines” / programs / packages / etc for performing this task, and I used Tesseract. It’s pretty great at reading all kinds of *typed* characters, but you have to train any engine to read handwriting (one handwriting at a time, by slowly feeding it samples of that writing so it can learn to recognize it). I had so many different people’s handwritings, and so few handwritten documents PER handwriting, that this didn’t seem like the project to get into that on. I’m definitely going to get back to that for the purposes of transcribing my own handwriting, as I write poetry and prose poetry longhand and have a huge backlog of notebooks to type up (securing all that data is one of my main outstanding life tasks, in fact– there’s really no excuse at this point to endanger all of my writing by leaving it in unbacked-up physical copies).
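I'll save the details for the next post, but as a teaser, here is a minimal sketch of the typed-text route, assuming the pdf2image and pytesseract wrapper packages (the filenames are hypothetical):

from pdf2image import convert_from_path  # renders PDF pages via poppler
import pytesseract

pages = convert_from_path("auvinen_profile.pdf", dpi=300)  # one PIL image per page
text = "\n".join(pytesseract.image_to_string(page) for page in pages)

with open("auvinen_profile.txt", "w") as out:  # one .txt per source PDF
    out.write(text)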

This post is getting a little long so I’m going to go ahead and put it up and get into the technical stuff in the next post. Peace!

Beautiful Soup Pt. 1: Web Scraping for the Citation Networks Project

As discussed in my previous post, I’ve been mass-downloading and automatedly searching rampage shooters’ manifestos, court documents, etc. for mentions of the other shooters (by name, event, associated symbols, and so on). For this I used a Python library called Beautiful Soup, and I’d like to say a few words about how the process goes.

What is Beautiful Soup?

Beautiful Soup (library and documentation available at link) self-describes as “a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. ” It is currently in version 4 (hence I’ll refer to it as BS4 from here on), but it is possible to port code written for version 3 without much trouble.

What do we want from it?

To put it as broadly and simply as possible: we want to take a complete website that we find out there on the internet and turn it into an organized set of items mined from within it, throwing back all the stuff we don't want. For example, in my project a major source of these documents was Peter Langman's database of original documents pertaining to mass shootings. Here it is "normally" and here it is "behind the scenes". [pic 1, pic 2] Not only am I going to want to download all of these documents and leaf through them, but I'll need to do so in an organized fashion. So a good place to start might be:

Goal: obtain a list or lists of shooters and their associated documents available in this database.

Using Beautiful Soup

Basically, how it works is that we assign a webpage's code to be the "soup" that BS4 will work with. We let BS4 parse the "soup" so that it can navigate the document using the site's HTML tags as textual search markers, and pull out (as strings) whichever types of objects we want.

from bs4 import BeautifulSoup
import urllib2  # Python 2; in Python 3, use urllib.request instead

site = "https://schoolshooters.info/original-documents"
page = urllib2.urlopen(site)
soup = BeautifulSoup(page, "html.parser")  # naming a parser avoids a warning

I myself chose urllib2 for fetching the page. BeautifulSoup (and, later, prettify) are provided by the BS4 library. So, we've set the scene to allow us to move through the HTML/text; some examples are below (these and more can be seen at the BS4 documentation page).

We can pull out parts of the (prettified) soup by specifying tags, for example:

soup.prettify()  # returns the parse tree as a nicely indented string
all_links = soup.find_all("a")  # every <a> (link) tag in the page

Pulling out all the "href" attributes within these "<a>" tags…

templist = list()
for link in all_links:
    templist.append(link.get("href"))  # get() returns None if a tag has no href


One can work with nested tags as well by iterating the same procedure.

Unfortunately this is still not my desired list of links, but a simple script of the type below can filter for the appropriate strings and write them to a file.

with open('linkslist.txt', 'w') as f:
    for item in templist:
        # skip None entries and keep only absolute links
        if item and "http" in item:
            f.write(item + "\n")

yields a document such as the one below. Also included are a couple of other samples of basic ways the fished-out data could be written to documents. Please note that more documents were added to the database after I began this project, so e.g. the William Atchison files are not included here even though they appear in the "soup" pictures. I'll synchronize the images later.

So now we’ve managed to extract some data that would have taken much longer to do by hand! Next on the agenda will be mass-downloading my desired files (and avoiding undesired ones) and crawling them for cross-referencing– while avoiding booby traps! See you in the next post.

Project Overview: Citation Networks in Rampage Shootings

I've been working on my latest project for a long time now, but, having underestimated how complex it would get, I was waiting until I had "a draft finished" to post something– ha! Yeah, right! There may never be a true "end" to this project, so waiting for a "complete" draft might mean waiting forever. I'm going to start posting periodic updates as I pass through stages of the project instead. So, without further ado, introducing:

Citation Networks in Rampage Shootings.

Background: In my experience, media coverage and anecdotal discussion of mass shooting events typically portray them as essentially unconnected natural disasters. But as it turns out, "climatology" is a very complex science…

In the last few years the US public has started to realize that there is an element of media influence involved in the motivation for perpetration, and academics have begun to analyze the characteristics of perpetrators (examples: race, age, mental health diagnoses on record, year the shooting was committed) and even the frequency of certain content in their writings; see for example this Peter Langman paper (link does not directly open the large document). We are able to do this because of the great deal of shooter/event information available to the public via the Freedom of Information Act (FOIA).

However, there has not been a concentration on rampage shootings as a covert political movement of sorts, and/or a type of abstract terrorist network operating in single-cell units. I posit that this is a useful conception, expressed as a network (a directed acyclic graph) of "citations" between rampage shooters. The existing methodology of citation analysis can provide a framework or guide for this expression. For now, I am concentrating on creating the most thorough network I reasonably can while maintaining relevance to my interests and an appropriately tight scope (I'll elaborate).

[Image: sci2figure4.12] Illustration of citation network(s), from this wiki page on network analysis.

My vague “starting goal”: Obtain and organize documents (primarily manifestos) associated with rampage shooters, and datamine them for cross-references. Create a visual network.

My concretized, "actionable" goal: Scrape the web for documents written directly, or otherwise generated directly (e.g. an FBI report listing the websites visited by a shooter as found on his computer), by a restricted population of mass shooters (starting with those on Langman's schoolshooters.info website). [✔️] Process them all into searchable text (as some are PDF, handwritten, etc.). [✔️] Create a list of shooter-associated names and terms (names, nicknames, schools attacked, and unambiguous referents) [✔️], and directly search the documents for these names/terms, creating a list of citations apparent between shooters (which will need to be cleaned for redundancy and continually checked by hand to see if the results make sense; a sketch of this search step appears below). Go through and throw out false positives, and attempt to identify false negatives from personal knowledge of the documents (which I viewed individually when classifying them during OCR pre-processing). Create a visualized network of citations [❌] where the graph is a directed acyclic graph, the nodes are shooters, and the edges are citations. Ideally, the nodes will be physically laid out along a timeline. [❌]
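For concreteness, here is a minimal sketch of that search step; the file names and the term-list format are hypothetical stand-ins for my actual setup:

import re
import json
from pathlib import Path

# hypothetical file mapping each shooter to name/term patterns, e.g.
# {"harris": ["eric harris", "columbine", "natural\\s?select"], ...}
terms_by_shooter = json.load(open("terms.json"))

citations = []  # (citing document, cited shooter, matched term)
for doc in Path("ocr_txt").glob("*.txt"):
    text = doc.read_text(errors="ignore").lower()
    for shooter, terms in terms_by_shooter.items():
        for term in terms:
            if re.search(term, text):
                citations.append((doc.stem, shooter, term))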

Where I’m at now:
✔️ Outline my goals.
🆕  Blog about same.

✔️ Define the population to be included (restricted set of shooters). Start with shooters represented in Langman's original documents database, and work outward from there.
✔️ Scrape the web for documents using BeautifulSoup.
🔜 Blog about the web scraping.

✔️ Learn what OCR options are out there. Use OCR and manual/voice transcription to convert all document types to searchable .txt files as follows. First classify all documents according to their status: useless document, desirable and already usable txt, desirable but needs OCR, desirable but handwritten / otherwise illegible to conventional prefab OCR. Use tesseract and other packages until either every typed document is successfully hit or I conclude that that is not going to happen anytime soon. For everything else, decide whether it’s worth training AI to decode the writing/printing style or whether it’s better to just grind it out by transcribing manually or reading aloud into a voice-to-text processor.
🔜 Blog about the document processing.

✔️ Create list of terms and names associated with each shooter.
✔️ For each shooter, use Python to mine associated list of outgoing citations (of other shooters in this set) from documents.
✔️ Review identified citations, scrutinize sensitivity and specificity.
❌ Repeat until satisfied.
🔜 Blog about the comedic mishaps encountered along the way (no spoilers, but let's just say I had some very confusing results for a while).

❌ Decide which language(s) to create visuals in.
❌ Create formal nodelist (easy modification of shooter list) and edgelist (slightly harder but still pretty easy translation of citations list/log once created).
❌ Plot and prettify visual graph.
❌ Partially order nodes in time, place physically along a timeline.

💭 Node expansion goals: flesh out the network with all “iconic figures” mentioned, starting with other mass shooters (not from Langman), then enlarging to other mass killers, then killers in general (e.g. Hitler) as well as media (e.g. The Basketball Diaries). Needless to say many of these would not need to be searched for most outward citations (e.g. The Basketball Diaries is likely not citing any modern shooters).
💭 Edge expansion goals: grow the network into a set of hypergraphs with edge criteria such as: same type of weapon used (may need consultation as to what is similar enough to constitute probable mimesis), similar poses in images released, same type of manifesto (video, etc) released to the news intentionally, and so on. Each criterion could be represented by a different color, for example, if we wish to be able to toggle through criteria– perhaps create a mini interactive network where the user selects the criterion.
💭 Incorporate visual citations: Consider pictures released to the public by shooters that contain visual references to other shooters. Most of these are to Columbine. We have the “wrath” & “natural selection”-style “uniforms”, the Dylan Klebold gun-finger wave, and so on.

Coming soon… Perils and adventures encountered putting my first toe in the web scraping waters!

Influential Nodes in Worldwide Terror Networks: Centrality + Improved Graphics

I’ve improved the presentation of my network model for global terrorist collaborations. You can take a look at the code on my github, and definitely follow the link to view the network in full.

CLICK HERE TO SEE THE FULL NETWORK.

[Image: Screen Shot 2017-02-21 at 8.08.43 PM.png]

Please note that I replaced the node IDs for "the" Taliban (T), Boko Haram (BH), ISIL/ISIS (IS), Hamas (H), and "the" Al-Qaeda (without a regional modifier in the name) (AQ) with the initials given here, so that they can be easily pinpointed on the graph. You'll probably want to open the lengthy node key in another window.


A few notes:

Criteria for inclusion. Please refer to my previous post.

Node Clean-up. I got rid of the nodes “Unknown”, “Individual” (meaning a non-organization), and “Other”, which had escaped my attention and unduly linked some pairs of organizations as having one degree of separation (e.g. both Group A and Group B collaborated with persons who were never discovered– this doesn’t mean they collaborated with the SAME person!). I’m also noticing some nodes here and there that have basically the same problem, such as “Palestinians”– that is not an organization. I will return to these sorts of nodes and remove them on a case-by-case basis.

Community Detection. I used the “fast greedy” community detection algorithm to assign and color the communities. Here is a comparison of community detection algorithms for networks with various properties. Before executing this algorithm, I combined any multiple edges between a pair of nodes into a single weighted edge, and got rid of loops (since “collaboration with oneself” is not what I was intending to portray in this model).
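The analysis itself was done in R's igraph; for readers following along in Python, here is a rough python-igraph analogue of those steps (the edge-list file name is hypothetical). Note that fast greedy requires a simple graph, which is another reason the simplify step comes first:

import igraph as ig

g = ig.Graph.Read_Ncol("collaborations.txt", directed=False)  # hypothetical edge list

# collapse multi-edges into single weighted edges and drop self-loops
g.es["weight"] = 1
g.simplify(multiple=True, loops=True, combine_edges={"weight": "sum"})

communities = g.community_fastgreedy(weights="weight").as_clustering()
print(communities.summary())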

Let’s take a look at the output given by R. Upon inspection, these groupings seem to make sense; the organizations seem plausibly affiliated and frequently refer to the same cultures, regions, or ideologies. Some of the names could use a bit of clarification (for example, “Dissident Republicans” refers to breakaways from the IRA toward the end of the Northern Ireland conflict) or expansion/compression. As you may infer, the numberings to the left of the members of a group are not the node IDs that appear in the rainbow graph later, but rather numberings within the communities (only the first number is shown per line of community members).

[Image: Screen Shot 2017-02-21 at 8.14.30 PM.png]

SEE COMMUNITY CLUSTERS HERE. 

Cliques. The largest cliques (complete subgraphs) were revealed as:

Clique 1. Bangsamoro Islamic Freedom Movement (BIFM), New People's Army (NPA), Moro National Liberation Front (MNLF), Moro Islamic Liberation Front (MILF), Abu Sayyaf Group (ASG)

Clique 2. Popular Resistance Committees, Popular Front for the Liberation of Palestine (PFLP), Hamas, al-Aqsa Martyrs Brigade, Democratic Front for the Liberation of Palestine (DFLP)

Clique 3. Popular Resistance Committees, Hamas, al-Aqsa Martyrs Brigade, Palestinian Islamic Jihad (PIJ)
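In python-igraph terms (again a sketch, reusing the graph g from the snippet above), the largest cliques can be listed like so:

# largest complete subgraphs; each clique is a tuple of vertex indices
for clique in g.largest_cliques():
    print([g.vs[i]["name"] for i in clique])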

Centrality. I wanted to know how “influential” each node was. Of course, centrality is not the only way to measure this, especially in a case like the GTD where we have so much other information, such as victim counts. Even going on centrality, there are several centrality measure options in igraph for R; I went with eigencentrality. To quote from the manual:

“Eigenvector centrality scores correspond to the values of the first eigenvector of the graph adjacency matrix; these scores may, in turn, be interpreted as arising from a reciprocal process in which the centrality of each actor is proportional to the sum of the centralities of those actors to whom he or she is connected. In general, vertices with high eigenvector centralities are those which are connected to many other vertices which are, in turn, connected to many others (and so on).”

[Image: Screen Shot 2017-02-20 at 3.22.25 AM.png]

The “scale” option fixed a maximum score of 1.
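A python-igraph version of the same measurement might look like this (again a sketch against the graph g from above):

# scale=True normalizes the scores so the maximum is exactly 1
scores = g.eigenvector_centrality(scale=True, weights="weight")
ranked = sorted(zip(g.vs["name"], scores), key=lambda pair: -pair[1])
for name, score in ranked[:10]:  # ten most central organizations
    print(round(score, 3), name)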

Nodes Sorted by Eigencentrality (Decreasing) + Commentary:

de Bruijn Graphs, etc. (co-authored)

Had the pleasure of putting this survey paper together alongside Camille Scott and Luiz Irber for Raissa D’Souza’s Network Theory class in Spring 2016. It’s about the network theoretic aspects and applications of genomic data, with a bit of a history lesson tied in. The data used constituted all invertebrate and mammalian genomes available on NCBI, a whopping 84 GB. Luckily my co-authors had access to “supercomputers”. Please don’t be intimidated by the wall of text; I started this project with zero knowledge of genomics (thanks Camille and Luiz!) and co-wrote with a similar audience in mind. All of the graphics are by Ms. Scott.

[Images: pages of the survey paper, genomics1–genomics11]

Feature Extraction on Global Terror Events


The GTD is incredible.

The GTD is an index of terrorist or suspected terrorist events from 1970 to 2014, compiled by the University of Maryland for the U.S. Department of Homeland Security. The documentation for the project can be found at [4]. It contains over 100k events with no geographical restriction.

From the source material:

"The original set of incidents that comprise the GTD occurred between 1970 and 1997 and were collected by the Pinkerton Global Intelligence Service (PGIS), a private security agency. After START completed digitizing these handwritten records in 2005, we collaborated with the Center for Terrorism and Intelligence Studies (CETIS) to continue data collection beyond 1997 and expand the scope of the information recorded for each attack. CETIS collected GTD data for terrorist attacks that occurred from January 1998 through March 2008, after which ongoing data collection transitioned to the Institute for the Study of Violent Groups (ISVG). ISVG continued as the primary collector of data on attacks that occurred from April 2008 through October 2011. These categories include, quote, 'incident date, incident location, incident information, attack information, target/victim information, perpetrator information, perpetrator statistics, claims of responsibility, weapon information, casualty information, consequences, kidnapping/hostage taking information, additional information, and source information,' as well as an internal indexing system. […] The GTD defines a terrorist attack as the threatened or actual use of illegal force and violence by a non state actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation." (More on their criteria in a moment.) Take a look at the source codebook for yourself and enjoy the rich array of data that this project has! I tried to compile a small subset of this information myself once upon a time and it was a ton of work, so props to these people for stepping up.


Transforming Qualitative Data Into Quantitative Data

I originally selected this data for a class project, and part of that class was concerned with dimension reduction. It seems that most dimension reduction and feature extraction algorithms are designed with continuous, or at least ordered, data in mind. For this reason I sought to convert the GTD data from categorical strings into numbers. Goals: Make the data easier to dimension-reduce. Interpret the information in the GTD in a way such that it can be internally compared, despite the disparate value ranges and types the various features take. Identify characteristics that predict other characteristics in an arbitrary or restricted-domain terrorist incident.

I transformed the data as follows.

Some of the data was simple enough that I was able to directly convert it into an ordered numerical scale. I converted the "target types"– the intended victims of the acts– by classifying them on a scale from civilian to state targets, where 1 is "most civilian", an infrastructural target intended to affect daily living (this included their categories of: private citizens/property, journalists & media, educational institutions, abortion-related, business, tourists, food/water supply, telecommunication, utilities, and transportation), 2 is semi-state or other loosely organized or less-empowered political organizations (airports & aircraft, maritime, NGO, other, religious figures/institutions, terrorists/non-state militias, & violent political parties), and 3 is "most statelike" (general government, police, military, diplomatic government, and unknown). For the ambiguous ones (other, unknown, etc.) I looked at what was actually in that set to determine its category. Let's take a look at the GTD's criteria for inclusion while we're at it:

[Image: Screen Shot 2016-09-25 at 4.56.29 PM.png – the GTD's criteria for inclusion]

At this point in my exploration I wasn't sure which techniques I would wind up using, but I wanted to prepare the data to be as malleable as possible without losing much. If I decided to use compressive sensing techniques to reduce the dimensionality of the data, a sparse matrix representation of the data would be preferable. Sparse intuitively means that for every feature of an incident/entry, the expected value is near zero due to a high number of zero instances of this feature across entries. Using the GTD, I had a lot of categorical variables that take, say, N values on the dataset, so I reasoned that these might best be decomposed into N features that each take a binary value. For example, the original variables "weapon type 1", "weapon type 2", "weapon type 3" were converted into a column: was there a firearm involved? y/n, i.e. a binary-valued "weapfirearm" column. I made separate binary features for each possible weapon type. Chemical, biological, nuclear, and radiological were so seldom occurring that I threw them away as features. I also made binary columns for whether hostages were taken, whether the attack was coordinated between multiple parties, whether the perp is known or unknown, and whether the perpetrators were from the region in which they committed the crimes. Regions were broken down into simple cultural regions like the Middle East and North Africa, South Asia, Europe, and so on by the GTD people.
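As a sketch of that encoding in pandas (column names per the GTD codebook; the working file name is hypothetical and the target-scale mapping is abbreviated here):

import pandas as pd

df = pd.read_csv("gtd.csv")  # hypothetical working file

# ordinal civilian-to-state scale for target type (abbreviated mapping)
target_scale = {"Private Citizens & Property": 1, "NGO": 2, "Military": 3}
df["targetscale"] = df["targtype1_txt"].map(target_scale)

# binary indicator: 1 if any of the three weapon slots names a firearm
weapon_cols = ["weaptype1_txt", "weaptype2_txt", "weaptype3_txt"]
df["weapfirearm"] = df[weapon_cols].eq("Firearms").any(axis=1).astype(int)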

[Image: binaries.png]


Preliminary Reduction: How Much Data, Exactly?

I worked with 141,967 incidents (before filtration) each having over 50 numerical and categorical variables, some null/missing. To deal with the missing data, depending on the type, I either threw away the entire incident row or used averaging techniques to extend data from the same incident in a way that wouldn’t mess up the statistics overall. Statistical concerns sometimes necessitated reframing the way I conceived the variables.

Geographical data is abundantly provided by the GTD. As well as the regional classifications, we have access to not only the country, state, province, and/or city, but even the exact longitude and latitude of the vast majority of the events. In fact, the presence of this information is what persuaded me to wrangle the entire dataset rather than sticking to the smaller file of only the events that occurred in 1993 (set aside by the GTD as a special year with its own documentation, due to a loss-of-data incident in the archives that distinguishes that year). I first tried and failed to open the data (.5 MB) in R. After a bit of looking around online I concluded that the first thing I needed to do was convert the xlsx file to a csv file via e.g. Python, and then it would be advisable to throw away any data that I would definitely not be using (i.e. make new files with a refined dataset). I had to put my grownup pants on and learn to selectively read and manipulate dataframes without opening the whole file in Excel.
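That conversion-and-trim step, sketched in pandas (file names are illustrative; the column names follow the GTD codebook):

import pandas as pd

# one-time conversion from xlsx to csv
pd.read_excel("globalterrorismdb.xlsx").to_csv("gtd.csv", index=False)

# thereafter, load only the columns in play instead of the whole sheet
cols = ["iyear", "suicide", "targtype1_txt", "latitude", "longitude"]
df = pd.read_csv("gtd.csv", usecols=cols)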

After all this sparse-data mining, here is where it would be appropriate to subset the sparse columns (event features) and use the Johnson-Lindenstrauss transform (JLT) to reduce dimensionality. I didn't actually wind up doing that, partly because after the alterations I mentioned, the data management turned out to be not that bad in terms of what my computer could handle.
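Had I gone that route, one concrete implementation is scikit-learn's sparse random projection; here is a hedged sketch, with a random matrix standing in for the real binary feature matrix:

import numpy as np
from sklearn.random_projection import SparseRandomProjection

# stand-in for the incidents-by-binary-features matrix described above
X = np.random.randint(0, 2, size=(1000, 200))

X_small = SparseRandomProjection(n_components=20, random_state=0).fit_transform(X)
print(X_small.shape)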

Some Preliminary Results


The first thing I wanted to check was whether terrorism is primarily isolated incidents by unaffiliated actors or whether it is the primary mode of warfare for many major organizations. I used R for this; the resulting graph has 440 nodes and 1156 edges. Note that many incidents involved more than two actors. The big components are who you might guess: ISIL, various Talibans, and Al-Qaeda. FARC was also a high-degree actor. I don't know whether some of these supposedly different organizations are just subsidiaries of their connected organizations, or what. I'm playing with a Gephi representation right now and I'll come back with some labeling so you all can see what's what. I'll tag some other famous groups like the ALF and ELF.

[Image: terrornetworks.png] Above: the network. Below: constructing the graph for igraph and Gephi.

[Images: perps, Edges.png]
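As a sketch of the construction step in python-igraph (the original was done in R and Gephi; the pairs here are made up for illustration):

import igraph as ig

# hypothetical (actor, actor) pairs extracted from incidents that list
# more than one perpetrator group
edges = [("Group A", "Group B"), ("Group B", "Group C")]
g = ig.Graph.TupleList(edges, directed=False)
print(g.vcount(), "nodes,", g.ecount(), "edges")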


Feature Extraction: PCA and K-Means Clustering

I got into PCA by watching this video demo. Really, this video is good enough and uses a clear enough example that I am delegating the task of saying what PCA is to it. But I'll sketch the idea here too.

PCA is data agnostic. There do exist "spatial PCAs" tailored to dimension reduction of "big" data while maintaining spatial correlations; see [2]. There is also precedent for factor extraction on census-type data; see [1]. For PCA on discrete data, see [3]. That's all stuff I still have to do, especially for the geographical data, which I'm eager to use.

I proceeded to attempt a less-tailored PCA as well as k-means clustering on the dataset to see what the archetypal incidents would be– that is, are there meaningful eigenincidents that represent archetypes of terrorism? I was wondering if there would be a significant correlation between geographical coordinates and method, varying with culture and resources. For example, we might find that one canonical type of incident takes place in Location X and involves firearms, hostages, and multiparty coordination, whereas another might be the suicide bombing of an individual in a public marketplace in Location Y.

Due to the differing scales of the data, it was particularly necessary to scale and center the data before proceeding with PCA. And all that binary data wasn't great for this "naive" PCA either, so I had to stash it for later. So let's take a look at what I got when I PCA'ed the prepared data in R.

[Image: pcacode.png]

Using some code from Thiago G. Martins’s data science blog.
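For readers not in R, here is a rough scikit-learn equivalent of the scale-center-then-PCA step (a sketch; the random matrix stands in for the numeric GTD feature matrix):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(1000, 8)  # stand-in for the numeric feature matrix

pca = PCA()
scores = pca.fit_transform(StandardScaler().fit_transform(X))

print(pca.components_)                # loadings: each row is one PC
print(pca.explained_variance_ratio_)  # share of variance per component

Note that sklearn's components_ is the transpose of R's prcomp rotation matrix: its rows are components rather than features.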

[Image: PCAterr.png]

To read what the PCA is telling us, we examine which features' (rows') absolute values are biggest within a given fixed principal component (one of the columns). Note the list of standard deviations at the top: the (feature, PC#) entries themselves are loadings (component weights), not proportions of total variance. What the principal components really are is this: find the linear combination of features that captures the most variance; then exclude the variance just "used" and iterate, N times in total. The resulting vectors are a linearly uncorrelated orthogonal basis of the feature space. It appears that the strongest correlation in the first principal component is between the event being an explosion/bombing and the use of explosive weapons– okay, at least this is a good sign that our calculations are working, because that correlation is practically tautological. And when that is the case, the attack is less likely to be an assault (attackassault ≈ -0.34) or involve firearms (weapfirearms ≈ -0.42). I had hoped for something more insightful, but it's a first run. I will experiment with excluding subsets of features from the PCA process. Let's take a look at how much variance each of these components accounts for, with all of the features included.

[Image: variance2] The mark "1" denotes PC1, and so on.

Approximately the first eight to ten principal components account for most of the variance. The first component is the dominant one. Then the second through fourth components could be considered the next ”batch”, and finally the fifth through [arguable final] components give almost all of the remaining variance in the dataset. Let’s look at other representations.


[Images: variance, stats.png]

We could also subset the data to compare variables that we suspect are correlated.

But that is way too many features for me to try to visualize in simulated 3D.

Below, I restricted the features to year, whether the attack was a suicide attack (those are usually bombings), and target type, in that order. This data was adjusted for individual variation of the variables before processing.

[Image: pcayll.png]

It appears that target type (remember, higher values are more state-like targets) is inversely correlated with suicidality of method: that is, as we increase the public nature of the act, we increase the chance of a suicidal terror act. This makes sense because suicide missions create a stir and disarm the public. The following figure illustrates how these three principal components constitute the overwhelming majority of the variance.

[Image: yllpca.png]


[Image: predict]

Still following the Martins tutorial, we use MATLAB to simulate "predicting" the tail end of our own data, the 113,117th and 113,118th incidents. Since the data is in chronological order, it only makes sense to force the year and just get predictions for latitude and longitude. The output is expressed as variance, so I still need to translate it back to coordinates and compare to the actual last incidents.

K-Means Clustering


Finding the ideal k for a k-means clustering is "the big question" in the procedure. To get a heuristic sense of what works for this dataset, we can experiment with various k. In this case it seems that 3 is better than 5: look how feeble some of the clusters are when we choose k = 5. Compare these images of two clustering implementations using MATLAB's cosine distance function.

[Image: cos5.png]
[Image: cos3.png]

The following are K-Means Clusterings with the subset of year, whether the attack was a suicide, and the target type scale. I used Euclidean distance.

[Image: ylleuc.png]

[Image: clustercorrs]

Making the silhouettes in MATLAB:

[Image: codesample]

Lifted directly from the good MathWorks documentation for k-means clustering.
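For a Python analogue of that compare-a-few-k workflow (a sketch; the random matrix stands in for the scaled year/suicide/target subset), the mean silhouette score gives a single number per candidate k:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Xs = np.random.rand(500, 3)  # stand-in for the scaled feature subset

for k in (3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
    print(k, round(silhouette_score(Xs, labels), 3))  # higher mean = tighter clusters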

Back to the Lone Wolf Thing

Around the time of the 2001 attacks on the WTC, an increase in suicide bombing attacks was already under way, and then in 2010 the violence took another drastic climb. I would speculatively infer that the high profiles of these events inspired many small-cell copycats, but high-profile events seemed to occur only when a local upward trend was already underway. The fever-chart graph is courtesy of the search feature on the University of Maryland's GTD page [4]. I don't know what accounts for the drop after 2007.

[Image: gtdsearch.png]

I'm going to mess around with the estimable Peter Langman's rampage shooter data soon and compare it to what I got here. Excited for that. That's all for right now.

[Image: refs.png – reference list for [1]–[4]]
By the way, you want to know the list of actors in the GTD database, right?

Disclaimer: I am utterly unfamiliar with the vast majority of these organizations. I'm not commenting on anyone's politics or the status of their organization, since I didn't collect this data myself. You can check out the methodology at the source site.