The following is inspired by (and based loosely on) a tutorial by data journalist Clara Guibourg: Network analysis of a Twitter hashtag using Gephi and NodeXL (hat-tip @KirkDBorne), worked up in a ‘Heath Robinson’ fashion since I don’t currently have Java or MS Excel installed on this laptop. (Java v.7+ is required to run Gephi, “the leading visualization and exploration software for all kinds of graphs and networks”; and, without Excel, not much use in trying to run the NodeXL Excel 2007+ template “that makes it easy to explore network graphs”.)
Graphs don’t just come in curves:
A standard Cartesian graph consists of a set of (x,y) co-ordinates (the points or vertices on the graph) and the relationship (the edges, arcs or lines) between them. The result is the graphed line, which may also be expressed as an algebraic function specifying the relationship (for example, in its simplest form: y = x).
A network graph is similarly made up of vertices (the points, also called nodes) and edges (the connections between them). It expresses how the vertices relate to one another and can be used in a host of data organisational and informational settings. You can read more about networks and graphs on Wikipedia: Graph Theory and Graph Drawing.
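To make the terminology concrete, here is a minimal sketch in Python (standard library only) of a tiny directed network. The user names and connections are invented purely for illustration and are not taken from the tutorial or from any real data; the edge list and adjacency list shown are just one common way of storing a graph.

```python
from collections import defaultdict

# Hypothetical edge list: each (source, target) pair is one connection,
# for example "alice mentions bob". All names are invented for illustration.
edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("dave", "carol"),
]

# The vertices (nodes) are every name that appears at either end of an edge.
vertices = sorted({name for edge in edges for name in edge})

# Adjacency list: for each vertex, the vertices it points to.
adjacency = defaultdict(list)
for source, target in edges:
    adjacency[source].append(target)

print(vertices)
for vertex in vertices:
    print(vertex, "->", adjacency[vertex])
```

An edge list like this is also roughly the shape of data that graphing tools tend to expect, which is why it makes a convenient intermediate format.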
This example looks at the relationships between all Twitter users posting about a particular subject (here, using a specific hashtag as the indicator). As Clara Guibourg points out in her tutorial: this is a great way to “find out who the most influential users are on a certain hashtag.”
Basic method modified from above tutorial:
- What question do we want to answer: “Who are the top 50 most influential Twitter users, in the last 12 months, for the hashtag #datascience?”
- Gather the data: pull together a data set based on the specified criteria (example: all tweets that include the required hashtag and were posted during the specified time period, 20-Mar-15 to 19-Mar-16)
- Import data into a readable format: (the above tutorial uses NodeXL to import tweets into Excel for easier viewing)
- Export data to chosen visualisation / graphing interface:
- Prepare data visualisation: manipulate data / graph functions as required to allow for better visualisation of connections (space the nodes further apart; set factors to indicate which node is central to the network, hence most important; partition the nodes into relevant groupings and add colouring by group; rank the nodes based on size of importance; re-space the graph to prevent overlaps and allow easier reading; add labels to nodes if desired)
- Publish data visualisation: either export as an image or PDF, or publish in interactive form using a JavaScript library (the above tutorial recommends the Sigma JS library for the latter)
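As a rough illustration of the middle steps above, without Gephi or NodeXL, the sketch below turns a handful of tweets into a mention network, ranks users, and writes an edge list as CSV. Several things here are assumptions rather than the tutorial's method: the sample tweets and the `user` / `text` field names are invented, the helper functions are hypothetical, and "influence" is approximated by in-degree (how many times a user is mentioned), which is only a simple stand-in for whatever measures a graphing tool might offer. The `Source` / `Target` column headers follow the convention commonly used for edge tables in Gephi's spreadsheet import, but check this against the tool you actually use.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample records; real data would come from the gathering step.
tweets = [
    {"user": "alice", "text": "Great intro to #datascience by @bob"},
    {"user": "carol", "text": "RT @bob: my #datascience reading list"},
    {"user": "bob", "text": "Thanks @alice! #datascience"},
    {"user": "dave", "text": "#datascience tips from @bob and @carol"},
]

MENTION = re.compile(r"@(\w+)")


def mention_edges(records):
    """Return (source, target) pairs: the author and each user they mention."""
    edges = []
    for record in records:
        for target in MENTION.findall(record["text"]):
            target = target.lower()
            if target != record["user"]:  # ignore self-mentions
                edges.append((record["user"], target))
    return edges


def rank_by_in_degree(edges, top_n=50):
    """Rank users by how often they are mentioned (a crude influence proxy)."""
    counts = Counter(target for _, target in edges)
    return counts.most_common(top_n)


def edges_to_csv(edges):
    """Serialise the edge list as CSV text with Source and Target columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()


edges = mention_edges(tweets)
ranking = rank_by_in_degree(edges)
print(ranking)
print(edges_to_csv(edges))
```

Mentions and retweets are treated identically here; a fuller version might weight them differently or count repeated edges between the same pair of users.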
Remembering the data science roadmap:
Define question – Define ideal data set – Determine accessible data – Obtain data – Clean data – Explore data – Stats prediction/modelling – Interpret results – Challenge results – Report results – Create reproducible code – Publish results to others [QDAOCEPICRRP] (Source: Leek/Johns Hopkins University: The Data Scientist’s Toolbox)