‘Big data’ in action: Linking 180 million tweets with 600,000 police records

Professor Matthew Williams and Dr Pete Burnap are directors of the ESRC-funded Social Data Science Lab that continues the successful COSMOS programme of work. The Lab forms part of the Data Innovation Research Institute, which will be housed within the new Social Science Research Park at Cardiff University.

Together with colleagues (Dr Luke Sloan and Professor Omer Rana) they recently presented their findings on the power of large social media datasets to 150 policymakers, academics and industry experts at the Data Science and Government Conference. The event, organised by the Behavioural Insights Team, looked at how emerging techniques in data science can best be used to support policy agendas in a range of areas.

Professor Matthew Williams and Dr Pete Burnap

Many would say there has been a lot of hype about the promise of Big Data and Data Science in government circles in recent years. The Data Science and Government Conference gave one of the first opportunities for presenters, from government and academia, to demonstrate how very large datasets are being put to use in real-world policy contexts to address a range of pressing questions and to introduce new efficiencies.

Recently the Cabinet Office developed a set of guidelines for the ethical use of big data in government projects. The publication of this document is important for two reasons:

  • it signals the recognition of new forms of data as valuable in addressing problems facing government
  • it acknowledges the technical, methodological and ethical complexities associated with big datasets.

Several projects using big administrative data have been piloted in government departments with results indicating improved efficiencies and more effective engagement with the public.


We were invited to the conference because our social data science analysis techniques, developed during our ESRC projects, have potential to be transferred to operational and policy contexts. We presented the Lab’s Risk & Safety and Cybersecurity research programmes, jointly funded by the ESRC, EPSRC, Metropolitan Police, Airbus and Welsh Government.

We noted that there is a clear need in the public and third sectors for a greater understanding of how these new forms of big social data can be marshalled to add value to existing practices. With the right checks and balances in place, and guided by conventional social science methodological and theoretical wisdom, social media data can complement and augment conventional government data to address existing problems and improve efficiencies.

To provide some context we presented work underpinning our recent ESRC impact acceleration grant that will see big data analysis techniques deployed in operational real-world settings.

In 2013 the Lab was awarded an ESRC NCRM grant to examine whether big social media data can help improve crime pattern estimation. The potential value added by social media data is that they are user-generated in real time and in large volumes. As such they can provide insight into the behaviour of populations on the move: the ‘pulse of the city’. This is in contrast to the necessarily retrospective snapshots of social trends and populations provided by conventional methods such as household surveys and officially recorded data.

Over a period of 12 months we collected 180 million geocoded tweets and close to 600,000 Metropolitan Police recorded crime incidents.

The research developed new data fusion techniques and improved upon existing mathematical models that have used social media data to predict voting patterns, the spread of disease, the revenue of Hollywood movies, and the estimates of the centres of earthquakes.
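To give a feel for what linking the two sources involves, here is a minimal sketch in Python. It is not the method used in the study. It simply assumes that geocoded tweets and recorded crimes are each counted within a shared grid cell and calendar month, so the two counts can be compared side by side. The cell size and the names `cell_key`, `count_by_cell` and `fuse` are hypothetical, chosen for illustration only.

```python
import math
from collections import Counter
from datetime import datetime

CELL_DEG = 0.01  # hypothetical grid cell size in degrees (an assumption)


def cell_key(lat, lon, timestamp, cell_deg=CELL_DEG):
    """Map a location and time to a (row, col, 'YYYY-MM') key."""
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return (row, col, timestamp.strftime("%Y-%m"))


def count_by_cell(records):
    """Count (lat, lon, timestamp) records per grid cell and month."""
    return Counter(cell_key(lat, lon, ts) for lat, lon, ts in records)


def fuse(tweets, crimes):
    """Return sorted (key, tweet_count, crime_count) rows over all keys."""
    tweet_counts = count_by_cell(tweets)
    crime_counts = count_by_cell(crimes)
    keys = sorted(set(tweet_counts) | set(crime_counts))
    return [(k, tweet_counts.get(k, 0), crime_counts.get(k, 0)) for k in keys]


if __name__ == "__main__":
    # Made-up example points, not real data.
    tweets = [
        (51.5074, -0.1278, datetime(2013, 8, 5)),
        (51.5079, -0.1271, datetime(2013, 8, 20)),
    ]
    crimes = [(51.5071, -0.1275, datetime(2013, 8, 11))]
    for row in fuse(tweets, crimes):
        print(row)
```

In practice the unit of analysis would more likely be an administrative area than a latitude/longitude grid, but the principle of bringing both sources onto a common spatial and temporal unit is the same.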

To date the default approach in big data research seems to have been wholly data-driven in the effort to predict.  However, without theory-driven data collection, transformation and analysis we cannot answer the substantive questions about social processes and mechanisms that concern us.

Purely data-driven approaches tend to produce models and algorithms that are overfitted to the idiosyncrasies of a particular dataset, leading to spurious results that often do not reflect reality.
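The overfitting problem can be shown with a toy example. The sketch below uses made-up numbers, not data from the study, and compares two hypothetical models: a "memoriser" that repeats the outcome of the nearest point it has seen, and a simple straight-line fit. The memoriser is perfect on the data it was built from but does worse on new points, because it has learned the noise as well as the signal.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memoriser(xs, ys):
    """Predict the outcome of the nearest training point (fits noise exactly)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: the underlying signal is y = 2x; the training outcomes carry
# alternating +1 / -1 noise, the held-out outcomes are noise-free.
train_x = [0, 2, 4, 6, 8]
train_y = [1, 3, 9, 11, 17]
test_x = [0.5, 2.5, 4.5, 6.5]
test_y = [1, 5, 9, 13]

line = fit_line(train_x, train_y)
memoriser = fit_memoriser(train_x, train_y)

if __name__ == "__main__":
    for name, model in [("line", line), ("memoriser", memoriser)]:
        print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

Checking a model against data it has not seen is one common safeguard, and it sits alongside the theory-driven checks discussed in this post rather than replacing them.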

To avoid this trap, we put a series of strict checks and balances in place, including augmenting big data with conventional sources and using theory to drive our analytical process.

We employed advanced statistical techniques that take into account variation in time and space, given that new forms of big data, like social media communications, occur in real time, unlike the conventional data the police are used to working with. The models developed allowed us to re-test classic criminological theories using new forms of data.

The results published in the British Journal of Criminology showed that tweets containing signatures of crime and disorder helped estimate actual patterns of recorded crime.

Tweets containing mentions of the breakdown of the local social and environmental order were more predictive of crime than conventional correlates such as unemployment and proportion of young people in an area.  These techniques, along with others that facilitate the detection of cyberhate on social media, are being deployed into operational settings where we hope they will achieve a real-world impact.

Funds from various ESRC programmes including Digital Social Research, Google Data Analytics, NCRM and Global Uncertainties have also enabled the Lab researchers to detect online racial tension following sporting events, model the propagation of cyberhate following a terrorist attack, and detect the presence of counter-speech as a form of online community-based regulation. Recently the Lab was successful in an application to the Innovation Panel of the Understanding Society longitudinal survey, allowing us to refine our estimations of social media user demographic characteristics.

You can follow @MattLWilliams and @pbFeed on Twitter. 

Further reading: Williams, ML, Burnap, P and Sloan, L (2016) Crime Sensing with Big Data: The Affordances and Limitations of Using Open Source Communications to Estimate Crime Patterns. British Journal of Criminology.
