Big Data and Bayesian Modelling Workshops

In this blog post two PhD students associated with the Geocomputation Hub – Alejandro Coca Castro and Mark de Jong – report back on workshops they recently attended. Alejandro attended a UK Data Service workshop and Mark an ESRC-funded advanced training course on Bayesian Hierarchical Models.

Hive with UK Data Service – Alejandro

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. In practice, researchers can face a big data challenge when a dataset cannot be loaded into a conventional desktop package such as SPSS, Stata or R. Besides being the curator of the largest collection of digital data in the social sciences and humanities in the UK, the UK Data Service is also currently organising a series of workshops focused on big data management. These workshops aim to help users manipulate the Service's databases (and other sources) better and more efficiently.


Keen to attend one of the UK Data Service’s workshops, I visited the University of Manchester on 24 June 2016 to participate in “The Big Data Manipulation using Hive”. In short, Hive™ is a tool that facilitates reading, writing and managing large datasets that reside in distributed storage using SQL (a special-purpose programming language). Although a variety of applications for accessing the Hive environment exist, the workshop trainers showed attendees a set of tools freely available for download (further details can be accessed from the workshop material). One advantage of Hive mentioned at the workshop is that it can be used flexibly from well-known programming languages such as R and Python.

Attending the workshop was an invaluable opportunity to learn more about the existing tools and procedures for optimising the manipulation of large datasets. In Geography, such large datasets are becoming more common and more accessible from multiple sources, e.g. Earth Observation data. Consequently, optimised and efficient data manipulation is key to identifying trends and patterns hidden in large datasets that traditional analyses of small data cannot reveal. To find out more yourself, consider joining the free UK Data Service Open Data Dive, scheduled for 24 September 2016 at The Shed, Chester Street, Manchester.

Alejandro Coca Castro

 

Bayesian Hierarchical Models – Mark

I recently attended an ESRC-funded advanced training course on spatial and spatio-temporal data analysis using Bayesian hierarchical models at the Department of Geography, Cambridge (convened by Prof. Bob Haining and Dr Guangquan Li), to gain an overview of Bayesian statistics and its applications to geographical modelling problems.

Compared to ‘classical’ frequentist statistical techniques, modern modelling approaches relying upon Bayesian inference are relatively new, despite being based upon principles first proposed in the works of Thomas Bayes in 1763. Since the 1990s, Bayesian methods have begun to be widely applied within the scientific community, as a result of both an increasing acceptance of the underpinning philosophy and increased computational power.

Thomas Bayes

In contrast to frequentist approaches, Bayesian methods represent processes using model parameters and their associated uncertainty in terms of probabilities. Using existing knowledge (e.g. from previous studies, expert knowledge, common sense, etc.) about a process or parameter of interest, a ‘prior distribution’ is established. This is then combined with a ‘likelihood’ (derived entirely from a dataset relating to the specific parameter) to produce a ‘posterior distribution’ – essentially an updated belief or opinion about the model parameter.
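To make the prior, likelihood and posterior concrete, here is a minimal sketch in Python. It was not part of the course material: the scenario and all the numbers are hypothetical, and it assumes the simplest textbook case, a Beta prior on a proportion combined with binomial count data, where the posterior is again a Beta distribution.

```python
from fractions import Fraction


def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Combine a Beta(prior_a, prior_b) prior with binomial data.

    For this conjugate pair the posterior is Beta(prior_a + successes,
    prior_b + failures), so the update is simple addition.
    """
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return Fraction(a, a + b)


# Hypothetical prior belief: the proportion is probably around 0.2.
prior = (2, 8)
# Hypothetical new data: 6 "successes" observed in 20 trials.
successes, failures = 6, 14

posterior = beta_binomial_update(*prior, successes, failures)

print("prior mean:    ", float(beta_mean(*prior)))
print("data estimate: ", successes / (successes + failures))
print("posterior mean:", float(beta_mean(*posterior)))
```

The posterior mean falls between the prior mean and the raw proportion in the data, which is the ‘updated belief’ described above.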

In some situations, Bayesian approaches can be more powerful than traditional methods because they:

  1. are highly adaptable to individual modelling problems;
  2. make efficient use of available evidence relating to a quantity of interest;
  3. can provide an easily interpreted quantitative output.

Historically, however, Bayesian approaches have been considered somewhat controversial, as the results of an analysis can depend heavily upon the choice of the prior distribution, and identification of the ‘best’ prior is often subjective. Arguably though, the existence of multiple justifiable priors may actually highlight additional uncertainty about a process that would be entirely ignored in a frequentist approach! Moreover, in many studies, it is common for researchers to make use of a ‘flat prior’, which gives equal weight to every possible parameter value, in order to reduce some of the subjectivity associated with prior selection.
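The effect of prior choice can be illustrated with a small grid approximation. This is only a sketch under assumed, hypothetical numbers (6 events in 20 trials, and an invented ‘sceptical’ prior that favours low values); it is not drawn from any of the studies mentioned in this post.

```python
def grid_posterior(prior_weights, successes, trials, grid):
    """Posterior on a grid: prior x binomial likelihood, then normalise."""
    unnormalised = [
        w * p ** successes * (1 - p) ** (trials - successes)
        for w, p in zip(prior_weights, grid)
    ]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def posterior_mode(posterior, grid):
    """Grid value with the highest posterior probability."""
    best = max(range(len(grid)), key=lambda i: posterior[i])
    return grid[best]


# Candidate values for the unknown proportion: 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]

# Flat prior: every candidate value gets the same weight.
flat_prior = [1.0] * len(grid)
# Hypothetical sceptical prior: weight falls away as the value increases.
sceptical_prior = [(1 - p) ** 8 for p in grid]

successes, trials = 6, 20  # hypothetical data

flat_post = grid_posterior(flat_prior, successes, trials, grid)
sceptical_post = grid_posterior(sceptical_prior, successes, trials, grid)

print("mode with flat prior:     ", posterior_mode(flat_post, grid))
print("mode with sceptical prior:", posterior_mode(sceptical_post, grid))
```

With the flat prior the most probable value coincides with the raw proportion in the data, whereas the sceptical prior pulls it downwards: same data, different conclusions.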


Hotspots in Peterborough with a persistently high risk of burglary, 2005/8,
as identified with a Bayesian spatio-temporal modelling approach.
[Kindly reproduced from Li et al. (2014) with author’s permission.]

As part of the course, we learned to use the WinBUGS software with a Markov Chain Monte Carlo approach to explore a variety of spatial modelling problems, including: identifying high-intensity crime areas in UK cities, investigating the relationship between exposure to air pollution and stroke mortality, and examining spatio-temporal variations in burglary rates. More information on the approaches taken in these studies can be found in Haining & Law (2007), Maheswaran et al. (2006), and Li et al. (2014).
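For readers unfamiliar with Markov Chain Monte Carlo, the idea can be sketched in a few lines of Python. This is not WinBUGS code and not the algorithm WinBUGS itself uses: it is a plain random-walk Metropolis sampler, one of the simplest MCMC methods, applied to a hypothetical one-parameter problem (a proportion with 6 events in 20 trials and an assumed Beta(2, 8) prior). All names and numbers are illustrative assumptions.

```python
import math
import random


def log_posterior(p, successes=6, failures=14, prior_a=2, prior_b=8):
    """Log of the unnormalised posterior: Beta prior x binomial likelihood."""
    if not 0.0 < p < 1.0:
        return -math.inf  # values outside (0, 1) are impossible
    return ((prior_a + successes - 1) * math.log(p)
            + (prior_b + failures - 1) * math.log(1.0 - p))


def metropolis(log_target, start, n_samples, step, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept or reject it."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


samples = metropolis(log_posterior, start=0.5, n_samples=20000, step=0.1)
kept = samples[2000:]  # discard the early 'burn-in' part of the chain
estimate = sum(kept) / len(kept)

print("MCMC estimate of posterior mean:", round(estimate, 3))
```

In this simple case the exact posterior mean is known to be 8/30, so the sampler's estimate can be checked against it. For the spatial models covered in the course no such closed-form answer is available, which is why simulation is needed.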

Overall, the course provided an engaging, hands-on overview of a powerful analytical framework that has extensive applications in quantitative geography. A comprehensive introduction to Bayesian analysis from a geographical perspective such as this is hard to find, and I would highly recommend that anyone with an interest in alternative approaches to spatio-temporal modelling attend this course in future years!

Mark de Jong