Big Data and Bayesian Modelling Workshops

In this blog post two PhD students associated with the Geocomputation Hub – Alejandro Coca Castro and Mark de Jong – report back on workshops they recently attended. Alejandro attended a UK Data Service workshop and Mark an ESRC-funded advanced training course on Bayesian Hierarchical Models.

Hive with UK Data Service – Alejandro

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. In practice, researchers can face a big data challenge when a dataset cannot be loaded into a conventional desktop package such as SPSS, Stata or R. Besides curating the largest collection of digital data in the social sciences and humanities in the UK, the UK Data Service is also currently organising a series of workshops focused on big data management. These workshops aim to promote better and more efficient manipulation of its databases (and other sources) by users.


Keen to attend one of the UK Data Service’s workshops, I visited the University of Manchester on 24 June 2016 to participate in “The Big Data Manipulation using Hive”. In short, Hive™ is a tool that facilitates reading, writing and managing large datasets that reside in distributed storage using SQL (a special-purpose programming language). Although a variety of applications for accessing the Hive environment exist, the workshop trainers showed attendees a set of tools that are freely available for download (further details can be found in the workshop material). One of the tool’s advantages is that it can be used from well-known programming languages such as R and Python.
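To give a flavour of the SQL-style queries involved, here is a minimal sketch. It is not taken from the workshop material and it does not connect to Hive: it uses Python’s built-in sqlite3 module as a local stand-in so that it runs without a cluster. The table name, column names and values are all hypothetical.

```python
import sqlite3

# Stand-in for a Hive table: a small in-memory SQLite table.
# Table, columns and values are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (region TEXT, year INTEGER, value REAL)")
rows = [
    ("north", 2014, 10.0),
    ("north", 2015, 12.0),
    ("north", 2016, 14.0),
    ("south", 2015, 20.0),
    ("south", 2016, 22.0),
    ("south", 2016, 24.0),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# Filter, group and aggregate in a single declarative statement.
query = """
SELECT region, COUNT(*) AS n, AVG(value) AS mean_value
FROM readings
WHERE year >= 2015
GROUP BY region
ORDER BY region
"""
summary = conn.execute(query).fetchall()
conn.close()

for region, n, mean_value in summary:
    print(region, n, mean_value)
```

The appeal of this style is that the query describes what is wanted (filter, group, average) rather than how to loop over the data, which leaves the engine free to decide how to spread the work over the storage.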

Attending the workshop was an invaluable opportunity to learn more about the existing tools and procedures for optimising the manipulation of large datasets. In Geography, such datasets are becoming more common and more accessible from multiple sources, e.g. Earth Observation data. Consequently, optimised and efficient data manipulation is key to identifying trends and patterns hidden in large datasets that traditional analyses of small data cannot reveal. To find out more yourself, consider joining the free UK Data Service Open Data Dive, scheduled for 24 September 2016 at The Shed, Chester Street, Manchester.

Alejandro Coca Castro

 

Bayesian Hierarchical Models – Mark

I recently attended an ESRC-funded advanced training course on spatial and spatio-temporal data analysis using Bayesian hierarchical models at the Department of Geography, Cambridge (convened by Prof. Bob Haining and Dr Guangquan Li), to gain an overview of Bayesian statistics and its applications to geographical modelling problems.

Compared to ‘classical’ frequentist statistical techniques, modern modelling approaches relying upon Bayesian inference are relatively new, despite being based upon principles first proposed in the works of Thomas Bayes in 1763. Since the 1990s, Bayesian methods have begun to be widely applied within the scientific community, as a result of both an increasing acceptance of the underpinning philosophy and increased computational power.

Thomas Bayes

In contrast to frequentist approaches, Bayesian methods represent processes using model parameters and their associated uncertainty in terms of probabilities. Using existing knowledge (e.g. from previous studies, expert knowledge, common sense etc.) about a process or parameter of interest, a ‘prior distribution’ is established. This is then combined with a ‘likelihood’ (derived entirely from a dataset relating to the specific parameter) to produce a ‘posterior distribution’ – essentially an updated belief or opinion about the model parameter.
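The prior-to-posterior update described above can be sketched with the simplest textbook case, which was not necessarily one used on the course: a Beta prior on a proportion, updated with binomial data. The prior parameters and the counts below are hypothetical.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Return the Beta posterior parameters after observing binomial data.

    With a Beta(a, b) prior and a binomial likelihood, the posterior is
    Beta(a + successes, b + failures).
    """
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical prior belief: the proportion is probably around 0.2.
prior_a, prior_b = 2.0, 8.0

# Hypothetical data: 30 events observed in 100 trials.
successes, failures = 30, 70

post_a, post_b = beta_binomial_update(prior_a, prior_b, successes, failures)

prior_mean = beta_mean(prior_a, prior_b)
data_proportion = successes / (successes + failures)
posterior_mean = beta_mean(post_a, post_b)

print(prior_mean, data_proportion, posterior_mean)
```

The posterior mean falls between the prior mean and the proportion seen in the data, which is the ‘updated belief’ in numerical form.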

In some situations, Bayesian approaches can be more powerful than traditional methods because they:

  1. are highly adaptable to individual modelling problems;
  2. make efficient use of available evidence relating to a quantity of interest;
  3. can provide an easily interpreted quantitative output.

Historically, however, Bayesian approaches have been considered somewhat controversial, as the results of any analysis are heavily dependent upon the choice of the prior distribution, and identification of the ‘best’ prior is often subjective. Arguably though, the existence of multiple justifiable priors may actually highlight additional uncertainty about a process that would be entirely ignored in a frequentist approach! Moreover, in many studies, it is common for researchers to make use of a ‘flat prior’ in order to reduce some of the subjectivity associated with prior selection.
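How much the prior matters can be checked numerically. The sketch below is an illustration under assumed numbers, not an analysis from the course: it compares a flat prior with a hypothetical sceptical prior, Beta(1, 9), for a small and a large dataset with the same observed proportion. The posterior is approximated on a grid of candidate values, working with logarithms to avoid numerical underflow.

```python
import math


def log_flat_prior(p):
    """Flat prior on (0, 1): every value of p is equally plausible."""
    return 0.0


def log_sceptical_prior(p):
    """Hypothetical informative prior, Beta(1, 9), favouring small p."""
    return 8.0 * math.log(1.0 - p)


def posterior_mean(log_prior, successes, trials, n_grid=2000):
    """Approximate the posterior mean of a proportion on a grid.

    Posterior is proportional to prior times binomial likelihood.
    Grid midpoints are used so that log(0) is never evaluated.
    """
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    log_post = [
        log_prior(p)
        + successes * math.log(p)
        + (trials - successes) * math.log(1.0 - p)
        for p in grid
    ]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return sum(p * w for p, w in zip(grid, weights)) / total


# Small dataset: 3 events in 10 trials (hypothetical).
small_flat = posterior_mean(log_flat_prior, 3, 10)
small_sceptical = posterior_mean(log_sceptical_prior, 3, 10)

# Large dataset with the same proportion: 300 events in 1000 trials.
large_flat = posterior_mean(log_flat_prior, 300, 1000)
large_sceptical = posterior_mean(log_sceptical_prior, 300, 1000)

gap_small = abs(small_flat - small_sceptical)
gap_large = abs(large_flat - large_sceptical)
print(gap_small, gap_large)
```

With only ten observations the two priors lead to noticeably different posterior means; with a thousand, the data dominate and the gap largely disappears. Prior choice is therefore most contentious when data are scarce.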


Hotspots in Peterborough with a persistently high risk of burglary, 2005/8,
as identified with a Bayesian spatio-temporal modelling approach.
[Kindly reproduced from Li et al. (2014) with author’s permission.]

As part of the course, we learned to use the WinBUGS software with a Markov chain Monte Carlo (MCMC) approach to explore a variety of spatial modelling problems, including: the identification of high intensity crime areas in UK cities, investigating the relationships between exposure to air pollution and stroke mortality, and examining the spatio-temporal variations in burglary rates. More information on the approaches taken in these studies can be found in Haining & Law (2007), Maheswaran et al. (2006), and Li et al. (2014).
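For readers new to MCMC, the general idea can be sketched in a few lines. This is a plain random-walk Metropolis sampler written for illustration; it is not WinBUGS code and makes no claim about which samplers WinBUGS uses internally. The target is the posterior for a proportion under a flat prior, with hypothetical data (30 events in 100 trials); the step size, seed and burn-in length are arbitrary choices.

```python
import math
import random


def log_posterior(p, successes=30, trials=100):
    """Log posterior (up to a constant) for a proportion with a flat prior."""
    if not 0.0 < p < 1.0:
        return -math.inf  # zero probability outside (0, 1)
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)


def metropolis(log_target, start, n_samples, step=0.05, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept or reject it."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected proposal repeats the current value
    return samples


samples = metropolis(log_posterior, start=0.5, n_samples=20000)
kept = samples[2000:]  # discard the early 'burn-in' draws
mcmc_mean = sum(kept) / len(kept)
print(mcmc_mean)
```

In this simple case the exact posterior is known, a Beta(31, 71) distribution, so the sampled mean can be checked against 31/102. The value of MCMC lies in models, such as hierarchical spatial ones, where no such closed form exists.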

Overall, the course provided an engaging, hands-on overview of a powerful analytical framework that has extensive applications in the field of quantitative geography. A comprehensive introduction to Bayesian analysis from a geographical perspective such as this is hard to find, and I would highly recommend this course in future years to anyone with an interest in alternative approaches to spatio-temporal modelling!

Mark de Jong


Research Associate – ABM, Food and Land Use

This week we started advertising a post-doctoral Research Associate position to work with James on a project looking at the global food system, local land use change and how they’re connected. The successful candidate will drive the development and application of an integrated computer simulation model that represents land use decision-making agents and food commodity trade flows as part of the Belmont Forum (NERC) funded project, ‘Food Security and Land Use: The Telecoupling Challenge’.

Telecoupling is the conceptual framework of socioeconomic and environmental interactions between coupled human and natural systems (e.g., regions, nations) over distances and across scales. Telecouplings take place through socioeconomic and/or biophysical processes such as trade, species invasions, and migration. For example, while a number of countries such as China have experienced a shift from net forest loss to net forest recovery, this forest transition has often come at the cost of deforestation in other countries, such as Brazil, where forested land is converted to meet global food demands for soybean and beef.

The goal of the project is to apply the telecoupling framework to understand the direct and collateral effects of feedbacks between food security and land use over long distances. To help achieve this, the successful candidate will contribute to the development and application of an innovative computer simulation model that integrates data and analysis to represent coupled human and natural system components across scales, including local land use decision-making agents and global food commodity trade flows.

We’re looking for a quantitative scientist with a PhD (awarded or imminent) or equivalent in Geography, Computer Sciences, Earth Sciences or another related discipline. You should have experience in computer coding for simulation model development, preferably including agent-based modelling. Previous experience studying land use/cover change processes and dynamics or food production, trade and security is desirable.

This is a full-time position, fixed-term for up to 18 months. The deadline for applications is midnight on 19 April 2016. Interviews are scheduled to be held in the week commencing 9 May 2016. For more details and how to apply, see http://www.jobs.ac.uk/job/ANG825/research-associate/ and direct questions to James via email: james.millington at kcl.ac.uk

If this doesn’t sound quite like your thing, maybe you would be interested in one of the other positions we currently have open (with application deadline 30 March).

Image credit: Liu et al. (2015) Science doi: 10.1126/science.1258832