CUSP London Seminar: Dani Arribas-Bel

This past Thursday we were lucky to catch Dani Arribas-Bel, Senior Lecturer in Geographic Data Science at the University of Liverpool and a major contributor to PySAL, on his way home following two weeks’ teaching in the Caribbean. Dani kindly agreed to give a two-part talk on “Infusing Urban and Regional analysis with Geographic Data Science” (‘GDS’), which we summarise below. As this was one of the first CUSP London-branded seminars, it was great to see so many Urban Informatics staff and students there (and even a few from UCL’s CASA!).


Geography & Computers

The first half of Dani’s talk covered highlights from a recently-published paper in Geography Compass titled “Geography & Computers: Past, Present, and Future” (an author pre-print is available via KCL’s Institutional Repository); in it, Dani and KCL’s Jon Reades link shifts in computing power and access to shifts in the ways in which geographers use computers to ‘do’ geography.

The basic contention is that there have been three waves of change that they (we) summarise as: 1) a computer in every institution (50s–70s); 2) a computer in every office (80s–00s); 3) a computer in every thing (10s–). We don’t need to revisit the article in full here since highlights are available in a previous blog post, but Dani’s focus was on the links to ‘data science’, the ‘sexiest job of the 21st century’.

This led into a discussion of ‘data-driven methods’, which, to a geographer, can sound like putting the cart before the horse. However, it’s important to keep in mind that we, as researchers, have little to no control over how the kinds of data underpinning a (geographic) data science are created, and therefore need to adapt our approach to the data, not the other way around.

I particularly appreciated Dani’s observation on the importance of data processing/handling as part of this shift: sometimes dismissed as ‘mere cleaning’, this stage is critical to ensuring that the data is both well-understood (shows what we think it shows) and fit-for-purpose (does what we want it to do).

I’ve seen the term ‘feature engineering’ pop up in my own news feeds with increasing regularity, and it has a nice ring to it (it’s engineering, not cleaning!), but it doesn’t quite capture the full scope of what good data science really entails. Nor does it take into account the ‘baking’ of geo-data that is required to ensure methods and models are appropriate.

Dani wrapped up this section with a discussion of how GDS can serve as the interface between geographers and data scientists, supporting the co-production of systems (a.k.a. tools), methods (spatially aware ML), and epistemologies (ways of knowing that are appropriate to these types of data).
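
As a concrete taste of the ‘tools’ side of that interface, here is a minimal sketch (my own illustration, not from the talk) using PySAL: it builds a contiguity-based spatial weights matrix and computes Moran’s I, a classic test for spatial autocorrelation. The shapefile name and the ‘income’ column are hypothetical.

    # Illustrative sketch: spatial weights + Moran's I with PySAL.
    # 'neighbourhoods.shp' and the 'income' column are hypothetical.
    import geopandas as gpd
    from libpysal.weights import Queen
    from esda.moran import Moran

    gdf = gpd.read_file("neighbourhoods.shp")
    w = Queen.from_dataframe(gdf)   # polygons sharing a border are neighbours
    w.transform = "r"               # row-standardise the weights
    mi = Moran(gdf["income"], w)    # global spatial autocorrelation
    print(mi.I, mi.p_sim)           # statistic and permutation-based p-value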

Applications of Geographic Data Science

The second half of Dani’s talk covered a work-in-progress using a large building data set from Spain to delineate urban and employment boundaries. This nicely illustrated one of the key concepts elaborated in the first half of the talk: the importance of data-driven methods in geographical data science.

The question Dani and his co-authors are exploring is how one can meaningfully delimit the spatial extent of urban areas and economic activity with the minimum number of prior assumptions about spatial configuration or ‘auxiliary geographies’. By the latter we mean using other steps or data, such as rasterisation or regional boundaries, to constrain the process to our preconceived notions of what the answer ‘should be’.

The issues with rasterisation and the MAUP are well known, but what do you do when you have 15 million data points to cluster and can no longer load the data set into memory? This is what we mean by data-driven methods: Dani’s exciting addition (which prompted a good deal of questioning from the audience) is a way to make an existing algorithm work in a large-data context while also working around what I feel is an important conceptual flaw in that algorithm, giving you insights into the robustness of your results!

Such a method is not without theory, nor without empirical input: Dani and his colleagues use research findings on commuting distances and employment to provide essential parameters. I’m not able to share additional details at this stage, but I’m really looking forward to seeing this algorithm ‘in the wild’ since it addresses a number of issues that I have with some work that I’m (slowly) undertaking…
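
Since the details aren’t public yet, the following is purely illustrative of the general approach rather than Dani’s method: density-based clustering such as DBSCAN can delineate ‘urban areas’ from raw point locations without imposing prior boundaries, and its neighbourhood radius (eps) is exactly the kind of parameter that empirical commuting-distance research could inform. All numbers below are invented.

    # Illustrative only: DBSCAN on simulated building locations.
    # This is NOT the algorithm from the talk; parameters are invented.
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs

    # Stand-in for real building coordinates (projected, in metres).
    coords, _ = make_blobs(n_samples=10_000, centers=5, cluster_std=1_000,
                           center_box=(0, 50_000), random_state=42)

    # eps: max distance (m) for two buildings to count as neighbours;
    # in principle this could be grounded in commuting-distance findings.
    labels = DBSCAN(eps=500, min_samples=20).fit_predict(coords)
    print(f"{labels.max() + 1} candidate urban areas")  # -1 labels are noise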

Big Data and Bayesian Modelling Workshops

In this blog post two PhD students associated with the Geocomputation Hub – Alejandro Coca Castro and Mark de Jong – report back on workshops they recently attended. Alejandro attended a UK Data Service workshop and Mark an ESRC-funded advanced training course on Bayesian Hierarchical Models.

Hive with UK Data Service – Alejandro

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. In practice, researchers can face a big data challenge when a dataset cannot be loaded into a conventional desktop package such as SPSS, Stata or R. Besides being the curator of the largest collection of digital data in the social sciences and humanities in the UK, the UK Data Service is also currently organising a series of workshops focused on big data management. These workshops aim to promote better and more efficient user manipulation of its databases (and other sources).


Keen to attend one of the UK Data Service’s workshops, I visited the University of Manchester on 24 June 2016 to participate in “Big Data Manipulation using Hive”. In short, Hive™ is a tool that facilitates reading, writing and managing large datasets that reside in distributed storage, queried using HiveQL, an SQL-like language. Although a variety of applications for accessing the Hive environment exist, the workshop trainers showed attendees a set of tools freely available for download (further details can be accessed from the workshop material). One of the advantages mentioned is the tool’s flexibility: it can be driven from well-known programming languages such as R and Python.
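
To give a flavour of that flexibility, here is a minimal sketch of querying Hive from Python using the PyHive package. It assumes a HiveServer2 instance on localhost:10000 and a table called ‘census’ – both hypothetical.

    # Minimal sketch: querying Hive from Python via PyHive.
    # The server address and the 'census' table are hypothetical.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)
    cursor = conn.cursor()

    # HiveQL looks like ordinary SQL, but the query runs over
    # distributed storage, so the table never has to fit in local memory.
    cursor.execute("SELECT region, COUNT(*) FROM census GROUP BY region")
    for region, n in cursor.fetchall():
        print(region, n)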

My attendance at the workshop was an invaluable opportunity to gain further knowledge of the existing tools and procedures for optimising the manipulation of large datasets. In the Geography domain, such large datasets are becoming more common and accessible from multiple sources, e.g. Earth Observation data. Consequently, optimised and efficient data manipulation is key to identifying trends and patterns hidden in large datasets that traditional analyses of small data cannot reveal. To find out more yourself, consider joining the free UK Data Service Open Data Dive, scheduled for 24 September 2016 at The Shed, Chester Street, Manchester.

Alejandro Coca Castro


Bayesian Hierarchical Models – Mark

I recently attended an ESRC-funded advanced training course on spatial and spatio-temporal data analysis using Bayesian hierarchical models at the Department of Geography, Cambridge (convened by Prof. Bob Haining and Dr Guangquan Li), to gain an overview of Bayesian statistics and its applications to geographical modelling problems.

Compared to ‘classical’ frequentist statistical techniques, modern modelling approaches relying upon Bayesian inference are relatively new, despite being based upon principles first proposed in the work of Thomas Bayes, published posthumously in 1763. Since the 1990s, Bayesian methods have increasingly been applied within the scientific community, as a result of both an increasing acceptance of the underpinning philosophy and increased computational power.

Thomas Bayes

In contrast to frequentist approaches, Bayesian methods represent model parameters and their associated uncertainty in terms of probabilities. Using existing knowledge about a process or parameter of interest (e.g. from previous studies, expert knowledge, common sense, etc.), a ‘prior distribution’ is established. This is then combined with a ‘likelihood’ (derived entirely from a dataset relating to the specific parameter) to produce a ‘posterior distribution’ – essentially an updated belief about the model parameter.
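
As a toy illustration (mine, not the course’s) of this prior-to-posterior update, consider estimating a rate from binary outcomes using the conjugate Beta-Binomial pair, where the posterior has a closed form. All numbers are invented.

    # Toy prior -> posterior update with a conjugate Beta-Binomial pair.
    # Posterior is Beta(a + successes, b + failures); numbers are invented.
    from scipy.stats import beta

    a, b = 2, 2                  # prior: the rate is probably near 0.5
    successes, trials = 7, 10    # observed data (drives the likelihood)

    posterior = beta(a + successes, b + (trials - successes))
    print(posterior.mean())          # updated belief about the rate
    print(posterior.interval(0.95))  # 95% credible interval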

In some situations, Bayesian approaches can be more powerful than traditional methods because they:

  1. are highly adaptable to individual modelling problems;
  2. make efficient use of available evidence relating to a quantity of interest;
  3. can provide an easily interpreted quantitative output.

Historically, however, Bayesian approaches have been considered somewhat controversial, as the results of any analysis are heavily dependent upon the choice of prior distribution, and identifying the ‘best’ prior is often subjective. Arguably though, the existence of multiple justifiable priors may actually highlight additional uncertainty about a process that would be entirely ignored in a frequentist approach! Moreover, in many studies it is common for researchers to use a ‘flat prior’ in order to reduce some of the subjectivity associated with prior selection.


Hotspots in Peterborough with a persistently high risk of burglary, 2005/8,
as identified with a Bayesian spatio-temporal modelling approach.
[Kindly reproduced from Li et al. (2014) with author’s permission.]

As part of the course, we learned to use the WinBUGS software, which implements a Markov chain Monte Carlo (MCMC) approach, to explore a variety of spatial modelling problems, including: the identification of high-intensity crime areas in UK cities, investigating the relationships between exposure to air pollution and stroke mortality, and examining spatio-temporal variations in burglary rates. More information on the approaches taken in these studies can be found in Haining & Law (2007), Maheswaran et al. (2006), and Li et al. (2014).
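
For readers who prefer Python to WinBUGS, here is a toy analogue of a Bayesian hierarchical model written in PyMC: per-area counts sharing a partially pooled rate. It is purely illustrative, with simulated data, and does not reproduce any of the cited studies.

    # Toy hierarchical model in PyMC (an analogue, not WinBUGS code).
    # All data are simulated; nothing reproduces the cited studies.
    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(1)
    counts = rng.poisson(rng.gamma(5.0, 2.0, size=20))  # fake area counts

    with pm.Model():
        mu = pm.Gamma("mu", alpha=2.0, beta=0.2)   # shared mean rate (prior)
        sigma = pm.HalfNormal("sigma", 5.0)        # between-area variation
        # Area-level rates drawn from a common distribution: the 'hierarchy'.
        theta = pm.Gamma("theta", mu=mu, sigma=sigma, shape=20)
        pm.Poisson("obs", mu=theta, observed=counts)  # likelihood
        trace = pm.sample(1000, tune=1000)            # MCMC sampling

    print(trace.posterior["theta"].mean(dim=("chain", "draw")))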

Overall, the course provided a very engaging, hands-on overview of a very powerful analytical framework that has extensive applications in the field of quantitative geography. A comprehensive introduction to Bayesian analysis from a geographical perspective such as this is hard to find, and I would highly recommend that anyone with an interest in alternative approaches to spatio-temporal modelling should attend this course in future years!

Mark de Jong

Research Associate – ABM, Food and Land Use

This week we started advertising a post-doctoral Research Associate position to work with James on a project looking at the global food system, local land use change and how they’re connected. The successful candidate will drive the development and application of an integrated computer simulation model that represents land use decision-making agents and food commodity trade flows as part of the Belmont Forum (NERC) funded project, ‘Food Security and Land Use: The Telecoupling Challenge’.

Telecoupling is a conceptual framework for socioeconomic and environmental interactions between coupled human and natural systems (e.g., regions, nations) over distances and across scales. Telecouplings take place through socioeconomic and/or biophysical processes such as trade, species invasions, and migration. For example, while a number of countries such as China have experienced a shift from net forest loss to net forest recovery, this forest transition has often come at the cost of deforestation in other countries, such as Brazil, where forested land is converted to meet global food demand for soybean and beef.

The goal of the project is to apply the telecoupling framework to understand the direct and collateral effects of feedbacks between food security and land use over long distances. To help achieve this the successful candidate will contribute to the development and application of an innovative computer simulation model that integrates data and analysis to represent coupled human and natural system components across scales, including local land use decision-making agents and global food commodity trade flows.
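
To make the coupling concrete, here is a deliberately toy sketch (invented numbers, my own illustration of the general idea, not the project’s model) in which land-use agents respond to a global commodity price and their aggregate production feeds back into that price.

    # Toy sketch of land-use agents coupled to a global commodity market.
    # Entirely invented; not the project's model.
    import random

    random.seed(0)
    FOOD_DEMAND = 60.0  # hypothetical global demand (arbitrary units)

    class LandAgent:
        def __init__(self):
            self.farming = random.random() < 0.5  # farmland vs. forest

        def decide(self, price):
            # Convert to farmland when prices are high; revert when low.
            # Asymmetric thresholds give land-use change some inertia.
            self.farming = price > (0.8 if self.farming else 1.2)

    agents = [LandAgent() for _ in range(100)]
    price = 1.0
    for step in range(20):
        supply = sum(a.farming for a in agents)  # one unit per farm
        price = FOOD_DEMAND / max(supply, 1)     # crude market clearing
        for a in agents:
            a.decide(price)

    print(f"final price {price:.2f}, farms: {sum(a.farming for a in agents)}")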

We’re looking for a quantitative scientist with a PhD (awarded or imminent) or equivalent in Geography, Computer Sciences, Earth Sciences or other related discipline. You should have experience in computer coding for simulation model development, preferably including agent-based modelling. Previous experience studying land use/cover change processes and dynamics or food production, trade and security is desirable.

This is a full-time, fixed-term position for up to 18 months. The deadline for applications is midnight on 19 April 2016. Interviews are scheduled to be held in the week commencing 9 May 2016. For more details and how to apply, see http://www.jobs.ac.uk/job/ANG825/research-associate/ and direct questions to James via email: james.millington at kcl.ac.uk

If this doesn’t sound quite like your thing, maybe you would be interested in one of the other positions we currently have open (with application deadline 30 March).

Image credit: Liu et al. (2015) Science doi: 10.1126/science.1258832