Understanding Gentrification through ML

Although it has taken rather a long time to see the light of day, our just-published paper is one of the reasons I love my job: drawing on a mix of data science and deep geographical knowledge, we look at the role that new Machine Learning (ML) techniques – normally seen as just a ‘black box’ for making predictions – can play in helping us to develop a deeper understanding of gentrification and neighbourhood change. For those of a ‘TL;DR’ nature (or without the privilege of an institutional subscription!), we wanted to share some of our key ideas in a more accessible format.

Background

We all know that London Fields has gentrified. And Brixton. And Peckham Rye. And… seemingly everywhere in London and quite a few places besides (e.g. Brighton, parts of Manchester, etc.). Often, these areas show up on our radar after having been featured in the news, thanks to the careful efforts of activists and academics working through ‘case studies’ to theorise and engage.

But that approach comes with one important downside: we are at risk of focussing too much on a few ‘signifying’ locations. Or, as Neal et al. (2016) wittily put it: ‘You can’t move in Hackney without bumping into an anthropologist’. So qualitative strategies may overlook similar changes happening elsewhere but in less photogenic/well-connected locations (Barton 2016). What we wanted to do was to see if recent advances in Machine Learning could help us to see the ‘bigger picture’ at the London scale and come to a deeper understanding of how this process worked… and where it might spread to next!

About the Method

To employ Machine Learning (ML) algorithms, we need a combination of training (i.e. learning) and testing (i.e. prediction) stages. We approached this in two steps:

  1. To train the algorithm in the first place, we gave it access to 80% of the data for the period 2001 to 2011, and then checked how well it was doing by testing its ‘predictions’ against the remaining 20% of the data it hadn’t yet seen. Repeating this process several times helps to avoid ‘overfitting’, in which the algorithm pays too much attention to outliers or other rare cases in the data at the expense of good overall accuracy (see the sketch just after this list).
  2. We then used this trained algorithm to make predictions for the period between 2011 and 2021. There are some major caveats here (and we note them in the article), but this gives us a surprisingly robust way to peer into a murky future.
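
For readers who want to see what this looks like in practice, here is a minimal sketch of the train/test procedure using scikit-learn. The synthetic data and parameter choices below are illustrative assumptions, not the paper’s actual variables or pipeline:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in for 2001 neighbourhood predictors (X) and the observed
# 2001-2011 change score (y); synthetic, purely for illustration.
X, y = make_regression(n_samples=1000, n_features=20, noise=15.0,
                       random_state=42)

# Hold back 20% of the data that the model never sees while training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=500, random_state=42)

# Repeating the split several times (k-fold cross-validation) guards
# against overfitting to the quirks of any single train/test split.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
print(f'Mean cross-validated R²: {cv_scores.mean():.3f}')

# Final check against the held-out 20%.
model.fit(X_train, y_train)
print(f'Held-out R²: {model.score(X_test, y_test):.3f}')
```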

We used an approach called a Random Forest to make these predictions. To stretch an analogy to the breaking point: if 20 Questions is a good example of a single Decision Tree, then using a Random Forest is like having the chance to play the same game of 20 Questions a thousand times over in the hopes of finding the ‘right’ answer. It’s (a lot) more complicated than that, but it’s a good first approximation and we try to explain it in reader-friendly detail in the article.
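
To make the analogy slightly more concrete, here is a minimal sketch (again using scikit-learn, with toy data) showing that a Random Forest really is just a large collection of Decision Trees whose answers are averaged:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy regression data standing in for neighbourhood-level predictors.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=0)

# One game of 20 Questions: a single decision tree.
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Many games at once: each tree in the forest sees a random resample
# of the data and a random subset of the 'questions' at each split.
forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# The forest's prediction is simply the average across its trees.
manual_avg = np.mean([t.predict(X[:1]) for t in forest.estimators_])
print(tree.predict(X[:1]), forest.predict(X[:1]), manual_avg)
```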

Replication

One thing that was really important to us is that the data and the code be freely accessible to anyone who wants to replicate our results, or to experiment with adding or removing data, transformations, or algorithms. So, if you know a bit about how to set up Python, all of our code is available on GitHub (https://github.com/jreades/urb-studies-predicting-gentrification), along with instructions on how to configure the Anaconda Python (3.6) environment.

Results

Without going into all the gory details (TL;DR), we find that the Random Forest algorithm outperforms multiple regression by about 10%, with an R² of 0.699 vs 0.639 when predicting change between 2001 and 2011. So there’s still plenty of room for improvement (we note a few areas), but it’s already a meaningful difference. The MSE and MAE are also quite a bit better, especially for the tuned Random Forest. As with regression, we also get some useful information about which variables matter most for making these predictions:

[Figure: Variable importance in the Random Forest model (3_Predictions - Variable Importance.png)]
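
For the curious, the chart above is derived from the fitted forest’s feature_importances_ attribute. A hedged sketch of both the model comparison and the importance extraction – with synthetic stand-in data rather than the paper’s Census variables – might look like this:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=15.0,
                       random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit both models and compare them on the same held-out data.
for name, model in [('OLS regression', LinearRegression()),
                    ('Random Forest', RandomForestRegressor(
                        n_estimators=500, random_state=42))]:
    pred = model.fit(X_train, y_train).predict(X_test)
    print(f'{name}: R²={r2_score(y_test, pred):.3f} '
          f'MSE={mean_squared_error(y_test, pred):.1f} '
          f'MAE={mean_absolute_error(y_test, pred):.1f}')

# The 'variable importance' chart is just the forest's normalised
# estimate of how much each feature reduces prediction error.
forest = RandomForestRegressor(n_estimators=500, random_state=42)
forest.fit(X_train, y_train)
importances = pd.Series(forest.feature_importances_,
                        index=[f'var_{i}' for i in range(X.shape[1])])
print(importances.sort_values(ascending=False).head(10))
```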

We could use these results to do things like reduce the number of variables used in a new (faster) model, but that’s not particularly relevant to our needs here. Being a geographer, of course, I ultimately want to see what this all looks like on a map! So below we see the difference in our ‘gentrification score’ over the two time periods: how did areas change between 2001 and 2011, and how will they change between 2011 and 2021? [Important caveats aside.]

[Figure: Percentile change in gentrification score, 2001–2011 (5_Percentile-Change-2001-11.png)]

[Figure: Predicted percentile change in gentrification score, 2011–2021 (6_Percentile-Change-2011-21.png)]

Note that we’ve removed changes within ±1 standard deviation, since these are more readily attributable to random fluctuations (i.e. noise) than to meaningful change (i.e. signal).
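
For anyone replicating the maps, a minimal sketch of that masking step (with randomly generated scores standing in for the real data) might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical distribution of neighbourhood score changes.
changes = pd.Series(np.random.normal(0, 5, size=1000))

# Treat anything within +/-1 standard deviation of the mean as noise.
mu, sigma = changes.mean(), changes.std()
signal = changes.where((changes - mu).abs() > sigma)

# 'signal' is now NaN for the ~68% of areas whose change is
# indistinguishable from random fluctuation; only the tails get mapped.
print(f'{signal.notna().mean():.0%} of areas show meaningful change')
```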

Wrap-up

If you live in London (or have the benefit of hindsight) then some of these predictions might seem fairly obvious because they have already happened as of 2018, but it is worth recognising that the preconditions of these changes must have been in place by 2011 for these predictions to be possible! In other words, had we had access to this data in 2011, then we might have been able to do something about it! If we could incorporate more ‘timely’ data – such as from Zoopla (a property price website) or Twitter (useful as a marker of cultural change) – then we could begin to develop the next stage of the real-time ‘early warning system’ anticipated by Chapple and Zuk (2016). That is one reason we feel so strongly that closer engagement between quantitative and qualitative methods in this area would benefit both branches of research.

We are not claiming to have ‘solved’ neighbourhood change, nor are we suggesting that our approach supersedes the on-the-ground work undertaken by so many before, but we do hope that, in making these predictions about change in London, we are ultimately able to identify the ways that improvement or regeneration can occur without incurring displacement or disconcerting social change. We actually hope that our predictions will be wrong, but for all the right reasons…

P.S. Image credit for featured image: https://atlantablackstar.com/2015/02/20/10-us-cities-where-gentrification-is-happening-the-fastest/

Geography & Computers: Past, present, and future

I’m really pleased to share a piece that Dani Arribas-Bel and I recently co-authored in Geography Compass on the sometimes fraught relationship between (human) geography and computers, advocating for the creation of a Geographic Data Science. For those of a ‘TL;DR’ nature (or without the privilege of an institutional subscription!), we wanted to share some of our key ideas in a more accessible format.

MoDS: Mapping Knowledge with Data Science (MSc + PhD Studentship)

Although we had some great responses to our initial call, we’re still looking for the ‘right’ candidate for this fully-funded studentship, which is open to both undergraduate finalists and completing Masters students. The project involves the application of data science techniques (text-mining, topic modelling, graph analysis) to a large, rich data set of 450,000+ PhD theses in order to understand the evolving geography of academic knowledge production: how are groundbreaking ideas produced and circulated, and how do researcher mobility and institutional capacity shape this process? We’re looking for a great candidate (see ‘pathways’ below) with a demonstrable interest in interdisciplinary research – you will be working in collaboration with the British Library at the intersection between geography, computer science, and the humanities, and this will present unique challenges (and opportunities!) that call for resourcefulness, curiosity, and intellectual excellence.

MoDS: Mapping Knowledge with Data Science

I’m really excited to announce the latest addition to our growing stable of computational geography research: a fully-funded ESRC studentship involving the application of cutting-edge techniques (text-mining, topic modelling, graph analysis) to a large, rich data set of 450,000 PhD theses in order to understand the evolving geography of academic knowledge production: how are groundbreaking ideas produced, circulated, and ultimately superseded, and how do issues such as researcher mobility and institutional capacity shape this process?

We’re looking for a stellar candidate (either undergraduate or Masters-level) with a demonstrable interest in interdisciplinary research – you will be working at the intersection between disciplines and this will present unique challenges (and opportunities!) that call for resourcefulness, curiosity, and intellectual excellence.

Project Overview

The British Library manages EThOS, the national database of UK doctoral theses, which enables users to discover and access theses for use in their own research. But the almost complete aggregation of metadata about more than 450,000 dissertations also enables us to begin asking very interesting questions about the nature and production of knowledge in an institutional and geographic context across nearly the entire U.K., and this anchors the project in quintessentially social science questions about the impact of individuals, work, and mobility on organisations and cultures.

However, textual data at this scale can only be interpreted and navigated through ‘distant reading’ approaches; so although it remains rooted in the interests and episteme of the social sciences, the research involves genuinely interdisciplinary work at the interfaces with both the natural sciences and the (digital) humanities! At its heart, this project is therefore an exciting example of ‘computational social science’ (Lazer et al. 2009), in that it involves the application of cutting-edge computational techniques to large, rich data sets of human behaviour.

Ultimately, this project seeks to understand changes in the U.K. geography of academic knowledge production over time and across two or more disciplines. All applicants are therefore expected to demonstrate an interest in the underlying social science research questions and (at a minimum) basic competence in programming. Additionally, the successful applicant for the 1+3 route would be expected to successfully complete King’s MSc Data Science programme, while the successful +3 applicant would be expected to demonstrate a degree of existing facility with core analytical approaches.

For more information on the project, please see here.

Studentship type

1+3 (1 year Masters + 3 year PhD) or +3 (PhD only), subject to the candidate’s existing academic/professional background. For applicants with a social science background, we suggest King’s MSc Data Science programme. For applicants with a natural science background, we will need to discuss how best to achieve a grounding in the social sciences.

Application deadline

31 January 2018