I’m really excited to announce the latest addition to our growing stable of computational geography research: a fully-funded ESRC studentship involving the application of cutting-edge techniques (text-mining, topic modelling, graph analysis) to a large, rich data set of 450,000 PhD theses in order to understand the evolving geography of academic knowledge production: how are groundbreaking ideas produced, circulated, and ultimately succeeded, and how do issues such as researcher mobility and institutional capacity shape this process?
We’re looking for a stellar candidate (either undergraduate or Masters-level) with a demonstrable interest in interdisciplinary research – you will be working at the intersection between disciplines and this will present unique challenges (and opportunities!) that call for resourcefulness, curiosity, and intellectual excellence.
The British Library manages EThOS, the national database of UK doctoral theses, which enables users to discover and access theses for use in their own research. But the almost complete aggregation of metadata about more than 450,000 dissertations also enables us to begin asking very interesting questions about the nature and production of knowledge in an institutional and geographic context across nearly the entire U.K., and this anchors the project in quintessentially social science questions about the impact of individuals, work, and mobility on organisations and cultures.
However, textual data of this scale is solely interpretable and navigable through ‘distant reading’ approaches; so although it remains rooted in the interests and episteme of the social sciences, the research involves genuinely interdisciplinary work at the interfaces with both the natural sciences and the (digital) humanities! At its heart, this project is therefore an exciting example of ‘computational social science’ (Lazer et al. 2009) in that it involves the application of cutting-edge computational techniques to large, rich data sets of human behaviour.
Ultimately, this project seeks to understand changes in the U.K. geography of academic knowledge production over time and across two or more disciplines. All applicants are therefore expected to demonstrate an interest in the underlying social science research questions and (at a minimum) basic competence in programming. Additionally, the successful applicant for the 1+3 route would be expected to successfully complete King’s MSc Data Science programme, while the successful +3 applicant would be expected to demonstrate a degree of existing facility with core analytical approaches.
For more information on the project, please see here.
1+3 (1 year Masters + 3 year PhD) or +3 (PhD only), subject to candidate’s existing academic/professional background. For applicants with a social science background we are suggesting King’s MSc Data Science programme. For applicants with a natural science background we will need to discuss how best to achieve a grounding in the social sciences.
31 January 2018
PhD student and King’s Geocomputation member Alejandro Coca-Castro attended Europe’s premier geosciences event, the European Geosciences Union (EGU) General Assembly, in Vienna, Austria (April 24th – 28th 2017). In addition to presenting his preliminary PhD results in the session “Monitoring the Sustainable Development Goals with the huge Remote Sensing archives”, Alejandro kindly dedicated part of his time at EGU to capturing the emerging Geocomputation fields applied to the Geosciences, in particular to land and biosphere research. In this post Alejandro summarises the latest advances in Big Data technologies presented at EGU, which he sees as one of the two main emerging fields revolutionizing data-driven analysis and knowledge production.
Well-known remote sensing data producers such as the European Space Agency (ESA) and NASA are developing a wide range of data products relevant to understanding land surface processes and atmospheric phenomena as well as human-caused changes. However, although there is an unprecedented variety of long-term monitoring data, it remains challenging to understand exchange processes between the atmosphere and the terrestrial biosphere. To overcome this issue, ‘Big Data’ technologies are being proposed to tackle the question of how to simultaneously explore multiple Earth Observations (EOs).
Fig 1. Emerging Big Data technologies make it possible to co-explore multiple datasets with different characteristics and under different assumptions in a more efficient and faster manner than traditional data management technologies. Source: M. Mahecha (2017) https://doi.org/10.6084/m9.figshare.4822930.v2
Amongst all the collaborative initiatives presented at EGU, the Earth System Data Cube project, led by the Max Planck Institute for Biogeochemistry and funded by ESA, presented an emerging platform (E-Lab). The project aims to maximize the usage of ESA-EOs and other relevant data streams. The main idea behind the E-Lab’s approach to maximizing data-stream usage is the so-called ‘Data Cube’ concept. This ‘cube’ concept enables handling and extracting information for a given georeferenced dataset, optimising the management of its spatial and time dimensions. These dimensions are used to split data into smaller sub-cubes of varying dimensions. In this way, dimensions X and Y are the spatial dimensions (i.e., latitude and longitude). The third dimension corresponds to time; the fourth is the multiple variables or data streams themselves. All data uploaded into E-Lab fall under the elegant and efficient ‘Data Cube’ umbrella, and simultaneous exploration is made possible mainly by a set of predefined preprocessing rules applied during the data ingestion process.
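To make the sub-cube idea a little more concrete, here is a minimal Python sketch of how a four-dimensional cube (latitude, longitude, time, variable) might be split into smaller sub-cubes. It is only an illustration: the function name `split_into_subcubes`, the cube size and the chunk sizes are all assumptions made up for this example, not the scheme the E-Lab actually uses.

```python
from itertools import product


def split_into_subcubes(shape, chunk):
    """Yield one tuple of slices per sub-cube covering a cube of `shape`.

    `shape` and `chunk` give sizes per dimension, here (lat, lon, time, variable).
    Sub-cubes at the edges are smaller when a dimension is not an exact
    multiple of its chunk size.
    """
    if len(shape) != len(chunk):
        raise ValueError("shape and chunk must have the same number of dimensions")
    per_axis = [
        [slice(start, min(start + step, size)) for start in range(0, size, step)]
        for size, step in zip(shape, chunk)
    ]
    yield from product(*per_axis)


# Hypothetical cube: 180 x 360 grid cells, 46 time steps, 3 variables.
cube_shape = (180, 360, 46, 3)
# Hypothetical chunking: 90 x 90 tiles, the full time series, one variable each.
chunk_shape = (90, 90, 46, 1)

subcubes = list(split_into_subcubes(cube_shape, chunk_shape))
print(len(subcubes))  # 2 * 4 * 1 * 3 = 24 sub-cubes
print(subcubes[0])    # slices for the first sub-cube
```

Each tuple of slices could then be used to index an array holding the cube, so that one sub-cube at a time is read and processed rather than the whole dataset.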
Fig 2. Representation of the ‘Data Cube’ concept and its related structure applied to three different data sets (V1, V2, V3). Source: Earth System Data Cube (2017).
E-Lab provides scientists with a virtual online laboratory where the ‘Data Cube’ can be explored, standard processing chains can be examined, and new workflows can be tested. JupyterHub is the underlying framework of the platform. This makes it simple for users to work on the data cube using the popular Jupyter notebook, which supports high-level programming languages (mainly Julia and Python, with some support for R, although the latter is somewhat underdeveloped at this stage).
The Earth System Data Cube initiative is a pioneering project offering an open and free-of-charge collaborative virtual platform with a solid background in the analysis of large datasets and a sound understanding of the Earth System. However, a challenge remains with regard to the standards of data infrastructure, metadata and sharing protocols for existing and incoming projects, whether private or public, supported by the ‘Data Cube’ concept. A first step towards tackling this concern is being led by the EarthCube and NextGEOSS initiatives, which were also part of this year’s transdisciplinary EGU programme.
Interested in how Crowdsourcing is revolutionizing the way to collect/extract knowledge for data-driven analysis? If so, look out for a blog post on the topic right here this coming Friday.
The author is grateful to the Geography Department Small Grants and the P4GES: Can Paying for Global Ecosystem Services reduce poverty? project for providing funding for his successful attendance at the EGU General Assembly. Revision of English version by Sarah Jones and content by Miguel Mahecha.
For updates about Alejandro’s research follow @alejo_coca on twitter.
Introducing a new member of King’s Geocomputation – Dr Chen Zhong! Chen joined King’s College London in September 2016 and her work on urban mobility directly contributes to the Geocomputation Research Domain. Here she provides a brief intro to her work.
“Space shapes transport as much as transport shapes space, which is a salient example of the reciprocity of transport and its geography.”
Quite often, I use this quote to explain the story behind my research. And I keep correcting people: I work on urban mobility, not transportation. The former, to me, has a much broader meaning and is about people and their interactions with the built environment.
Train in the sky, Singapore, 2013, source: Google
About Urban Mobility data
Most of my research explores the usage of automatically generated urban mobility data, such as smart-card data (my main source), mobile phone data and social media data. These types of data are generated by “citizens as sensors” as described by Goodchild (2007). People are carrying all kinds of sensors, such as mobile phones, smart wristbands and so on, all the time. The network formed by such sensors consists of the people themselves; therefore, it contains explicit spatial as well as implicit social information. These data sets offer us new potential to have a direct look into human behavior.
Compared to conventionally surveyed data, sensor data have significant advantages in terms of granularity, coverage, efficiency and reliability. However, they are not perfect, and demographic information about the people carrying the sensors is often absent. Nevertheless, these data sets still have a significant advantage for pattern detection and behavior analysis, thanks to their large sample sizes and reduced questionnaire bias. The challenge here is that the data are collected without a predefined analytical purpose. We need to be creative to work out how to make the best use of them. This challenge, as I see it, is also the beauty of the “Big Data” concept.
I would like to show a few examples from my previous research – big data informed urban planning – which is one of many potential uses of mobility data. The first is about investigating functional urban changes in Singapore. There, we used a set of urban indicators to identify human activity centres and the boundaries of urban regions. The changing structure of traffic flows over the years demonstrated the successful implementation of decentralization in Singapore. Moreover, the significant growth of the emerging sub-centre reveals how rapid the urban development of Singapore has been. This is definitely unique among all developed countries. When we mapped out the redrawn regional boundaries (see image at top), even a non-analytical government officer immediately interpreted the graphics and explained to us the impact of new development on people’s location choices.
One-north MRT station, Singapore, 2013
Note: Smart-card data is generated by automatic fare-collection systems. In London, it is Oyster card data. To find out more about smart-card data, and the above mentioned work, see my paper on detecting the dynamics of urban structure through spatial network analysis.
It is always interesting to compare things. We compared three world cities, namely London, Singapore and Beijing. You may expect Singapore to be the one with the most regular travel patterns using public transportation. Beijing, however, has the highest regularity with respect to “when to travel” and is second with respect to “where to go”. The most important reason is the regular passenger control measure applied at about 40 stations, where passengers are held outside the stations before being allowed to enter at regular time intervals during the morning peak. Such queues can last for miles. Passengers can either wait there, or search for and use an alternative station or mode according to their situation. Moreover, this inconsistency of regularity can also be ascribed to another phenomenon unique to Beijing, the Vehicle Plate Number Traffic Restriction Measures, under which many private car owners drive on most days but use the public transport system on one day. Though people in China sometimes complain about the inconvenience of such a policy, it does help to reduce carbon emissions and relieve road congestion. Read more about this in my paper entitled Variability in Regularity: Mining Temporal Mobility Patterns in London, Singapore and Beijing Using Smart-Card Data.
Looking forward, I have some ideas in mind and could easily list at least three directions:
- Comparative study of cities using urban mobility data (of course, not limited to smart-card data) is an option, and is already ongoing;
- Linking urban mobility patterns to urban health is another direction that could greatly widen the horizon of my research;
- Cross-checking detected patterns with multi-source data could enhance and deepen previous findings.
Goodchild, M. F. (2007). “Citizens as sensors: the world of volunteered geography.” GeoJournal 69(4): 211-221.
Rodrigue, J.-P., et al. (2013). The geography of transport systems, Routledge.
In this blog post two PhD students associated with the Geocomputation Hub – Alejandro Coca Castro and Mark de Jong – report back on workshops they recently attended. Alejandro attended a UK Data Service workshop and Mark an ESRC-funded advanced training course on Bayesian Hierarchical Models.
Hive with UK Data Service – Alejandro
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. In practice, researchers can face a big data challenge when the dataset cannot be loaded into a conventional desktop package such as SPSS, Stata or R. Besides being the curator of the largest collection of digital data in the social sciences and humanities in the UK, the UK Data Service initiative is also currently organising a series of workshops focused on big data management. These workshops aim to promote better and more efficient user-manipulation of their databases (and other sources).
Keen to attend one of the UK Data Service’s workshops, I visited the University of Manchester on 24 June 2016 to participate in “The Big Data Manipulation using Hive”. In short, Hive™ is a tool that facilitates reading, writing and managing large datasets that reside in distributed storage using SQL (a special-purpose programming language). Although a variety of applications to access the Hive environment exist, the workshop trainers showed attendees a set of tools freely available for download (further details can be accessed from the workshop material). One of the advantages of the tool mentioned is its flexibility for implementation in well-known programming languages such as R and Python.
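For readers who have not met SQL before, the sketch below shows the kind of query involved: summarising a table by group. It is not a Hive session. It uses Python’s built-in `sqlite3` module as a small local stand-in, and the `journeys` table and its values are invented purely for illustration. Hive’s query language is modelled on SQL, so a simple `SELECT ... GROUP BY` like this one should look very similar there, but the connection code would differ.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in; this is NOT a Hive connection.
conn = sqlite3.connect(":memory:")

# Hypothetical table and values, invented for this example.
conn.execute("CREATE TABLE journeys (region TEXT, year INTEGER, trips INTEGER)")
conn.executemany(
    "INSERT INTO journeys VALUES (?, ?, ?)",
    [
        ("North West", 2014, 120),
        ("North West", 2015, 150),
        ("London", 2014, 900),
        ("London", 2015, 950),
    ],
)

# Total trips per region, largest first.
query = """
    SELECT region, SUM(trips) AS total_trips
    FROM journeys
    GROUP BY region
    ORDER BY total_trips DESC
"""
totals = conn.execute(query).fetchall()
print(totals)
conn.close()
```

The appeal of this style is that the query only describes the result wanted; how the work is spread across distributed storage is left to the engine.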
My attendance at the workshop was an invaluable opportunity to gain further knowledge about the existing tools and procedures for optimising the manipulation of large datasets. In the Geography domain, these large datasets have started to become more common and accessible from multiple sources, e.g. Earth Observation data. Consequently, optimised and efficient data manipulation is key to identifying trends and patterns hidden in large datasets that cannot be detected by traditional data-driven analyses derived from small data. To find out more yourself, consider joining the free UK Data Service Open Data Dive, scheduled for 24 September 2016 at The Shed, Chester Street, Manchester.
Bayesian Hierarchical Models – Mark
I recently attended an ESRC-funded advanced training course on spatial and spatio-temporal data analysis using Bayesian hierarchical models at the Department of Geography, Cambridge (convened by Prof. Bob Haining and Dr Guangquan Li), to gain an overview of Bayesian statistics and its applications to geographical modelling problems.
Compared to ‘classical’ frequentist statistical techniques, modern modelling approaches relying upon Bayesian inference are relatively new, despite being based upon principles first proposed in the works of Thomas Bayes in 1763. Since the 1990s Bayesian methods have begun to be widely applied within the scientific community, as a result of both an increasing acceptance of the underpinning philosophy and increased computational power.
In contrast to frequentist approaches, Bayesian methods represent processes using model parameters and their associated uncertainty in terms of probabilities. Using existing knowledge (e.g. from previous studies, expert knowledge, common sense etc) about a process or parameter of interest, a ‘prior distribution’ is established. This is then used in conjunction with a ‘likelihood’ (derived entirely from a dataset relating to the specific parameter) to produce a ‘posterior distribution’ – essentially an updated belief or opinion about the model parameter.
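The prior-to-posterior update described above can be sketched in a few lines of Python. This is a deliberately simple illustration rather than anything taught on the course: it assumes the parameter is a single proportion, evaluates everything on a grid of candidate values, and uses a made-up prior and made-up data (12 ‘successes’ in 20 observations). The function names are hypothetical.

```python
from math import comb


def binomial_likelihood(p, successes, trials):
    """Probability of the observed data given a candidate proportion p."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)


def posterior_on_grid(grid, prior, successes, trials):
    """Multiply prior by likelihood at each grid point, then normalise."""
    unnormalised = [
        weight * binomial_likelihood(p, successes, trials)
        for p, weight in zip(grid, prior)
    ]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Candidate values for the unknown proportion: 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]

# Hypothetical prior: existing knowledge suggests values near 0.2
# (weights proportional to a Beta(2, 5) density, normalised to sum to 1).
raw_prior = [p * (1 - p) ** 4 for p in grid]
prior_total = sum(raw_prior)
prior = [w / prior_total for w in raw_prior]

# Hypothetical data: 12 successes in 20 observations.
posterior = posterior_on_grid(grid, prior, successes=12, trials=20)

prior_mode = grid[prior.index(max(prior))]
posterior_mode = grid[posterior.index(max(posterior))]
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
print(f"prior mode: {prior_mode:.2f}")
print(f"posterior mode: {posterior_mode:.2f}")
print(f"posterior mean: {posterior_mean:.3f}")
```

The posterior ends up between the prior belief and the proportion seen in the data, which is the ‘updated belief’ in action.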
In some situations, Bayesian approaches can be more powerful than traditional methods because they:
- are highly adaptable to individual modelling problems;
- make efficient use of available evidence relating to a quantity of interest;
- can provide an easily interpreted quantitative output.
Historically however, Bayesian approaches have been considered somewhat controversial, as the results of any analysis are heavily dependent upon the choice of the prior distribution and identification of the ‘best’ prior is often subjective. Arguably though, the existence of multiple justifiable priors may actually highlight additional uncertainty about a process that would be entirely ignored in a frequentist approach! Moreover, in many studies, it is common for researchers to make use of a ‘flat prior’ in order to reduce some of the subjectivity associated with prior selection.
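How much the choice of prior matters depends on how much data there is, and a small Python sketch can show this. It assumes a proportion with a Beta prior and binomial data, for which the posterior mean has a simple closed form. The three priors and both datasets are invented for illustration; ‘flat’ here means Beta(1, 1), i.e. every value between 0 and 1 equally plausible beforehand.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a proportion under a Beta(prior_a, prior_b) prior.

    With binomial data the posterior is Beta(prior_a + successes,
    prior_b + trials - successes), whose mean is computed here.
    """
    return (prior_a + successes) / (prior_a + prior_b + trials)


# Hypothetical priors: one flat, two informative and pulling in opposite directions.
priors = {
    "flat Beta(1, 1)": (1, 1),
    "sceptical Beta(2, 8)": (2, 8),
    "optimistic Beta(8, 2)": (8, 2),
}

# Hypothetical datasets: same observed proportion (0.5), very different sizes.
datasets = {"small": (5, 10), "large": (500, 1000)}

spreads = {}
for label, (successes, trials) in datasets.items():
    means = {
        name: posterior_mean(a, b, successes, trials)
        for name, (a, b) in priors.items()
    }
    # Gap between the highest and lowest posterior mean across the priors.
    spreads[label] = max(means.values()) - min(means.values())
    print(label, {name: round(m, 3) for name, m in means.items()})

print(spreads)
```

With only ten observations the three priors lead to noticeably different answers; with a thousand, the data dominate and the posteriors nearly coincide.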
Hotspots in Peterborough with a persistently high risk of burglary, 2005/8,
as identified with a Bayesian spatio-temporal modelling approach.
[Kindly reproduced from Li et al. (2014) with author’s permission.]
As part of the course, we learned to use the WinBUGS software with a Markov Chain Monte Carlo approach to explore a variety of spatial modelling problems, including: the identification of high intensity crime areas in UK cities, investigating the relationships between exposure to air pollution and stroke mortality, and examining the spatio-temporal variations in burglary rates. More information on the approaches taken in these studies can be found in Haining & Law (2007), Maheswaran et al. (2006), and Li et al. (2014).
Overall, the course provided a very engaging, hands-on overview of a very powerful analytical framework that has extensive applications in the field of quantitative geography. A comprehensive introduction to Bayesian analysis from a geographical perspective such as this is hard to find, and I would highly recommend that anyone with an interest in alternative approaches to spatio-temporal modelling should attend this course in future years!