Geocomputation for Geoscience: The Earth System Data Cube

PhD student and King’s Geocomputation member Alejandro Coca-Castro attended Europe’s premier geosciences event, the European Geosciences Union (EGU) General Assembly, in Vienna, Austria (April 24th – 28th 2017). In addition to presenting his preliminary PhD results in the session “Monitoring the Sustainable Development Goals with the huge Remote Sensing archives”, Alejandro kindly dedicated part of his attendance at EGU to capturing the emerging Geocomputation fields applied to the Geosciences, in particular for land and biosphere research. In this post Alejandro summarises the latest advances in Big Data technologies presented at EGU, which he sees as one of the two main emerging fields revolutionising the data-driven analysis that enables knowledge production.


 

Well-known remote sensing data producers such as the European Space Agency (ESA) and NASA are developing a wide range of data products relevant to understanding land surface processes and atmospheric phenomena, as well as human-caused changes. However, although there is an unprecedented variety of long-term monitoring data, it remains challenging to understand the exchange processes between the atmosphere and the terrestrial biosphere. To overcome this issue, ‘Big Data’ technologies are being proposed to tackle the question of how to simultaneously explore multiple Earth Observations (EOs).


Fig 1. Emerging Big Data technologies make it possible to co-explore multiple datasets with different characteristics and under different assumptions, in a more efficient and faster manner than traditional data management technologies. Source: M. Mahecha (2017) https://doi.org/10.6084/m9.figshare.4822930.v2

Amongst all collaborative initiatives presented at EGU, the Earth System Data Cube project, led by the Max Planck Institute for Biogeochemistry and funded by ESA, presented an emerging platform (E-Lab). The project aims to maximise the usage of ESA EOs and other relevant data streams. The main idea behind E-Lab’s data-stream maximisation is the so-called ‘Data Cube’ concept. This ‘cube’ enables handling and extracting information from a given georeferenced dataset, optimising the management of its spatial and time dimensions. These dimensions are used to split data into smaller sub-cubes of varying sizes. Dimensions X and Y are the spatial dimensions (i.e., latitude and longitude); the third dimension corresponds to time; and the fourth comprises the multiple variables or data streams themselves. All data uploaded into E-Lab sit under this elegant and efficient ‘Data Cube’ umbrella, and simultaneous exploration is made possible mainly by a set of predefined preprocessing rules applied during the data ingestion process.


Fig 2. Representation of the ‘Data Cube’ concept and its related structure applied to three different data sets (V1, V2, V3). Source: Earth System Data Cube (2017).
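To make the cube structure concrete, here is a minimal sketch using the open-source Python library xarray. This is an illustration only, not the E-Lab API: the grid, time span and data streams (V1, V2, V3) are hypothetical, chosen to mirror Fig 2.

```python
# A minimal sketch of the 'Data Cube' idea with xarray -- not the E-Lab
# implementation, just the lat x lon x time x variable structure it describes.
import numpy as np
import pandas as pd
import xarray as xr

lat = np.linspace(-89.5, 89.5, 180)    # dimension Y
lon = np.linspace(-179.5, 179.5, 360)  # dimension X
time = pd.date_range("2001-01-01", periods=12, freq="MS")  # third dimension

# Three hypothetical data streams (V1, V2, V3) on a shared grid:
# the fourth dimension of the cube
cube = xr.Dataset(
    {v: (("time", "lat", "lon"),
         np.random.rand(time.size, lat.size, lon.size))
     for v in ["V1", "V2", "V3"]},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Slice out a spatio-temporal sub-cube and co-explore all streams at once
subset = cube.sel(lat=slice(40, 60), lon=slice(-10, 30))
print(subset.mean(dim=["lat", "lon"]))  # one monthly series per stream
```

Because every data stream shares the same coordinates, a single `sel` call extracts a consistent sub-cube across all variables, which is the essence of what the predefined ingestion rules make possible.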

E-Lab provides scientists with a virtual online laboratory where the “Data Cube” can be explored, standard processing chains can be examined, and new workflows can be tested. JupyterHub is the underlying framework of the platform, which makes it simple for users to work on the data cube through the popular Jupyter notebook. The notebook supports high-level programming languages, mainly Julia and Python, with R support still somewhat underdeveloped at this stage.

The Earth System Data Cube initiative is a pioneering project offering an open and free-of-charge collaborative virtual platform, with a solid background in the analysis of large datasets and a sound understanding of the Earth system. However, a challenge remains with regard to the standards of data infrastructure, metadata and sharing protocols for existing and incoming projects, whether private or public, supported by the ‘Data Cube’ concept. A first step towards tackling this concern is being led by the EarthCube and NextGEOSS initiatives, which were also part of the transdisciplinary programme covered by this year’s EGU.


 

Interested in how crowdsourcing is revolutionising the way knowledge is collected and extracted for data-driven analysis? If so, look out for a blog post on the topic right here this coming Friday.

The author is grateful to the Geography Department Small Grants and the P4GES: Can Paying for Global Ecosystem Services reduce poverty? project for funding his attendance at the EGU General Assembly. English revision by Sarah Jones; content revision by Miguel Mahecha.

For updates about Alejandro’s research, follow @alejo_coca on Twitter.


Urban mobility data analysis

Introducing a new member of King’s Geocomputation – Dr Chen Zhong! Chen joined King’s College London in September 2016 and her work on urban mobility directly contributes to the Geocomputation Research Domain. Here she provides a brief intro to her work.

“Space shapes transport as much as transport shapes space, which is a salient example of the reciprocity of transport and its geography.”

Rodrigue, Comtois et al. 2013

Quite often, I use this quote to explain the story behind my research. I also keep correcting people: I work on urban mobility, not transportation. The former, to me, has a much broader meaning and is about people and their interactions with the built environment.


Train in the sky, Singapore, 2013, source: Google

About Urban Mobility data

Most of my research explores the use of automatically generated urban mobility data, such as smart-card data (my main source), mobile phone data and social media data. These types of data are generated by “citizens as sensors”, as described by Goodchild (2007): people carry all kinds of sensors, such as mobile phones and smart wristbands, all the time. The network formed by such sensors consists of the people themselves; therefore, it contains explicit spatial as well as implicit social information. These data sets offer us new potential to look directly into human behaviour.

Compared to conventionally surveyed data, sensor data has significant advantages in terms of granularity, coverage, efficiency and reliability. These data are not perfect, however: demographic information about the people carrying the sensors is often absent. Nevertheless, they retain a significant advantage for pattern detection and behaviour analysis, thanks to their large sample sizes and lower questionnaire bias. The challenge is that such data are collected for purposes other than research, so we need to be creative about how to make the best use of them. This challenge, as I see it, is also the beauty of the “Big Data” concept.

Smart–card data

I would like to show a few examples from my previous research on big-data-informed urban planning, which is one of many potential uses of mobility data. The first concerns investigating functional urban changes in Singapore. There, we used a set of urban indicators to identify human activity centres and the boundaries of urban regions. The changing structure of traffic flows over the years demonstrated the successful implementation of decentralisation in Singapore. Moreover, the significant growth of an emerging sub-centre reveals how rapid Singapore’s urban development has been, which is unique among developed countries. When we mapped out the redrawn regional boundaries (see image at top), even a non-analytical government officer could immediately interpret the graphics and explain to us the impact of new development on people’s location choices.


One-north MRT station, Singapore, 2013

Note: Smart-card data are generated by automatic fare-collection systems; in London, this is Oyster card data. To find out more about smart-card data, and the above-mentioned work, see my paper on detecting the dynamics of urban structure through spatial network analysis.
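For a taste of how such records are typically processed, the sketch below aggregates tap-in/tap-out records into an origin-destination (OD) flow matrix with pandas. The schema, card IDs and station names are invented for illustration, not an actual Oyster data layout.

```python
# A minimal sketch: smart-card tap records -> origin-destination matrix.
# Column names and records are hypothetical, not a real fare-system schema.
import pandas as pd

taps = pd.DataFrame({
    "card_id": ["A1", "A1", "B2", "B2", "C3"],
    "entry":   ["Waterloo", "Bank", "Waterloo", "Oval", "Bank"],
    "exit":    ["Bank", "Waterloo", "Oval", "Waterloo", "Oval"],
})

# Count trips between each station pair -> a spatial interaction network
od_matrix = (taps.groupby(["entry", "exit"])
                  .size()
                  .unstack(fill_value=0))
print(od_matrix)
```

Matrices like this are the raw material for the network-analysis methods used to detect activity centres and regional boundaries.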

Comparing Cities

It is always interesting to compare things. We compared three world cities: London, Singapore and Beijing. You might expect Singapore to be the one with the most regular travel patterns on public transportation. Beijing, however, has the highest regularity with respect to “when to travel” and is second with respect to “where to go”. The most important reason is a passenger-control measure applied at about 40 stations, where passengers are held outside the station and allowed to enter at regular time intervals during the morning peak. Such queues can last for miles; passengers can either wait, look for an alternative station, or switch to another mode according to their situation. Moreover, this inconsistency of regularity can also be attributed to another phenomenon unique to Beijing: licence-plate-based traffic restriction measures, under which many private car owners drive on most days but use the public transport system on one day. Though people in China sometimes complain about the inconvenience of such a policy, it does help reduce carbon emissions and relieve road congestion. Read more about this in my paper entitled Variability in Regularity: Mining Temporal Mobility Patterns in London, Singapore and Beijing Using Smart-Card Data.
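As a rough illustration of how “when to travel” regularity might be quantified, the sketch below computes the entropy of a traveller’s departure-hour distribution: lower entropy means more predictable travel times. This is an assumed, simplified metric for illustration, not necessarily the measure used in the paper, and the travellers are invented.

```python
# A simplified, assumed regularity metric: Shannon entropy of tap-in hours.
import numpy as np

def departure_entropy(hours):
    """Entropy (bits) of a traveller's departure-hour distribution (0-23)."""
    counts = np.bincount(hours, minlength=24)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty hours so log2 is defined
    return -(p * np.log2(p)).sum()

commuter = [8, 8, 8, 9, 8, 8, 9, 8]          # hypothetical regular traveller
occasional = [7, 11, 15, 9, 20, 13, 8, 17]   # hypothetical irregular one
print(departure_entropy(commuter))    # low entropy: regular "when to travel"
print(departure_entropy(occasional))  # high entropy: irregular travel
```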

Outlook

Looking forward, I have some ideas in mind and can readily list at least three directions:

  1. Comparative study of cities using urban mobility data (of course, not limited to smart-card data) is an option, and is already ongoing;
  2. Linking urban mobility patterns to urban health is another direction that could greatly widen the horizon of my research;
  3. Cross-checking detected patterns with multi-sources data could enhance and deepen previous findings.

If you are interested in any of these ideas or you want to chat about your great idea, please do not hesitate to contact me. To read more, click here.

References

Goodchild, M. F. (2007). “Citizens as sensors: the world of volunteered geography.” GeoJournal 69(4): 211-221.

Rodrigue, J.-P., Comtois, C. and Slack, B. (2013). The Geography of Transport Systems. Routledge.

 


The Full Stack: Tools & Processes for Urban Data Scientists

Recently, I was asked to give talks at both UCL’s CASA and the ETH Future Cities Lab in Singapore for students and staff new to ‘urban data science’ and the sorts of workflows involved in collecting, processing, analysing, and reporting on … Continue reading 

Big Data and Bayesian Modelling Workshops

In this blog post two PhD students associated with the Geocomputation Hub – Alejandro Coca Castro and Mark de Jong – report back on workshops they recently attended. Alejandro attended a UK Data Service workshop and Mark an ESRC-funded advanced training course on Bayesian Hierarchical Models.

Hive with UK Data Service – Alejandro

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. In practice, researchers face a big data challenge when a dataset cannot be loaded into a conventional desktop package such as SPSS, Stata or R. Besides being the curator of the largest collection of digital data in the social sciences and humanities in the UK, the UK Data Service is also currently organising a series of workshops focused on big data management. These workshops aim to promote better and more efficient user manipulation of its databases (and other sources).


Keen to attend one of the UK Data Service’s workshops, I visited the University of Manchester on 24 June 2016 to participate in “Big Data Manipulation using Hive”. In short, Hive™ is a tool that facilitates reading, writing and managing large datasets residing in distributed storage, using an SQL-like query language. Although a variety of applications for accessing the Hive environment exist, the workshop trainers showed attendees a set of tools freely available for download (further details can be accessed from the workshop material). One of the tool’s advantages is its flexibility for use alongside well-known programming languages such as R and Python.
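For a flavour of what this looks like in practice, the sketch below queries Hive from Python using the PyHive client library, one of several options and not necessarily the tool used in the workshop. The connection details and table are hypothetical.

```python
# A minimal sketch of querying Hive from Python via PyHive; the host,
# credentials and table below are hypothetical placeholders.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="student")
cursor = conn.cursor()

# The query language reads like SQL but runs over files in distributed storage
cursor.execute("""
    SELECT region, COUNT(*) AS n_records
    FROM survey_responses      -- hypothetical table
    GROUP BY region
    ORDER BY n_records DESC
""")
for region, n in cursor.fetchall():
    print(region, n)
```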

My attendance at the workshop was an invaluable experience for gaining further knowledge of existing tools and procedures for optimising the manipulation of large datasets. In the Geography domain, such large datasets are becoming more common and accessible from multiple sources, e.g. Earth observation data. Consequently, optimised and efficient data manipulation is key to identifying trends and patterns hidden in large datasets that would not be visible to traditional data-driven analyses of small data. To find out more yourself, consider joining the free UK Data Service Open Data Dive, scheduled for 24 September 2016 at The Shed, Chester Street, Manchester.

Alejandro Coca Castro

 

Bayesian Hierarchical Models – Mark

I recently attended an ESRC-funded advanced training course on spatial and spatio-temporal data analysis using Bayesian hierarchical models at the Department of Geography, Cambridge (convened by Prof. Bob Haining and Dr Guangquan Li), to gain an overview of Bayesian statistics and its applications to geographical modelling problems.

Compared to ‘classical’ frequentist statistical techniques, modern modelling approaches relying upon Bayesian inference are relatively new, despite being based upon principles first proposed in the work of Thomas Bayes, published in 1763. Since the 1990s, Bayesian methods have been increasingly widely applied within the scientific community, as a result of both growing acceptance of the underpinning philosophy and increased computational power.

Thomas Bayes

In contrast to frequentist approaches, Bayesian methods represent processes using model parameters and their associated uncertainty in terms of probabilities. Using existing knowledge (e.g. from previous studies, expert knowledge, common sense, etc.) about a process or parameter of interest, a ‘prior distribution’ is established. This is then combined with a ‘likelihood’ (derived entirely from a dataset relating to the specific parameter) to produce a ‘posterior distribution’: essentially an updated belief about the model parameter.
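As a concrete, deliberately simple illustration of this updating step, the sketch below uses a conjugate Beta-Binomial pair in Python, where the posterior is available in closed form. The prior and data are invented for illustration.

```python
# A minimal sketch of a Bayesian update (Beta-Binomial conjugate pair):
# prior + likelihood -> posterior. All numbers here are made up.
from scipy import stats

# Prior belief about a proportion theta: Beta(2, 2), weakly informative
a_prior, b_prior = 2, 2

# Data: 7 'successes' in 20 trials (a Binomial likelihood)
successes, trials = 7, 20

# Conjugacy gives the posterior in closed form: Beta(a + k, b + n - k)
a_post = a_prior + successes
b_post = b_prior + trials - successes
posterior = stats.beta(a_post, b_post)

print(f"Posterior mean: {posterior.mean():.3f}")   # updated belief about theta
print(f"95% credible interval: {posterior.interval(0.95)}")
```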

In some situations, Bayesian approaches can be more powerful than traditional methods because they:

  1. are highly adaptable to individual modelling problems;
  2. make efficient use of available evidence relating to a quantity of interest;
  3. can provide an easily interpreted quantitative output.

Historically, however, Bayesian approaches have been considered somewhat controversial, as the results of any analysis are heavily dependent upon the choice of the prior distribution, and identification of the ‘best’ prior is often subjective. Arguably, though, the existence of multiple justifiable priors may actually highlight additional uncertainty about a process that would be entirely ignored in a frequentist approach! Moreover, in many studies it is common for researchers to use a ‘flat prior’ in order to reduce some of the subjectivity associated with prior selection.


Hotspots in Peterborough with a persistently high risk of burglary, 2005/8,
as identified with a Bayesian spatio-temporal modelling approach.
[Kindly reproduced from Li et al. (2014) with author’s permission.]

As part of the course, we learned to use the WinBUGS software with a Markov chain Monte Carlo (MCMC) approach to explore a variety of spatial modelling problems, including: the identification of high-intensity crime areas in UK cities, investigating the relationships between exposure to air pollution and stroke mortality, and examining spatio-temporal variations in burglary rates. More information on the approaches taken in these studies can be found in Haining & Law (2007), Maheswaran et al. (2006), and Li et al. (2014).
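For intuition about what MCMC does under the hood, here is a toy random-walk Metropolis sampler in Python. This is an illustrative sketch of the general idea only, not the WinBUGS software (which relies on Gibbs sampling), and the target distribution is a stand-in.

```python
# A toy Metropolis sampler illustrating the MCMC idea: draw dependent
# samples whose long-run distribution matches a target posterior.
import numpy as np

rng = np.random.default_rng(42)

def log_posterior(theta):
    # Unnormalised log-posterior; a standard normal target as a stand-in
    return -0.5 * theta**2

n_samples, step = 5000, 1.0
samples = np.empty(n_samples)
theta = 0.0
for i in range(n_samples):
    proposal = theta + rng.normal(scale=step)      # random-walk proposal
    log_accept = log_posterior(proposal) - log_posterior(theta)
    if np.log(rng.uniform()) < log_accept:         # accept/reject step
        theta = proposal
    samples[i] = theta

# Discard burn-in, then summarise the posterior from the remaining draws
print(samples[1000:].mean(), samples[1000:].std())
```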

Overall, the course provided a very engaging, hands-on overview of a powerful analytical framework with extensive applications in the field of quantitative geography. A comprehensive introduction to Bayesian analysis from a geographical perspective is hard to find, and I would highly recommend that anyone with an interest in alternative approaches to spatio-temporal modelling attend this course in future years!

Mark de Jong


‘Mapping the Space of Flows’: the geography of the London Mega-City Region

I’m pleased to be able to post here the penultimate version of an article that Duncan Smith and I recently had accepted to Regional Studies. In this article we look at ways of combining ‘big data’ from a telecoms network … Continue reading