Geocomputation for Geoscience: The Earth System Data Cube

PhD student and King’s Geocomputation member Alejandro Coca-Castro attended Europe’s premier geosciences event, the European Geosciences Union (EGU) General Assembly, in Vienna, Austria (April 24th – 28th 2017). In addition to presenting his preliminary PhD results in the session “Monitoring the Sustainable Development Goals with the huge Remote Sensing archives”, Alejandro kindly dedicated part of his time at EGU to capturing the emerging Geocomputation fields applied to the Geosciences, in particular to land and biosphere research. In this post Alejandro summarises the latest advances in Big Data technologies presented at EGU, which he sees as one of the two main emerging fields revolutionizing data-driven analysis and the knowledge production it enables.

Well-known remote sensing data producers such as the European Space Agency (ESA) and NASA are developing a wide range of data products relevant to understanding land surface processes and atmospheric phenomena as well as human-caused changes. However, although there is an unprecedented variety of long-term monitoring data, it remains challenging to understand the exchange processes between the atmosphere and the terrestrial biosphere. To overcome this issue, ‘Big Data’ technologies are being proposed to tackle the question of how to explore multiple Earth Observations (EOs) simultaneously.

Fig 1. Emerging Big Data technologies make it possible to co-explore multiple datasets with different characteristics and under different assumptions, more efficiently and faster than with traditional data management technologies. Source: M. Mahecha (2017) https://doi.org/10.6084/m9.figshare.4822930.v2

Amongst the collaborative initiatives presented at EGU, the Earth System Data Cube project, led by the Max Planck Institute for Biogeochemistry and funded by ESA, presented an emerging platform (E-Lab). The project aims to maximize the usage of ESA EOs and other relevant data streams. The main idea behind this is the so-called ‘Data Cube’ concept, which enables information to be handled and extracted for a given georeferenced dataset by optimising the management of its spatial and time dimensions. These dimensions are used to split the data into smaller sub-cubes of varying dimensions. The X and Y dimensions are spatial (i.e., longitude and latitude), the third dimension corresponds to time, and the fourth comprises the multiple variables or data streams themselves. All data uploaded into E-Lab sit under the elegant and efficient ‘Data Cube’ umbrella, and simultaneous exploration is made possible mainly by a set of predefined preprocessing rules applied during data ingestion.

Fig 2. Representation of the ‘Data Cube’ concept and its related structure applied to three different datasets (V1, V2, V3). Source: Earth System Data Cube (2017).
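
To make the idea of splitting a cube into sub-cubes more concrete, here is a minimal sketch in plain Python. It is not the E-Lab API: the dimension names, the cube and sub-cube sizes, and the helper functions (axis_ranges, sub_cubes, count_cells) are all hypothetical and chosen only for illustration. The sketch simply works out which index ranges each sub-cube would cover along the variable, time, latitude and longitude dimensions.

```python
from itertools import product

# Hypothetical cube layout: dimension name -> number of grid steps.
# These sizes are illustrative and do not describe any real E-Lab dataset.
CUBE_SHAPE = {"variable": 3, "time": 8, "lat": 4, "lon": 6}

# Hypothetical sub-cube size along each dimension.
CHUNK_SHAPE = {"variable": 1, "time": 4, "lat": 2, "lon": 3}


def axis_ranges(length, chunk):
    """Split one axis into consecutive (start, stop) index ranges."""
    return [(start, min(start + chunk, length))
            for start in range(0, length, chunk)]


def sub_cubes(cube_shape, chunk_shape):
    """Yield one dict per sub-cube, mapping each dimension to its range."""
    names = list(cube_shape)
    per_axis = [axis_ranges(cube_shape[n], chunk_shape[n]) for n in names]
    # Every combination of per-axis ranges describes exactly one sub-cube.
    for combo in product(*per_axis):
        yield dict(zip(names, combo))


def count_cells(block):
    """Number of grid cells covered by one sub-cube."""
    cells = 1
    for start, stop in block.values():
        cells *= stop - start
    return cells


blocks = list(sub_cubes(CUBE_SHAPE, CHUNK_SHAPE))
print(len(blocks), "sub-cubes; first one covers", blocks[0])
```

Because every sub-cube is described by the same four dimensions, an analysis written for one block can in principle be applied to all the others, which is what makes this kind of layout convenient for co-exploring several data streams.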

E-Lab provides scientists with a virtual online laboratory where the ‘Data Cube’ can be explored, standard processing chains can be examined, and new workflows can be tested. JupyterHub is the underlying framework of the platform. This makes it simple for users to work on the data cube using the popular Jupyter notebook, which supports high-level programming languages (mainly Julia and Python, plus some R, although R support is still underdeveloped at this stage).

The Earth System Data Cube initiative is a pioneering project offering an open and free-of-charge collaborative virtual platform, with a solid background in the analysis of large datasets and a sound understanding of the Earth System. However, a challenge remains with regard to standards for data infrastructure, metadata and sharing protocols across existing and incoming projects, whether private or public, that are built on the ‘Data Cube’ concept. A first step towards tackling this concern is being led by the EarthCube and NextGEOSS initiatives, which were also part of the transdisciplinary programme at this year’s EGU.

Interested in how Crowdsourcing is revolutionizing the way knowledge is collected and extracted for data-driven analysis? If so, look out for a blog post on the topic right here this coming Friday.

The author is grateful to the Geography Department Small Grants and the P4GES: Can Paying for Global Ecosystem Services reduce poverty? project for funding his attendance at the EGU General Assembly. The English was revised by Sarah Jones and the content by Miguel Mahecha.

For updates about Alejandro’s research, follow @alejo_coca on Twitter.