Data-driven revolution in Earth Observation (EO) research

PhD student and King’s EOES member Alejandro Coca-Castro attended Europe’s premier geosciences event, the European Geosciences Union (EGU) General Assembly, held in Vienna, Austria (April 8th – 13th, 2018). Alejandro contributed to the EGU session “Information extraction from satellite Earth observations using data-driven methods” with a 12-minute presentation on the main progress of the second year of his PhD research at King’s. This blog post summarises the key context of the session and describes the main points of Alejandro’s work on applying a novel data-driven algorithm, based on deep learning principles, to land cover classification.

Fig 1. Alejandro introducing the underlying concepts behind the applied data-driven method for large-scale land cover classification. Photo by Alöis Tilloy.

Thanks to the unprecedented volumes of EO archives, emerging data-driven methods offer new possibilities and perspectives for investigating and understanding the functioning and complexity of the hydrosphere, biosphere and atmosphere.

The session “Information extraction from satellite Earth observations using data-driven methods” contributed to the EGU assembly with six exciting oral presentations and thirteen posters. These covered a broad and diverse range of algorithms and proposed methods for filling gaps in ocean, hydrology, vegetation, urban, and atmospheric research.

Alejandro’s presentation in this session covered the main aspects, benefits and challenges of a deep-learning-based method for large-scale land cover classification. This particular deep learning method is inspired by a “sequence-to-sequence” (seq2seq) model, which has been successfully applied in the commercial and research sectors for machine translation, speech recognition and image captioning, among others. Seq2seq models map arbitrary-length sequences to other arbitrary-length sequences using fixed-size architectures.
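To make the idea of a fixed-size architecture handling arbitrary-length sequences more concrete, here is a minimal sketch in plain Python. It shows only the encoder half of a seq2seq model, and it is an illustration rather than the model presented at EGU: the helper names (make_encoder, encode), the random weights and the toy two-band "reflectance" values are all assumptions made for this example.

```python
import math
import random

def make_encoder(n_features, n_hidden, seed=0):
    """Create random weights for a toy recurrent encoder (hypothetical helper)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_features)]
            for _ in range(n_hidden)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
             for _ in range(n_hidden)]
    return w_in, w_rec

def encode(sequence, w_in, w_rec):
    """Fold a sequence of observations into one fixed-size state vector."""
    state = [0.0] * len(w_rec)
    for obs in sequence:
        # The same weights are reused at every time step, so the
        # sequence can be of any length.
        state = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[j], obs))
                + sum(w * h for w, h in zip(w_rec[j], state))
            )
            for j in range(len(state))
        ]
    return state

# Toy time series: each observation has two made-up band values.
w_in, w_rec = make_encoder(n_features=2, n_hidden=4)
short_series = [[0.1, 0.3], [0.2, 0.4], [0.1, 0.5]]   # 3 observations
long_series = short_series * 5                        # 15 observations

state_short = encode(short_series, w_in, w_rec)
state_long = encode(long_series, w_in, w_rec)
print(len(state_short), len(state_long))  # both states have 4 values
```

Whatever the number of observations, the encoder returns a state of the same size. In a full seq2seq model, a decoder would then unroll that state into an output sequence.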

Unlike traditional data-driven methods (random forests, support vector machines, decision trees), a seq2seq model provides an elegant way to compress the temporal dimension of multitemporal EO data (so-called time series) without further pre- or post-processing. Figure 2 illustrates an unconventional introduction to seq2seq models by Nag (2016). In this analogy, the clowns are the EO observations and the “clown car” is a given land cover class. The information about each clown (hat shape, clothes colour, gender) can be understood as the spectral data and contextual information (fire, elevation, climate) provided by EO data. To make a classification decision, the spectral and contextual information of every EO observation (or clown) is efficiently analysed by a type of algorithm (or a clown boss) that indicates which information should be passed along the sequence of EO observations.

Figure 2. Where were all those clowns hiding in that tiny car? Source: Nag (2016).

For sequence (or time series) analysis, the seq2seq scheme uses Recurrent Neural Networks (RNNs) (Figure 3), algorithms specialised in processing sequential data. RNNs are a type of artificial neural network designed in the 1980s which have recently gained huge popularity, in particular through adapted versions such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). These adapted versions are essentially memory cells with a set of gating units that regulate the flow of information into and out of the cell, allowing both long- and short-term information to be memorised. Where spatio-temporal correlations exist, as commonly occurs in remote sensing classification, more sophisticated RNNs called Convolutional RNNs (denoted ConvRNN) are already available and have shown promising advances in precipitation forecasting; see Shi et al. (2015). ConvRNNs add convolutional properties to RNN cells, allowing them to efficiently capture the contextual information carried by neighbouring pixels alongside the temporal attributes inherently captured by the RNN structure.
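As a rough illustration of how gating units regulate the flow of information, the sketch below implements one time step of a single LSTM memory cell with a scalar input and a scalar state, following the standard LSTM equations. It is a simplified teaching example, not the ConvLSTM used in Alejandro’s experiments: the function name lstm_step, the parameter names and the weight values are all hypothetical.

```python
import math

def sigmoid(x):
    """Squash a value into (0, 1) so it can act as a gate."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single scalar LSTM memory cell."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the cell output
    return h, c

# Hypothetical weights; in practice these are learned during training.
params = {
    "wf": 0.5, "uf": 0.1, "bf": 1.0,
    "wi": 0.8, "ui": -0.2, "bi": 0.0,
    "wo": 0.3, "uo": 0.4, "bo": 0.0,
    "wg": 1.2, "ug": 0.5, "bg": 0.0,
}

# Made-up time series for one pixel (e.g. a vegetation index).
h, c = 0.0, 0.0
for x in [0.12, 0.15, 0.40, 0.38]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is carried forward almost unchanged, which is how long-term information is retained. A ConvLSTM replaces the scalar multiplications with convolutions over neighbouring pixels.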

Figure 3. Alejandro’s slide illustrating an overview of Artificial Neural Networks and types according to their connection structure.

Based on the principles of seq2seq and the adapted ConvRNNs discussed above, Alejandro and collaborators are applying and evaluating those concepts to produce consistent, and where possible operational, large-scale land cover classification using MODIS archives of surface reflectance along with existing multitemporal land cover maps (MCD12Q1 and ESA-CCI). A set of experiments across South America allowed Alejandro to highlight the following key messages to the audience attending the EGU’s data-driven session:

  • The capability and consistency of multitemporal data-driven approaches such as seq2seq for large-scale land cover (LC) classification should be explored further using long-term, high-temporal-resolution EO archives such as MODIS;
  • Experiments using two MODIS surface reflectance products and two global reference datasets confirm both the information extraction and the discriminative properties through time (including cloud filtering) of the seq2seq architecture applied, in this case using ConvLSTM cells (an adapted version of RNNs), for land cover classification;
  • According to the findings of Rußwurm and Körner (2018), similar (and faster) results are expected if ConvGRU cells are trained instead;
  • Models trained with preprocessed reference datasets achieved considerably better accuracy than those trained with the non-preprocessed versions, particularly for MCD12Q1 compared with ESA-CCI. The latter has a rule-based scheme to reduce illogical LC transitions, which might affect the preprocessing approach;
  • Ancillary data such as elevation and climate variables might help to increase accuracy for large-scale classification (further evidence is needed).

Interested in exploring further the seq2seq model originally applied to crop classification using Sentinel data? Here’s the source code on GitHub:

The author is grateful to the Geography Department Small Grants for funding Alejandro’s successful attendance at the EGU General Assembly, as well as to the conveners who organised the EGU session to which Alejandro contributed. Additionally, the author acknowledges the support of the partner institutions and collaborators of his research, in particular Marc Rußwurm (Technical University of Munich) for assisting in the adaptation and application of the data-driven method used, and the International Center for Tropical Agriculture (CIAT) for providing all the resources to process and run the experiments presented at EGU. Revision by Dr. Emma Tebbs.

For updates about Alejandro’s research, follow @alejo_coca on Twitter.