Data Cube
2017-09-26 14:15 - 2017-09-26 15:45
Chairs: Baumann, Peter (Jacobs University | rasdaman GmbH) - Desnos, Yves-Louis (ESA-ESRIN)
-
Paper 119 - Session title: Data Cube
15:30 Evaluation of Array Database to Manage and Query Massive Sensor Data
Joshi, Jigeeshu; Pebesma, Edzer; Appel, Marius; Institut für Geoinformatik (ifgi), Heisenbergstrasse 2, 48149 Münster, Germany
Many environmental monitoring observations are recorded at a fixed number of stations with a constant temporal frequency. Arrays are a natural choice to represent such data, with space and time as the two dimensions. Current implementations of the Sensor Observation Service (SOS) of the Open Geospatial Consortium (OGC), however, typically use a normalized relational database as the data backend. If the data grow to billions of records, querying and updating the database become relatively slow, with indexes consuming substantial resources. This study evaluates SciDB, an array database, as a backend for massive time series data. In a multi-dimensional array database system like SciDB, the dimensions themselves serve as the index. Indexes are implicit rather than stored, which saves substantial resources. Moreover, in the array data model, data that are close together in array space are stored close together on disk. This is advantageous for sensor observation data, which exhibit spatial and temporal patterns: proximity in storage of correlated data improves the efficiency of query processing. In a use case, SciDB features such as parallelization, array dimension manipulation, chunked storage and compression are demonstrated on an air quality dataset provided by the European Environment Agency; it is essentially a time series of air quality observations recorded by all member states at selected stations across Europe. The study compares SciDB as an array database to PostgreSQL (a popular choice in the geospatial community) as a relational database. A common platform was set up to measure the performance of the two systems and compare their approaches. Results show that SciDB has a significant advantage for queries such as sequential scans, group-by aggregations, joins, and the computation of quantiles on this dataset. An index on a PostgreSQL table reduces the response time for filter and slicing queries, but the index size grows substantially as the data size increases.
On the other hand, SciDB's compression greatly reduces the disk storage required. The results of these tests and the discussions presented are particularly useful for Earth Observation studies that require multi-dimensional data management.
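The core point of the abstract can be illustrated with a minimal sketch (NumPy, not SciDB itself): in the array model the station and time dimensions act as the implicit index, so slicing and per-station aggregation need no stored index structure. The variable name and values below are entirely hypothetical.

```python
import numpy as np

# 4 stations x 6 hourly time steps of synthetic PM10 readings.
# The dimensions ARE the index: no separate index structure is stored.
pm10 = np.array([
    [12.0, 14.0, 15.0, 13.0, 11.0, 10.0],
    [20.0, 22.0, 25.0, 24.0, 21.0, 19.0],
    [ 8.0,  9.0,  9.5,  8.5,  8.0,  7.5],
    [30.0, 31.0, 29.0, 28.0, 27.0, 26.0],
])

# "Slicing" query: readings of station 1 for time steps 2..4
slice_q = pm10[1, 2:5]

# "Group-by aggregation": mean per station, aggregating over the time axis
station_mean = pm10.mean(axis=1)

# Quantile computed over the whole cube
q90 = np.quantile(pm10, 0.9)
```

In a relational backend the same queries would filter a (station, time, value) table, which is why index size becomes the dominant cost as the table grows.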
-
Paper 139 - Session title: Data Cube
15:15 Application Of The Data Cube Concept For Multi‐Temporal Satellite Imagery – A Complete Open Science Workflow For Data Science In EO
Kristen, Harald (1); Jacob, Alexander (1); Vicente, Fernando (2); Marin, Carlo (1); Monsorno, Roberto (1); Notarnicola, Claudia (1) 1: Eurac Research, Italy; 2: DARES Technology, Spain
ESA's Sentinel satellites provide satellite imagery at an unprecedented temporal and spatial resolution, shared as open data with everyone. ESA foresees that its Earth Observation (EO) archive will grow from 23 PB in 2018 to more than 51 PB in 2022, with the Copernicus Sentinel missions as the major driver.
To exploit this large amount of data fully, the EO user community, both scientific and commercial, has to change the way it uses remote sensing imagery. It will no longer be feasible to store and process EO data on single workstations, especially for time series analysis at larger scales.
This paper shows a novel approach to big data science for EO that can deal with the specific challenges of long time series processing using only Free and Open Source Software (FOSS). In detail, we address the problem of land cover and vegetation mapping using time series of Sentinel-1 data by exploiting the temporal evolution of the interferometric coherence. The analysis is conducted for the study area of South Tyrol in Italy as part of the recently started ESA SEOM SInCohMap project ( http://www.sincohmap.org ).
All the exploited data is hosted and processed directly on the high-performance cloud infrastructure of the Eurac Research Sentinel Alpine Observatory ( http://sao.eurac.edu ). In detail, the pre-processed data is organized in data cubes, implemented in the raster array database rasdaman. This guarantees fast and standardized access to the data via the OGC Web Coverage Service (WCS), following the principle "you only get what you need". Furthermore, many of the classic processing tasks, such as subsetting, reprojection, resampling and time series aggregation, are performed directly in the data cubes via the OGC Web Coverage Processing Service (WCPS), according to the second principle "bring the processing to the data".
Because of its simplicity, its open-source licensing and its rising popularity in the EO community, Python is used as the programming language for implementing the land cover and vegetation mapping. To download and process data via WCPS, we developed a Python module that sends properly formatted WCPS queries to the rasdaman server and transforms the response into data structures that can be used directly for further analysis in Python. For the land cover and vegetation classification, two algorithms well described in this field, Random Forests and Support Vector Machines, are applied. All the Python scripts run on a Jupyter server, which in turn runs in the SAO cloud.
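A helper module of the kind described above might look as follows; this is a hedged sketch, not the authors' actual code. The coverage name (S1_coherence), axis labels and endpoint path are hypothetical, and the ProcessCoverages key-value binding follows the common rasdaman convention.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

WCPS_ENDPOINT = "http://sao.eurac.edu/rasdaman/ows"  # hypothetical path

def build_wcps_query(coverage, t0, t1, lat, lon, fmt="application/netcdf"):
    """Return a WCPS query selecting a time range and a lat/lon box."""
    return (
        f'for c in ({coverage}) '
        f'return encode(c[ansi("{t0}":"{t1}"), '
        f'Lat({lat[0]}:{lat[1]}), Long({lon[0]}:{lon[1]})], "{fmt}")'
    )

def fetch(query):
    """Send the query as a WCS ProcessCoverages request, return raw bytes."""
    params = urlencode({
        "service": "WCS", "version": "2.0.1",
        "request": "ProcessCoverages", "query": query,
    })
    with urlopen(f"{WCPS_ENDPOINT}?{params}") as resp:
        return resp.read()

# Build a query for a half-year coherence subset over a small box
query = build_wcps_query("S1_coherence", "2017-01-01", "2017-06-30",
                         lat=(46.4, 46.7), lon=(11.2, 11.6))
```

The returned bytes (e.g. a NetCDF payload) would then be decoded into arrays for the classification step.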
Git is used as the version control system to document and publish the created code for potential users under the MIT license. This fosters collaboration, as other researchers can readily inspect, comment on and help develop the code.
With the proposed approach, EO users can process and analyze large amounts of data in a cloud infrastructure, accessing services and data remotely through a high-level programming language. In this way, they can focus on algorithm development instead of data preparation.
-
Paper 172 - Session title: Data Cube
14:45 Copernicus Climate Data at your fingertips - ECMWF’s data archive as a data cube
Wagemann, Julia (1); Kakaletris, George (2); Apostolopoulos, Konstantinos (2); Kouvarakis, Manos (2); Merticariu, Vlad (3); Siemen, Stephan (1); Baumann, Peter (3) 1: European Centre for Medium-Range Weather Forecasts (ECMWF), United Kingdom; 2: Communication and Information Technologies (CITE) S.A., Greece; 3: Jacobs University Bremen, Germany
The Meteorological Archival and Retrieval System (MARS) is the world's largest archive of meteorological and climate data. It currently holds more than 250 petabytes of operational and research data, as well as data of the three Copernicus services: the Atmosphere Monitoring Service (CAMS), the Climate Change Service (C3S) and the Emergency Management Service (CEMS). ECMWF's current data retrieval system is an efficient download service for global fields and geographical subsets thereof, offering data in either GRIB or NetCDF format.
The growing volume of the data, however, makes it challenging for users to fully exploit long time series of climate reanalysis data (spanning more than 35 years), as generally more data than actually required has to be downloaded and then extensively processed on local machines.
As part of the EarthServer-2 project, ECMWF's MARS archive has been connected with an OGC-based standardized web service layer, in order to offer on-demand access to and server-based processing of ECMWF data for a non-meteorological audience. This connection transforms ECMWF's data archive into a flexible data cube that allows for the efficient retrieval and processing of geographical subsets and individual point data at the same time. Downloading ECMWF's ERA-Interim data is no longer required, and data access can be integrated directly into custom processing workflows.
The approach combines the efficient retrieval of data from MARS via a MARS request with intelligent data cube processing in rasdaman, an efficient array database technology. FeMME, a metadata management engine, is responsible for connecting the two processes and sending the data back to the user. The pilot allows users to request data via an OGC Web Coverage Processing Service (WCPS) request. The application then translates the WCPS request into an appropriate MARS request to retrieve data from the archive. The returned data is registered on the fly with rasdaman and, after being processed as requested, returned to the user in a standardized way and in a chosen format encoding (e.g. NetCDF, CSV or JSON).
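The translation step can be sketched roughly as follows; this is an illustrative reconstruction, not the pilot's actual mapping. The MARS keywords (class, stream, type, date, param, area, grid) are standard, but the specific values, the parameter code and the function itself are assumptions for the example.

```python
def wcps_subset_to_mars(date0, date1, lat, lon, param="167.128"):
    """Map the bounds of a WCPS subset request onto a MARS retrieval request.

    MARS expects the area keyword as "North/West/South/East".
    lat and lon are (min, max) tuples in degrees.
    """
    return {
        "class": "ei",             # ERA-Interim
        "stream": "oper",
        "type": "an",              # analysis fields
        "date": f"{date0}/to/{date1}",
        "param": param,            # e.g. 2 m temperature (illustrative code)
        "area": f"{lat[1]}/{lon[0]}/{lat[0]}/{lon[1]}",  # N/W/S/E
        "grid": "0.75/0.75",
    }

# A monthly European subset, as a WCPS trim over time, Lat and Long
# would be translated before hitting the archive:
req = wcps_subset_to_mars("2010-01-01", "2010-01-31",
                          lat=(40.0, 60.0), lon=(-10.0, 30.0))
```

The retrieved fields would then be registered with rasdaman and processed per the original WCPS expression before encoding the result.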
Based on a practical use case from the climate sciences, the presentation will showcase how open data from the Copernicus Climate Change Service in ECMWF's MARS archive can be retrieved and processed directly with a WC(P)S request. A specific focus will be placed on the benefits data providers and data users gain from offering and accessing large volumes of Earth science data as a data cube.
-
Paper 197 - Session title: Data Cube
15:00 The Earth System Data Cube Service
Brandt, Gunnar (1); Mahecha, Miguel (2); Fomferra, Norman (1); Permana, Hans (1); Gans, Fabian (2) 1: Brockmann Consult GmbH, Geesthacht, Germany; 2: Max Planck Institute for Biogeochemistry, Jena, Germany
The ability to measure and quantify our environment, including ourselves, has increased significantly in recent years. Big data is now affecting almost every aspect of life and there are no signs that this trend is going to cease any time soon. For the natural environment, Earth Observation has been driving the data revolution by developing and putting into operation many new sensors with high spatial resolution and observation frequency generating high-quality products on a global scale to which everyone has free access.
With this wealth of data on hand, the chances to describe and understand the complex dynamics of our planet, and particularly the role of humans in affecting the Earth System, are unprecedented. The bottleneck for many applications is, however, no longer the lack of suitable data, but the means to turn the huge amounts of available data from different sources into valuable information. Accessing several product archives, handling big data volumes, providing adequate computing and network resources, and working with different product formats and data models are challenging even for experienced users. Furthermore, coping with these issues consumes time and resources that are no longer available for the actual tasks of data exploitation and research.
The Earth System Data Cube (ESDC) service aims at reducing such overhead for the joint exploration of relevant parameters of the coupled biosphere-atmosphere system. Currently, more than 30 parameters, most of them from Earth Observation and numerical models, are processed to share a common spatio-temporal grid and a common data model that greatly facilitate multivariate analysis. The service is provided through a virtual research environment based on JupyterHub technology, which offers access to the ESDC, cloud processing, and APIs for Python and Julia. A growing number of tailored open source software tools for typical operations make working with the ESDC very efficient and enable collaboration between users. Moreover, an advanced visualisation application provides a graphical user interface for intuitive and interactive exploration of the ESDC.
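The value of a common spatio-temporal grid can be illustrated with a small sketch (plain NumPy, not the ESDC API itself): once two variables share the same (time, lat, lon) axes, a multivariate statistic such as a per-pixel correlation is a few array operations. Variable names and values below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
time, lat, lon = 24, 10, 20

# Two variables on the SAME (time, lat, lon) grid; "temp" is built to
# co-vary with "ndvi" so the correlation is visible.
ndvi = rng.random((time, lat, lon))
temp = 0.8 * ndvi + 0.2 * rng.random((time, lat, lon))

def pixelwise_corr(a, b):
    """Pearson correlation along the time axis for each grid cell."""
    a_anom = a - a.mean(axis=0)
    b_anom = b - b.mean(axis=0)
    cov = (a_anom * b_anom).mean(axis=0)
    return cov / (a.std(axis=0) * b.std(axis=0))

corr = pixelwise_corr(ndvi, temp)   # result has shape (lat, lon)
```

Without the shared grid, the same analysis would first require reprojecting, resampling and temporally aligning each product, which is exactly the overhead the service removes.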
We present here different use cases that are currently being implemented with users of the service. The examples, which comprise the calculation of marine primary productivity, biogeochemical model optimization, the development of a regional ecological observation system, and the combination of social data on human disasters with extreme events in the ESDC, underline the diversity of potential applications of the ESDC service. As the service matures towards operational status, we welcome new users to explore the potential of the ESDC for their own applications and to further advance the capabilities of the service with us.
-
Paper 201 - Session title: Data Cube
14:15 The Datacube Manifesto
Baumann, Peter (1); Merticariu, Vlad (1); Misev, Dimitar (1); Hogan, Patrick (2) 1: Jacobs University, Germany; 2: NASA, USA
Recently, the term datacube is receiving increasing attention as it has the potential of greatly simplifying "Big Earth Data" services for users by providing massive spatio-temporal data in an analysis-ready way. However, there is considerable confusion about the data and service model of such datacubes.
With this Manifesto we provide a concise, vendor-neutral, standards-aware definition of datacubes. The six simple rules are based on our two decades of experience in datacube modeling, query languages, architectures, distributed processing, standards development, and active deployment of petascale datacube services at some of the largest data centers worldwide. In particular, the intercontinental EarthServer initiative is demonstrating large-scale use of datacubes, with petabyte portals at the time of the conference and experimental data center federation between Europe and Australia. Further, the authors are editors and main writers of geo datacube standards in OGC and ISO, as well as of the generic datacube extension to ISO SQL.
We exemplify feasibility and advantages of datacube services based on large-scale running services and discuss seamless connection of servers and visual clients through open datacube standards.
We hope that by providing this crisp guidance we can clarify the discussion and support the assessment of technologies and services in this exciting new field.