ESRF takes the helm in saving data

04-02-2016

The ESRF has taken a decisive step towards the preservation of data sets with the adoption of a data policy. The implementation of this data policy reinforces the ESRF’s leading role in the advancement of science by advocating for full and open access to scientific data collected through publicly funded research after an embargo period of three years.


The main deliverable of large-scale research facilities is the production of new knowledge. Today’s information technology has transformed scientific investigation, and data are the raw material of science. With data volumes increasing constantly, improved storage infrastructures and data storage management are required so that data can be linked to publications and preserved for future verification and new research.

A three-year embargo period

The ESRF data policy is based on the PaNdata Data Policy, the result of a European FP7 project delivered in 2011. Under the policy, the ESRF will be the custodian of raw data and metadata. It will automatically collect metadata for all experiments carried out on its beamlines, including the beamlines from Collaborating Research Groups. The metadata will be stored in a metadata catalogue. The experimental team will have sole access to the data during a three-year embargo period, renewable if necessary. After the embargo, the data will be released into the public domain with open access.

“ESRF data will be traceable, verifiable and re-useable. The metadata and raw data will be archived for 10 years, with an option for longer archiving for more sensitive and unique data sets,” says Rudolf Dimper, ESRF head of Technical Infrastructure. The cost of archiving data is in many cases only a fraction of the cost of preparing the sample, shipping it to the ESRF and collecting the data.

Implementation of the data policy at the ESRF has already begun. Two beamlines have started collecting metadata in the metadata catalogue and nine beamlines will be depositing data into the archive before the end of the year. The ESRF aims for all its beamlines to deposit data in the data archive by 2020, when the ESRF restarts with the Extremely Brilliant Source (EBS).

“Implementation of the data policy at the ESRF is a huge task, involving all actors responsible for producing and storing data. At a rate of 10 beamlines per year, all the ESRF beamlines will be implementing the data policy by 2020,” says Andy Götz, ESRF head of Software.

The ESRF will archive the data

Currently, data at the ESRF are kept on disk for 50 days and are then deleted from the ESRF disks. The onus is on the user to keep a copy. Under the data policy, the ESRF will archive the data, relieving the user of this burden. Today, the ESRF collects and stores 2 petabytes (PB) of data per year. In 2025, this figure is expected to climb to 15 PB per year. If that is not an easy figure to grasp, imagine that one petabyte of songs recorded on an MP3 player would last over 2000 years played continuously. In DVD terms, 1 PB is equivalent to 223 000 DVDs of 4.7 GB each!
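The comparisons above can be checked with a short back-of-the-envelope calculation. The article does not say which unit convention or MP3 bit rate it used, so the sketch below makes two assumptions: binary units (1 PB = 2^50 bytes, 1 GB = 2^30 bytes) and a constant bit rate of 128 kbit/s. The function names are illustrative only.

```python
# Rough check of the "1 PB" comparisons.
# ASSUMPTIONS (not stated in the article):
#   - binary units: 1 PB = 2**50 bytes, 1 GB = 2**30 bytes
#   - MP3 audio encoded at a constant 128 kbit/s

PB = 2**50  # bytes in one petabyte (binary convention, assumed)
GB = 2**30  # bytes in one gigabyte (binary convention, assumed)


def dvds_per_petabyte(dvd_capacity_gb: float = 4.7) -> float:
    """Number of DVDs of the given capacity needed to hold 1 PB."""
    return PB / (dvd_capacity_gb * GB)


def mp3_years_per_petabyte(bitrate_kbit_s: float = 128.0) -> float:
    """Years of continuous playback contained in 1 PB of MP3 audio."""
    bytes_per_second = bitrate_kbit_s * 1000 / 8
    seconds = PB / bytes_per_second
    return seconds / (365.25 * 24 * 3600)


if __name__ == "__main__":
    print(f"DVDs per PB: {dvds_per_petabyte():,.0f}")
    print(f"Years of MP3 per PB: {mp3_years_per_petabyte():,.0f}")
```

Under these assumptions the DVD count comes out at roughly 223 000 and the playback time at a little over 2000 years, in line with the figures quoted. With decimal units (1 PB = 10^15 bytes) both numbers would be somewhat lower.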

Anticipating additional storage requirements, and as part of the ESRF Upgrade Programme, the ESRF inaugurated a second data centre in May 2011 to provide a high-quality environment for central data storage. The new centre is equipped with state-of-the-art file servers capable of storing almost 4 petabytes, a tape-based archiving facility of tens of petabytes, compute clusters with a peak performance of 15 teraflops and an extensive high-performance Ethernet infrastructure. Storage capacity can easily be extended thanks to pre-installed power, cooling and networking resources.

A key part of the ESRF data policy is the tape archive system. Tapes are best adapted for archiving data over the long term because they do not consume power and hence require no additional cooling. The ESRF has two fully automated, robot-operated tape libraries. Two copies are kept of each data set to ensure redundancy in case of fire or other major failure of the tape system. The tape storage will be increased significantly to handle the data volumes. The ESRF will also benefit from improvements in tape technology, with a tenfold increase in tape capacity foreseen in the medium term.

ESRF users will be required to accept the terms of the data policy at the time of applying for beamtime, starting from the September 2016 proposal round.

 

Further information

Full text of the ESRF Data Policy (PDF file)

More on the ESRF data centre - http://www.esrf.fr/news/general/data-centre/index_html

More about PaNdata - http://pan-data.eu/

 

Top image: The high-performance data communication network is at the heart of all data movements at the ESRF. Credit: ESRF/I. Ginzburg