
1 Basic Information:

1.1 What is the project name or acronym?

1.2 Who is most likely to benefit from the data?

1.3 Other DMP Metadata

1.4 Please select from the following options

Which endpoint repositories will you submit your data to?

2. What kind of data will you handle?

3. How much data are you likely to generate?

Raw data (GB):
Derived data (GB):

4. Are any of the following standards relevant to your project?

4.1 Will you adhere to any high-level metadata submission standards?

4.2 When will you make your data public?

5. Do you intend to use data visualization in your project?

The project aim should be part of a sentence.

Example 1: aims at creating a computational model of carbon and water flow within a whole plant architecture


Example 2: aims at generating a data management plan with minimal effort and making the data as open as possible

The project object is the target of the study.

Example 1: carbon and water flow in plants


Example 2: data management plan

Here is space for an additional sentence.

Example 1: Industry, politicians and students can also use the data for different purposes.


Example 2: The data acquired in the project can be used by a wide range of people for different purposes.

Information in this section is used only in the DMP metadata and does not appear in the document.

Data officers are also known as data stewards and curators.

Proprietary software: software that legally remains the property of the organization, group, or individual who created it.

User-defined template

You can click the dotted box to start editing.
Click the grey buttons to reuse templates.
Click submit when you have finished.
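The templates below consist of `$_VARIABLE` placeholders and `#if$_FLAG ... #endif$_FLAG` (or negated `#if!$_FLAG ... #endif!$_FLAG`) conditional blocks. As an illustration only (a sketch of the syntax, not DataPLAN's actual implementation), a minimal renderer for this kind of markup could look like:

```python
import re

def render(template: str, values: dict) -> str:
    """Sketch of a renderer for $_VAR placeholders and
    #if$_X ... #endif$_X / #if!$_X ... #endif!$_X blocks.
    Illustration only -- not DataPLAN's actual implementation."""
    def resolve(match):
        # Keep the body if the flag is set (or, for #if!, if it is NOT set).
        negated = match.group(1) == "!"
        name, body = match.group(2), match.group(3)
        keep = bool(values.get(name)) != negated
        return body if keep else ""

    # A conditional block: '#if', optional '!', '$_NAME', body, matching '#endif'.
    cond = re.compile(r"#if(!?)\$_(\w+)\s?(.*?)\s?#endif\1\$_\2", re.DOTALL)
    template = cond.sub(resolve, template)
    # Substitute remaining $_VAR placeholders; unknown ones stay visible.
    return re.sub(r"\$_(\w+)",
                  lambda m: str(values.get(m.group(1), m.group(0))),
                  template)

text = "#if$_EU The $_PROJECT is part of the ODI of the EU. #endif$_EU Data will be FAIR."
print(render(text, {"EU": True, "PROJECT": "MyProject"}))
# The MyProject is part of the ODI of the EU. Data will be FAIR.
```

Flags that are unset simply drop their block, and unknown placeholders are left untouched so that missing answers remain visible in the generated document.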



Data Management Plan of the H2020 Project $_PROJECT

Action Number:

$_PROJECT

Action Acronym:

$_PROJECT

Action Title:

$_PROJECT

Date:

DMP version:

$_DMPVERSION


1    Introduction

#if$_EU The $_PROJECT is part of the Open Data Initiative (ODI) of the EU. #endif$_EU To best profit from open data, it is necessary not only to store the data but to make it Findable, Accessible, Interoperable, and Reusable (FAIR).#if$_PROTECT We support open and FAIR data; however, we also consider the need to protect individual data sets. #endif$_PROTECT

The aim of this document is to provide guidelines on the principles of data management in the $_PROJECT and to specify which types of data will be stored. This will be achieved by using the responses to the EU questionnaire on Data Management Plans (DMPs) as the DMP document.

The detailed DMP states how data will be handled during and after the project. The $_PROJECT DMP is prepared according to the Horizon 2020 and Horizon Europe online manual. #if$_UPDATE It will be updated, and its validity checked, several times during the $_PROJECT. At the very least, this will happen at month $_UPDATEMONTH. #endif$_UPDATE

2    Data Management Plan EU Template

2.1    Data Summary

What is the purpose of the data collection/generation and its relation to the objectives of the project?

The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process is absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand principles but also to inform further data analysis. Stakeholders must also be informed about the provenance of the data. It is therefore necessary to ensure that the data are well generated and well annotated with metadata using open standards, as laid out in the next section.

What types and formats of data will the project generate/collect?

The $_PROJECT will collect and/or generate the following types of raw data: $_PHENOTYPIC, $_GENETIC, $_IMAGE, $_RNASEQ, $_GENOMIC, $_METABOLOMIC, $_PROTEOMIC, $_TARGETED, $_MODELS, $_CODE, $_EXCEL, $_CLONED-DNA data, which are related to $_STUDYOBJECT. In addition, the raw data will also be processed and modified using analytical pipelines, which may yield different results or include ad hoc data analysis steps. #if$_DATAPLANT These pipelines will be tracked in the DataPLANT ARC. #endif$_DATAPLANT Therefore, care will be taken to document and archive these resources (including the analytical pipelines) as well#if$_DATAPLANT, relying on the expertise in the DataPLANT consortium#endif$_DATAPLANT.

Will you re-use any existing data and how?

The project builds on existing data sets and relies on them. #if$_RNASEQ For example, without a proper genomic reference it is very difficult to analyze next-generation sequencing (NGS) data sets. #endif$_RNASEQ It is also important to include existing data sets on the expression and metabolic behavior of the $_STUDYOBJECT, and on existing background knowledge#if$_PARTNERS of the partners#endif$_PARTNERS. Genomic references can be gathered from reference databases for genomes and sequences, such as the US National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ). Furthermore, prior 'unstructured' data in the form of publications and the data contained therein will be used for decision making.

What is the origin of the data?

Public data will be extracted as described in the previous paragraph. For the $_PROJECT, specific data sets will be generated by the consortium partners.

Data of different types or representing different domains will be generated using unique approaches. For example:

#if$_PREVIOUSPROJECTS

Data from previous projects such as $_PREVIOUSPROJECTS will be considered.

#endif$_PREVIOUSPROJECTS

What is the expected size of the data?

We expect to generate $_RAWDATA GB of raw data and up to $_DERIVEDDATA GB of processed data.

To whom might it be useful ('data utility')?

The data will initially benefit the $_PROJECT partners, but will also be made available to selected stakeholders closely involved in the project, and then to the scientific community working on $_STUDYOBJECT. $_DATAUTILITY In addition, the general public interested in $_STUDYOBJECT can also use the data after publication. The data will be disseminated according to the $_PROJECT's dissemination and communication plan#if$_DATAPLANT, which aligns with the DataPLANT platform and other dissemination means#endif$_DATAPLANT.

2.2    FAIR data

Making data findable, including provisions for metadata

Are the data produced and/or used in the project discoverable with metadata, identifiable and locatable by means of a standard identification mechanism (e.g. persistent and unique identifiers such as Digital Object Identifiers)?

All datasets will be associated with unique identifiers and will be annotated with metadata. #if$_MIAPPE The $_PROJECT will rely on community standards plus additional recommendations applicable in plant science, such as the Minimum Information About a Plant Phenotyping Experiment (MIAPPE). #endif$_MIAPPE Unlike cross-domain minimal sets such as Dublin Core, which mostly define the submitter and the general type of data, these standards allow reusability by other researchers by defining properties of the plant (see the preceding section). However, minimal cross-domain annotations also remain part of the $_PROJECT. #if$_DATAPLANT The core integration with DataPLANT will also allow individual releases to be tagged with a Digital Object Identifier (DOI). #endif$_DATAPLANT #if$_OTHERSTANDARDS $_OTHERSTANDARDINPUT #endif$_OTHERSTANDARDS

What naming conventions do you follow?

Data variables will be allocated standard names. For example, genes, proteins and metabolites will be named according to approved nomenclature and conventions. These will also be linked to functional ontologies where possible. Datasets will also be named in a meaningful way to ensure readability by humans. Plant names will include traditional names, binomials, and all strain/cultivar/subspecies/variety identifiers.

Will search keywords be provided that optimize possibilities for re-use?

Keywords about the experiment and consortium will be included, as well as an abstract about the data, where useful. In addition, certain keywords can be auto-generated from dense metadata and its underlying ontologies. #if$_DATAPLANT Here, DataPLANT strives to complement these with standardized DataPLANT ontologies that are provided where the ontology does not yet include such variables. #endif$_DATAPLANT

Do you provide clear version numbers?

To maintain data integrity and facilitate reanalysis, data sets will be allocated version numbers where this is useful (e.g. raw data must not be changed and will not get a version number and is considered immutable). #if$_DATAPLANT This is automatically supported by the ARC Git DataPLANT infrastructure. #endif$_DATAPLANT

What metadata will be created? In case metadata standards do not exist in your discipline, please outline what type of metadata will be created and how.

We will use the Investigation, Study, Assay (ISA) specification for metadata creation. #if$_RNASEQ|$_GENOMIC For specific data (e.g., RNASeq or genomic data), we use metadata templates from the end-point repositories. #if$_MINSEQE The Minimum Information About a Next-generation Sequencing Experiment (MinSEQe) will also be used. #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC #if$_METABOLOMIC MetaboLights-compliant submission standards will be used for metabolomic data where this is accepted by the consortium partners. #issuewarning some metabolomics partners consider MetaboLights not an accepted standard #endissuewarning #endif$_METABOLOMIC As part of the plant research community, we use #if$_MIAPPE MIAPPE for phenotyping data in the broadest sense, but we will also rely on #endif$_MIAPPE specific SOPs for additional annotations#if$_DATAPLANT that consider advanced DataPLANT annotation and ontologies#endif$_DATAPLANT.
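For readers unfamiliar with ISA, the investigation file of an ISA-Tab archive is a tab-separated text file of labelled sections. The section and field names below follow the ISA-Tab specification, but the identifiers, titles, and file names are invented placeholders:

```
INVESTIGATION
Investigation Identifier        PROJ-I-1
Investigation Title             Example investigation title
STUDY
Study Identifier                PROJ-S-1
Study Title                     Example drought-response study
Study File Name                 s_drought_study.txt
STUDY ASSAYS
Study Assay Measurement Type    transcription profiling
Study Assay File Name           a_rnaseq_assay.txt
```

The study and assay files referenced here then hold the sample-level and measurement-level metadata, respectively.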

Making data openly accessible

Which data produced and/or used in the project will be made openly available as the default? If certain datasets cannot be shared (or need to be shared under restrictions), explain why, clearly separating legal and contractual reasons from voluntary restrictions.

By default, all data sets from the $_PROJECT will be shared with the community and made openly available. However, before the data are released, all partners will be given an opportunity to check for potential IP issues (according to the consortium agreement and background IP rights). #if$_INDUSTRY This applies in particular to data pertaining to industry. #endif$_INDUSTRY IP protection will be prioritized for datasets that offer the potential for exploitation.

Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if relevant provisions are made in the consortium agreement and are in line with the reasons for opting out.

How will the data be made accessible (e.g. by deposition in a repository)?

#if!$_DATAPLANT Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. In addition, international discipline-specific repositories that use specialized technologies (sequencing at the #if$_NCBI US national center: NCBI;#endif$_NCBI #if$_GEO Gene Expression Omnibus: GEO;#endif$_GEO European Bioinformatics Institute (EBI) archives: #if$_ENA European Nucleotide Archive: ENA;#endif$_ENA #if$_ARRAYEXPRESS Functional Genomics Data Archive: ArrayExpress;#endif$_ARRAYEXPRESS #if$_PRIDE proteome database: PRIDE;#endif$_PRIDE #if$_METABOLIGHTS metabolomics database: MetaboLights;#endif$_METABOLIGHTS #if$_OTHEREP and $_OTHEREP#endif$_OTHEREP) will be used to store the data, and the data will be processed there as well. #endif!$_DATAPLANT

Specialized repositories will be used where appropriate, such as INSDC (GenBank, EBI, DDBJ) for nucleotide sequence data, PIR/UniProt/SWISS-PROT for proteins, PDB for protein structures, GEO for transcriptomics, PRIDE for proteomics data, and METLIN for metabolomics data. Unstructured and less standardized data (e.g., experimental phenotypic measurements) will be annotated with metadata and, if complete, allocated a digital object identifier (DOI). #if$_DATAPLANT Whole datasets will also be wrapped into an ARC with allocated DOIs. The ARC and the converters provided by DataPLANT will ensure that the upload into the endpoint repositories is fast and easy. #endif$_DATAPLANT

What methods or software tools are needed to access the data?

#if$_PROPRIETARY The $_PROJECT relies on the tool(s) $_PROPRIETARY. #endif$_PROPRIETARY

#if!$_PROPRIETARY No specialized software will be needed to access the data, just a modern browser. Access will be possible through web interfaces. For data processing after obtaining raw data, typical open-source software can be used. #endif!$_PROPRIETARY

#if$_DATAPLANT DataPLANT offers tools such as the open-source SWATE plugin for Excel, the ARC Commander, and DataPLAN. #endif$_DATAPLANT

Is documentation about the software needed to access the data included?

#if$_DATAPLANT DataPLANT resources are well described, and their setup is documented on the GitHub project pages. #endif$_DATAPLANT All external software documentation will be duplicated locally and stored near the software.

Is it possible to include the relevant software (e.g. in open-source code)?

As stated above, the $_PROJECT will use publicly available, well-documented, open-source certified software#if$_PROPRIETARY, except for $_PROPRIETARY#endif$_PROPRIETARY.

Where will the data and associated metadata, documentation and code be deposited? Preference should be given to certified repositories that support open access, where possible.

As noted above, specialized repositories will be used for common data types. Unstructured and less standardized data (e.g., experimental phenotypic measurements) will be annotated with metadata and, if complete, allocated a digital object identifier (DOI). #if$_DATAPLANT Whole datasets will also be wrapped into an ARC with allocated DOIs. #endif$_DATAPLANT

Have you explored appropriate arrangements with the identified repository?

Submission is free of charge, and it is the goal (at least of ENA) to obtain as much data as possible. Therefore, special arrangements are neither necessary nor useful, and catch-all repositories are not required#if$_DATAPLANT; this has been confirmed for data associated with DataPLANT#endif$_DATAPLANT. #issuewarning If no data management platform such as DataPLANT is used, you need to find an appropriate repository to store or archive your data after publication. #endissuewarning

If there are restrictions on use, how will access be provided?

There are no restrictions beyond the IP screening described above, which is in line with European open data policies.

Is there a need for a data access committee?

There is no need for a data access committee.

Are there well described conditions for access (i.e. a machine-readable license)?

Yes. Where possible, a machine-readable license such as CC REL will be used for data not submitted to specialized repositories such as ENA.
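As an illustration of what such a machine-readable license statement could look like (the cc: vocabulary is the real ccREL namespace, but the dataset URL and attribution values are invented), CC REL can be embedded as RDFa in a dataset's landing page:

```
<div xmlns:cc="http://creativecommons.org/ns#"
     about="https://example.org/datasets/42">
  This dataset is licensed under a
  <a rel="license" href="https://creativecommons.org/licenses/by/4.0/">
     Creative Commons Attribution 4.0 International License</a>.
  Please attribute
  <span property="cc:attributionName">the project consortium</span>
  (<a rel="cc:attributionURL" href="https://example.org/">project page</a>).
</div>
```

Crawlers and license-aware tools can then extract the license and attribution requirements without parsing free text.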

How will the identity of the person accessing the data be ascertained?

Where data are shared only within the consortium, because the datasets are not yet finished or are undergoing IP checks, the data will be hosted internally, and a username and password will be required for access (see GDPR rules). When the data are made public in EU or US repositories, completely anonymous access is normally allowed. This is the case for ENA as well, and both approaches are in line with GDPR requirements.

#if$_DATAPLANT Currently, data management relies on the Annotated Research Context (ARC). It is password protected, so user authentication is required before any data or samples can be obtained. #endif$_DATAPLANT

Making data interoperable

Are the data produced in the project interoperable, that is allowing data exchange and re-use between researchers, institutions, organizations, countries, etc. (i.e. adhering to standards for formats, as much as possible compliant with available (open) software applications, and in particular facilitating re-combinations with different datasets from different origins)?

Whenever possible, data will be stored in common and openly defined formats, including all the metadata necessary to interpret and analyze the data in a biological context. By default, no proprietary formats will be used. However, Microsoft Excel files (according to ISO/IEC 29500-1:2016) might be used as intermediates by the consortium#if$_DATAPLANT and by some ARC components#endif$_DATAPLANT. In addition, text documents might be edited in word-processor formats but will be shared as PDF.

What data and metadata vocabularies, standards or methodologies will you follow to make your data interoperable?

As noted above, we foresee using minimal standards such as #if$_RNASEQ|$_GENOMIC #if$_MINSEQE MinSEQe for sequencing data and #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC MetaboLights-compatible forms for metabolites#if$_MIAPPE and MIAPPE for phenotyping-like data#endif$_MIAPPE. The minimal information standards will allow the integration of data across projects and their reuse according to established and tested protocols. We will also use ontological terms to enrich the data sets, relying on free and open ontologies where possible. Additional ontology terms might be created and canonized during the $_PROJECT.

Will you be using standard vocabularies for all data types present in your data set, to allow inter-disciplinary interoperability?

Open ontologies will be used where they are mature. As stated above, some ontologies and controlled vocabularies might need to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the advanced ontologies developed in DataPLANT. #endif$_DATAPLANT

In case it is unavoidable that you use uncommon or generate project specific ontologies or vocabularies, will you provide mappings to more commonly used ontologies?

Common and open ontologies will be used, so this question does not apply.

Increase data reuse (by clarifying licences)

How will the data be licensed to permit the widest re-use possible?

Open licenses, such as Creative Commons (CC), will be used whenever possible.

When will the data be made available for re-use? If an embargo is sought to give time to publish or seek patents, specify why and how long this will apply, bearing in mind that research data should be made available as soon as possible.

#if$_early The data will be published as soon as possible to guarantee reusability. #endif$_early #if$_ipissue IP issues will be checked before publication. #endif$_ipissue All consortium partners will be encouraged to make data available before publication, openly and/or under pre-publication agreements #if$_GENOMIC such as those started in Fort Lauderdale and set forth by the Toronto International Data Release Workshop. #endif$_GENOMIC This will be implemented as soon as IP-related checks are complete.

Are the data produced and/or used in the project usable by third parties, in particular after the end of the project? If the re-use of some data is restricted, explain why.

There will be no restrictions once the data are made public.

How long is it intended that the data remains re-usable?

The data will be made available for many years#if$_DATAPLANT and ideally indefinitely after the end of the project#endif$_DATAPLANT.

Data submitted to repositories (as detailed above), e.g. ENA or PRIDE, will be subject to local data-storage regulations.

Are data quality assurance processes described?

The data will be checked and curated. #if$_DATAPLANT Furthermore, data will be quality controlled (QC) using automatic procedures as well as manual curation #endif$_DATAPLANT.

2.3    Allocation of resources

What are the costs for making data FAIR in your project?

The $_PROJECT will bear the costs of data curation, #if$_DATAPLANT ARC consistency checks, #endif$_DATAPLANT and data maintenance/security before transfer to public repositories. Subsequent costs are then borne by the operators of these repositories.

Additionally, costs for post-publication storage are incurred by the end-point repositories (e.g., ENA); these are not charged to the $_PROJECT or its members but are covered by the operating budgets of these repositories.

How will these be covered? Note that costs related to open access to research data are eligible as part of the Horizon 2020 or Horizon Europe grant (if compliant with the Grant Agreement conditions).

The costs borne by the $_PROJECT are covered by the project funding. Pre-existing resources#if$_DATAPLANT, such as the structures, tools, and knowledge laid down in the DataPLANT consortium,#endif$_DATAPLANT will also be used.

Who will be responsible for data management in your project?

The responsible person will be $_DATAOFFICER of the $_PROJECT.

Are the resources for long term preservation discussed (costs and potential value, who decides and how/what data will be kept and for how long)?

The data officer #if$_PARTNERS or $_PARTNERS #endif$_PARTNERS will ultimately decide on the strategy to preserve data that are not submitted to end-point subject-area repositories #if$_DATAPLANT or ARCs in DataPLANT #endif$_DATAPLANT when the project ends. This will be in line with EU guidelines, institute policies, and data sharing based on EU and international standards.

2.4    Data security

What provisions are in place for data security (including data recovery as well as secure storage and transfer of sensitive data)?

Online platforms will be protected by vulnerability scanning, two-factor authentication, and daily automatic backups allowing immediate recovery. All partners holding confidential project data are required to use secure platforms with automatic backups and offsite secure copies. #if$_DATAPLANT Where the DataHUB and ARCs generated in DataPLANT are used, data security will be enforced. This comprises secure storage and the use of usernames and passwords, which are generally transferred via separate secure media. #endif$_DATAPLANT

Is the data safely stored in certified repositories for long term preservation and curation?

Wherever certified repositories exist, these will be used as end-point repositories. #if$_RNASEQ Transcriptomics data and gene sequence data will also be made available upon publication via standard repositories such as ENA/SRA, #endif$_RNASEQ #if$_METABOLOMIC metabolite data in e.g. MetaboLights (and/or nationwide repositories like the German NFDI or the French INRAe), #endif$_METABOLOMIC #if$_PROTEOMIC proteomics data in e.g. PRIDE/ProteomeXchange #endif$_PROTEOMIC. In addition, the national resources will maintain safekeeping of the data after the project ends. #if$_DATAPLANT However, databases such as ProteomeXchange do not support deep plant-specific metadata; hence, ARCs will be maintained to ensure the reusability of plant-specific metadata. #endif$_DATAPLANT

2.5    Ethical aspects

Are there any ethical or legal issues that can have an impact on data sharing? These can also be discussed in the context of an ethics review. If relevant, include references to ethics deliverables and ethics chapter in the Description of the Action (DoA).

At the moment, we do not anticipate ethical or legal issues with data sharing. Since this is plant data, there is no need for an ethics committee to deal with the data, although we will diligently follow the Nagoya Protocol on access and benefit sharing. #issuewarning You have to check and enter any due-diligence obligations here; at the moment it is not yet decided whether the Nagoya Protocol will also cover sequence information. In any case, if you use material that is not from your (partner) country and characterize it physically or biochemically (e.g., metabolites, proteome, RNASeq), this might represent a Nagoya-relevant action, unless the material is from a non-party country such as the US, or from a signatory that has not ratified, such as Ireland (still contact them); other laws might also apply. #endissuewarning

Is informed consent for data sharing and long term preservation included in questionnaires dealing with personal data?

The only personal data that will potentially be stored are the submitter name and affiliation in the metadata. In addition, personal data will be collected for dissemination and communication activities using specific methods and procedures developed by the $_PROJECT partners to adhere to data protection. #issuewarning You need to inform people, and preferably obtain written consent, if you store emails and names, or even pseudonyms such as Twitter handles. #endissuewarning

2.6    Other issues

Do you make use of other national/funder/sectorial/departmental procedures for data management? If yes, which ones?

Yes, the $_PROJECT will use common Research Data Management (RDM) tools#if$_DATAPLANT and in particular resources developed by the NFDI of Germany#endif$_DATAPLANT.

3     Annexes

3.1     Abbreviations

#if$_DATAPLANT

ARC Annotated Research Context

#endif$_DATAPLANT

CC Creative Commons

CC REL Creative Commons Rights Expression Language

DDBJ DNA Data Bank of Japan

DMP Data Management Plan

DoA Description of Action

DOI Digital Object Identifier

EBI European Bioinformatics Institute

ENA European Nucleotide Archive

EU European Union

FAIR Findable Accessible Interoperable Reusable

GDPR General Data Protection Regulation (of the EU)

IP Intellectual Property

ISO International Organization for Standardization

MIAMET Minimal Information About a Metabolite Experiment

MIAPPE Minimal Information about Plant Phenotyping Experiment

MinSEQe Minimum Information about a high-throughput Sequencing Experiment

NCBI National Center for Biotechnology Information

NFDI National Research Data Infrastructure (of Germany)

NGS Next Generation Sequencing

RDM Research Data Management

RNASeq RNA Sequencing

SOP Standard Operating Procedures

SRA Sequence Read Archive

#if$_DATAPLANT

SWATE Swate Workflow Annotation Tool for Excel

#endif$_DATAPLANT

ONP Oxford Nanopore

qRTPCR quantitative real time polymerase chain reaction

WP Work Package


Data Management Plan of the Horizon Europe Project $_PROJECT

Action Number:

$_PROJECT

Action Acronym:

$_PROJECT

Action Title:

$_PROJECT

Date:

17 March 2023

DMP version:

$_DMPVERSION


Introduction

#if$_EU The $_PROJECT is part of the Open Data Initiative (ODI) of the EU. #endif$_EU To best profit from open data, it is necessary not only to store the data but to make it Findable, Accessible, Interoperable, and Reusable (FAIR).#if$_PROTECT We support open and FAIR data; however, we also consider the need to protect individual data sets. #endif$_PROTECT

The aim of this document is to provide guidelines on the principles of data management in the $_PROJECT and to specify which data will be stored, using the responses to the EU questionnaire on Data Management Plans (DMPs) as the DMP document.

The detailed DMP states how data will be handled during and after the project. The $_PROJECT DMP is prepared according to the Horizon Europe online manual. #if$_UPDATE It will be updated, and its validity checked, several times during the $_PROJECT. At the very least, this will happen at month $_UPDATEMONTH. #endif$_UPDATE

1.    Data Summary

Will you re-use any existing data and what will you re-use it for? State the reasons if re-use of any existing data has been considered but discarded.

The project builds on existing data sets and relies on them. #if$_RNASEQ For instance, without a proper genomic reference it is very difficult to analyze NGS data sets. #endif$_RNASEQ It is also important to include existing data sets on the expression and metabolic behaviour of $_STUDYOBJECT, on existing characterization, and on background knowledge#if$_PARTNERS of the partners#endif$_PARTNERS. Genomic references can simply be gathered from reference databases for genomes and sequences, such as the National Center for Biotechnology Information (NCBI, US), the European Bioinformatics Institute (EBI, EU), and the DNA Data Bank of Japan (DDBJ, JP). Furthermore, prior 'unstructured' data in the form of publications and the data contained therein will be used for decision making.

What types and formats of data will the project generate or re-use?

The $_PROJECT will collect and/or generate the following types of raw data: $_PHENOTYPIC, $_GENETIC, $_IMAGE, $_RNASEQ, $_GENOMIC, $_METABOLOMIC, $_PROTEOMIC, $_TARGETED, $_MODELS, $_CODE, $_EXCEL, $_CLONED-DNA data, which are related to $_STUDYOBJECT. In addition, the raw data will also be processed and modified using analytical pipelines, which may yield different results or include ad hoc data analysis steps. #if$_DATAPLANT These pipelines will be tracked in the DataPLANT ARC. #endif$_DATAPLANT Therefore, care will be taken to document and archive these resources (including the analytical pipelines) as well#if$_DATAPLANT, relying on the expertise in the DataPLANT consortium#endif$_DATAPLANT.

What is the purpose of the data generation or re-use and its relation to the objectives of the project?

The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process is absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand principles but also to inform further data analysis. Stakeholders must also be informed about the provenance of the data. It is therefore necessary to ensure that the data are well generated and well annotated with metadata using open standards, as laid out in the next section.

What is the expected size of the data that you intend to generate or re-use?

We expect to generate raw data in the range of $_RAWDATA GB. The size of the derived data will be about $_DERIVEDDATA GB.

What is the origin/provenance of the data, either generated or re-used?

Public data will be extracted as described in the previous paragraph. For the $_PROJECT, specific data sets will be generated by the consortium partners.

Data of different types or representing different domains will be generated using unique approaches. For example:

#if$_PREVIOUSPROJECTS

Data from previous projects such as $_PREVIOUSPROJECTS will be considered.

#endif$_PREVIOUSPROJECTS

To whom might it be useful ('data utility'), outside your project?

The data will initially benefit the $_PROJECT partners, but will also be made available to selected stakeholders closely involved in the project, and then to the scientific community working on $_STUDYOBJECT. $_DATAUTILITY In addition, the general public interested in $_STUDYOBJECT can also use the data after publication. The data will be disseminated according to the $_PROJECT's dissemination and communication plan#if$_DATAPLANT, which aligns with the DataPLANT platform and other dissemination means#endif$_DATAPLANT.


2    FAIR data

2.1. Making data findable, including provisions for metadata

Will data be identified by a persistent identifier?

All data sets will receive unique identifiers, and they will be annotated with metadata.

Will rich metadata be provided to allow discovery? What metadata will be created? What disciplinary or general standards will be followed? In case metadata standards do not exist in your discipline, please outline what type of metadata will be created and how.

#if$_MIAPPE The $_PROJECT will rely on community standards plus additional recommendations necessary in plant science, adapted by e.g. using suggestions from the Minimum Information About a Plant Phenotyping Experiment (MIAPPE). #endif$_MIAPPE Unlike cross-domain minimal sets such as Dublin Core, which mostly define the submitter and the general type of data being dealt with (e.g., images), these standards allow reusability by other researchers, as they also define properties of the plant (see the preceding section). However, minimal cross-domain annotations also remain part of the $_PROJECT. #if$_DATAPLANT The core integration with DataPLANT will also allow individual releases to be tagged with a Digital Object Identifier (DOI). #endif$_DATAPLANT #if$_OTHERSTANDARDS $_OTHERSTANDARDINPUT #endif$_OTHERSTANDARDS

Will search keywords be provided in the metadata to optimize the possibility for discovery and then potential re-use?

Keywords about the experiment and the general consortium will be included, as well as an abstract about the data, where useful. In addition, certain keywords can be auto-generated from dense metadata and its underlying ontologies. #if$_DATAPLANT Here, DataPLANT strives to complement these with standardized DataPLANT ontologies that are supplemented where the ontology does not yet include the variables. #endif$_DATAPLANT
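The auto-generation of search keywords from dense metadata can be sketched as follows. This is a minimal illustration only; the field names (`organism`, `measurement`, `technology`, `ontology_terms`) are hypothetical and do not correspond to a fixed DataPLANT schema.

```python
# Illustrative sketch: deriving search keywords from structured metadata.
# Field names are hypothetical, not a mandated schema.

def derive_keywords(metadata: dict) -> list[str]:
    """Collect unique, lower-cased keywords from annotated metadata fields."""
    keywords = set()
    for field in ("organism", "measurement", "technology"):
        value = metadata.get(field)
        if value:
            keywords.add(value.lower())
    # Ontology annotations carry a human-readable label and an accession;
    # the label makes a useful search keyword.
    for term in metadata.get("ontology_terms", []):
        keywords.add(term["label"].lower())
    return sorted(keywords)

record = {
    "organism": "Arabidopsis thaliana",
    "measurement": "metabolite profiling",
    "technology": "mass spectrometry",
    "ontology_terms": [{"label": "leaf", "accession": "PO:0025034"}],
}
print(derive_keywords(record))
# -> ['arabidopsis thaliana', 'leaf', 'mass spectrometry', 'metabolite profiling']
```

Because the keywords are drawn from ontology term labels, they remain consistent across datasets annotated with the same controlled vocabulary.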

Will metadata be offered in such a way that it can be harvested and indexed?

To maintain data integrity and to be able to re-analyze data, data sets will receive version numbers where this is useful (raw data, for example, must not be changed, is considered immutable, and will not receive a version number). #if$_DATAPLANT This is automatically supported by the ARC Git DataPLANT infrastructure. #endif$_DATAPLANT Data variables will be allocated standard names. For example, genes, proteins and metabolites will be named according to approved nomenclature and conventions. These will also be linked to functional ontologies where possible. Datasets will also be named in a meaningful way to ensure readability by humans. Plant names will include traditional names, binomials, and all strain/cultivar/subspecies/variety identifiers.

2.2.    Making data accessible

Repository

Will the data be deposited in a trusted repository?

#if!$_DATAPLANT Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. In addition, data will be stored and processed in international, discipline-specific repositories that use specialized technologies (sequencing at the #if$_NCBI US national center: NCBI;#endif$_NCBI #if$_GEO Gene Expression Omnibus: GEO;#endif$_GEO European Bioinformatics Institute (EBI) archives: #if$_ENA European Nucleotide Archive: ENA;#endif$_ENA #if$_ARRAYEXPRESS Functional Genomics Data Archive: ArrayExpress;#endif$_ARRAYEXPRESS #if$_PRIDE proteome database: PRIDE;#endif$_PRIDE #if$_METABOLIGHTS metabolomics database: MetaboLights;#endif$_METABOLIGHTS #if$_OTHEREP and $_OTHEREP #endif$_OTHEREP). #endif!$_DATAPLANT

Specialized repositories will be used where appropriate, such as the INSDC archives (GenBank, EBI, DDBJ) for nucleotide sequence data, PIR/UniProt/SWISS-PROT for proteins, PDB for protein structures, GEO for transcriptomics, PRIDE for proteomics data, and METLIN for metabolomics data. Unstructured and less standardized data (e.g., experimental phenotypic measurements) will be annotated with metadata and, when complete, allocated a digital object identifier (DOI). #if$_DATAPLANT Whole datasets will also be wrapped into an ARC with allocated DOIs. The ARC and the converters provided by DataPLANT will ensure that the upload into the endpoint repositories is fast and easy. #endif$_DATAPLANT

Have you explored appropriate arrangements with the identified repository where your data will be deposited?

Submission is free of charge, and it is the goal (at least of ENA) to collect as much data as possible. Therefore, special arrangements are neither necessary nor useful, and catch-all repositories are not required. #if$_DATAPLANT For DataPLANT, this has been agreed upon, as all the omics repositories of the International Nucleotide Sequence Database Collaboration (INSDC) will be used. #endif$_DATAPLANT #issuewarning If no data management platform such as DataPLANT is used, you need to find an appropriate repository to store or archive your data after publication. #endissuewarning

Does the repository ensure that the data is assigned an identifier? Will the repository resolve the identifier to a digital object?

#if!$_DATAPLANT The specialized endpoint repositories listed above (e.g. NCBI/GEO, ENA, ArrayExpress, PRIDE, MetaboLights) assign persistent accession numbers to each submission, and these identifiers resolve to the deposited digital objects via the repositories' web interfaces. #endif!$_DATAPLANT

As noted above, specialized repositories such as SRA/ENA and PRIDE/ProteomeXchange are the most common ones and will be used where appropriate. Unstructured, less standardized data (e.g. experimental phenotypic measurements) will be annotated with metadata and, when complete, given a digital object identifier (DOI)#if$_DATAPLANT, and the whole data sets wrapped into an ARC will receive DOIs as well#endif$_DATAPLANT.

Data:

Will all data be made openly available? If certain datasets cannot be shared (or need to be shared under restricted access conditions), explain why, clearly separating legal and contractual reasons from intentional restrictions. Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if opening their data goes against their legitimate interests or other constraints as per the Grant Agreement.

By default, all data sets from the $_PROJECT will be shared with the community and made openly available. However, before the data are released, all partners will be given an opportunity to check for potential IP (according to the consortium agreement and background IP rights). #if$_INDUSTRY This applies in particular to data pertaining to industry. #endif$_INDUSTRY IP protection will be prioritized for datasets that offer potential for exploitation.

Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if relevant provisions are made in the consortium agreement and are in line with the reasons for opting out.

If an embargo is applied to give time to publish or seek protection of the intellectual property (e.g. patents), specify why and how long this will apply, bearing in mind that research data should be made available as soon as possible.

#if$_early The data will be published as soon as possible to guarantee reusability. #endif$_early #if$_ipissue IP issues will be checked before publication. #endif$_ipissue All consortium partners will be encouraged to make data available before publication, openly and/or under pre-publication agreements#if$_GENOMIC such as those initiated in Fort Lauderdale and set forth by the Toronto International Data Release Workshop#endif$_GENOMIC. This will be implemented as soon as IP-related checks are complete.

Will the data be accessible through a free and standardized access protocol?

#if$_DATAPLANT DataPLANT stores data in the ARC, which is a Git repository. The DataHUB shares data and metadata as a GitLab instance. The Git and web protocols are open-source and freely accessible. In addition, #endif$_DATAPLANT Zenodo and the endpoint repositories will also be used for access. In general, web-based protocols are free and standardized.

If there are restrictions on use, how will access be provided to the data, both during and after the end of the project?

There are no restrictions, beyond the aforementioned IP checks, which are in line with e.g. European open data policies.

How will the identity of the person accessing the data be ascertained?

If data is only shared within the consortium, e.g. because it is not yet finished or under IP checks, it is hosted internally and a username and password are required (see also our GDPR rules). Once data is made public in the final EU or US repositories, completely anonymous access is normally allowed; this is the case for ENA, and both modes are in line with GDPR requirements. #if$_DATAPLANT Currently, data management relies on the Annotated Research Context (ARC). It is password protected, so authentication is required before any data can be obtained or samples generated. #endif$_DATAPLANT

Is there a need for a data access committee (e.g. to evaluate/approve access requests to personal/sensitive data)?

No. Since no personal or sensitive data beyond submitter metadata will be handled, there is no need for a data access committee.

Metadata:

Will metadata be made openly available and licenced under a public domain dedication CC0, as per the Grant Agreement? If not, please clarify why. Will metadata contain information to enable the user to access the data?

Yes, where possible; e.g. CC REL will be used for data not submitted to specialized repositories such as ENA.

How long will the data remain available and findable? Will metadata be guaranteed to remain available after data is no longer available?

The data will be made available for many years#if$_DATAPLANT, and ideally indefinitely after the end of the project#endif$_DATAPLANT. In any case, data submitted to repositories (as detailed above), e.g. ENA/PRIDE, will be subject to the local data storage regulations of those repositories.

Will documentation or reference about any software be needed to access or read the data be included? Will it be possible to include the relevant software (e.g. in open source code)?

#if$_PROPRIETARY The $_PROJECT relies on the tool(s) $_PROPRIETARY. #endif$_PROPRIETARY #if!$_PROPRIETARY No specialized software will be needed to access the data, usually just a modern browser. Access will be possible through web interfaces. For data processing after obtaining raw data, typical open-source software can be used. #endif!$_PROPRIETARY #if$_DATAPLANT DataPLANT offers tools such as the open-source SWATE plugin for Excel, the ARC commander, and the DMP tool, which make the interaction with data more convenient. DataPLANT resources are well described, and their setup is documented on their GitHub project pages. #endif$_DATAPLANT As stated above, we use publicly available, open-source, well-documented certified software#if$_PROPRIETARY, except for $_PROPRIETARY#endif$_PROPRIETARY.

2.3. Making data interoperable

What data and metadata vocabularies, standards, formats or methodologies will you follow to make your data interoperable to allow data exchange and re-use within and across disciplines? Will you follow community-endorsed interoperability best practices? Which ones?

As noted above, we foresee using minimal standards such as #if$_RNASEQ|$_GENOMIC #if$_MINSEQE MinSEQe for sequencing data, #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC MetaboLights-compatible forms for metabolites#if$_MIAPPE, and MIAPPE for phenotyping-like data#endif$_MIAPPE. The minimal information standards will allow the integration of data across projects and their reuse according to established and tested protocols. Specialized repositories will be used for common data types. Unstructured and less standardized data (e.g., experimental phenotypic measurements) will be annotated with metadata and, when complete, allocated a digital object identifier (DOI).#if$_DATAPLANT Whole datasets will also be wrapped into an ARC with allocated DOIs.#endif$_DATAPLANT Whenever possible, data will be stored in common and openly defined formats including all the necessary metadata to interpret and analyze data in a biological context. By default, no proprietary formats will be used. However, Microsoft Excel files (according to ISO/IEC 29500-1:2016) might be used as intermediates by the consortium#if$_DATAPLANT and by some ARC components#endif$_DATAPLANT. In addition, text files might be edited in word processors, but will be shared as PDF. Open ontologies will be used where they are mature. As stated above, some ontologies and controlled vocabularies might need to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the advanced ontologies developed in DataPLANT. #endif$_DATAPLANT

In case it is unavoidable that you use uncommon or generate project specific ontologies or vocabularies, will you provide mappings to more commonly used ontologies? Will you openly publish the generated ontologies or vocabularies to allow reusing, refining or extending them?

Common and open ontologies will be used; in fact, open biomedical ontologies will be used where they are mature. As stated in the previous question, ontologies and controlled vocabularies might sometimes have to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the DataPLANT biology ontology (DPBO) developed in DataPLANT. #endif$_DATAPLANT Ontology databases such as the OBO Foundry will be used to publish ontologies. #if$_DATAPLANT The DPBO is also published on GitHub: https://github.com/nfdi4plants/nfdi4plants_ontology #endif$_DATAPLANT

Will your data include qualified references to other data (e.g. other data from your project, or datasets from previous research)?

References to other data will be made in the form of DOIs and ontology terms.

2.4. Increase data re-use

How will you provide documentation needed to validate data analysis and facilitate data re-use (e.g. readme files with information on methodology, codebooks, data cleaning, analyses, variable definitions, units of measurement, etc.)?

The documentation will be provided in the form of ISA (Investigation, Study, Assay) and CWL (Common Workflow Language) files. #if$_DATAPLANT Here, the $_PROJECT will build on the ARC container, which includes all the data, metadata, and documentation. #endif$_DATAPLANT
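A CWL description of an analysis step might look like the following minimal sketch. The tool wrapped here (counting lines of a data file) and the file names are placeholders, not part of a $_PROJECT pipeline; they only illustrate the structure of a CWL CommandLineTool.

```yaml
# Minimal CWL CommandLineTool sketch (illustrative placeholder step).
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  data_file:
    type: File
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
stdout: line_count.txt
```

Because CWL captures the command, its inputs and its outputs declaratively, such files document how derived data were produced and allow the step to be re-run for validation.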

Will your data be made freely available in the public domain to permit the widest re-use possible? Will your data be licensed using standard reuse licenses, in line with the obligations set out in the Grant Agreement?

Yes, our data will be made freely available in the public domain to permit the widest re-use possible. Open licenses, such as Creative Commons (CC), will be used whenever possible.

Will the data produced in the project be useable by third parties, in particular after the end of the project?

There will be no restrictions once the data is made public.

Will the provenance of the data be thoroughly documented using the appropriate standards? Describe all relevant data quality assurance processes.

The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process are absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand principles but also to inform further analyses. Stakeholders must also be informed about the provenance of the data. It is therefore necessary to ensure that the data are well generated and well annotated with metadata using open standards, as laid out in the next section.

Describe all relevant data quality assurance processes. Further to the FAIR principles, DMPs should also address research outputs other than data, and should carefully consider aspects related to the allocation of resources, data security and ethical aspects.

The data will be checked and curated through data collection protocols, personnel training, data cleaning, data analysis, and quality control. #if$_DATAPLANT Furthermore, data will be analyzed for quality control (QC) problems using automatic procedures as well as manual curation. #endif$_DATAPLANT All data quality assurance processes, including the data collection protocol, data cleaning procedures, data analysis techniques, and quality control measures, will be documented. This documentation will be kept for future reference and made available to stakeholders upon request.
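An automatic QC pass of the kind described above can be sketched as follows. The column name (`glucose_mM`) and the plausibility range are invented for illustration; real thresholds would come from the project's SOPs.

```python
# Sketch of an automatic QC pass over tabular measurements: missing values,
# range checks and duplicate sample IDs. Column names and thresholds are
# illustrative, not project-defined limits.

def qc_report(rows: list[dict], value_key: str, lo: float, hi: float) -> list[str]:
    issues = []
    seen_ids = set()
    for row in rows:
        sid = row.get("sample_id")
        if sid in seen_ids:
            issues.append(f"duplicate sample_id: {sid}")
        seen_ids.add(sid)
        value = row.get(value_key)
        if value is None:
            issues.append(f"{sid}: missing {value_key}")
        elif not lo <= value <= hi:
            issues.append(f"{sid}: {value_key}={value} outside [{lo}, {hi}]")
    return issues

rows = [
    {"sample_id": "S1", "glucose_mM": 4.2},
    {"sample_id": "S2", "glucose_mM": None},
    {"sample_id": "S2", "glucose_mM": 95.0},
]
print(qc_report(rows, "glucose_mM", 0.0, 50.0))
# -> ['S2: missing glucose_mM', 'duplicate sample_id: S2', 'S2: glucose_mM=95.0 outside [0.0, 50.0]']
```

Emitting a flat list of human-readable issue strings keeps the automatic pass easy to log and easy to hand over to manual curation.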

3    Other research outputs

In addition to the management of data, beneficiaries should also consider and plan for the management of other research outputs that may be generated or re-used throughout their projects. Such outputs can be either digital (e.g. software, workflows, protocols, models, etc.) or physical (e.g. new materials, antibodies, reagents, samples, etc.).

In the current data management plan, any digital output, including but not limited to software, workflows, protocols, models, documents, templates and notebooks, is treated as data. Therefore, all aforementioned digital objects are already described in detail. For non-digital objects, the data management plan will be closely connected to the digitalisation of the physical objects. #if$_DATAPLANT $_PROJECT will build a workflow that connects the ARC with an electronic lab notebook in order to also manage the physical objects. #endif$_DATAPLANT

Beneficiaries should consider which of the questions pertaining to FAIR data above, can apply to the management of other research outputs, and should strive to provide sufficient detail on how their research outputs will be managed and shared, or made available for re-use, in line with the FAIR principles.

Open licenses, such as Creative Commons (CC), will be used whenever possible, also for the other digital objects.

4.    Allocation of resources

What will the costs be for making data or other research outputs FAIR in your project (e.g. direct and indirect costs related to storage, archiving, re-use, security, etc.)?

The $_PROJECT will bear the costs of data curation, #if$_DATAPLANT ARC consistency checks, #endif$_DATAPLANT and data maintenance/security before transfer to public repositories. Subsequent costs are then borne by the operators of these repositories.

Additionally, costs for post-publication storage are incurred by the end-point repositories (e.g. ENA) and are covered by the operating budgets of these repositories, not charged to the $_PROJECT or its members.

How will these be covered? Note that costs related to research data/output management are eligible as part of the Horizon Europe grant (if compliant with the Grant Agreement conditions)

The costs borne by the $_PROJECT are covered by the project funding. Pre-existing structures#if$_DATAPLANT, such as the structures, tools, and knowledge laid down in the DataPLANT consortium,#endif$_DATAPLANT will also be used.

Who will be responsible for data management in your project?

The responsible person will be $_DATAOFFICER of the $_PROJECT.

How will long term preservation be ensured? Discuss the necessary resources to accomplish this (costs and potential value, who decides and how, what data will be kept and for how long)?

The data officer #if$_PARTNERS or $_PARTNERS #endif$_PARTNERS will ultimately decide on the strategy to preserve data that are not submitted to end-point subject-area repositories #if$_DATAPLANT or ARCs in DataPLANT #endif$_DATAPLANT when the project ends. This will be in line with EU guidelines, institute policies, and data sharing based on EU and international standards.

5.    Data security

What provisions are or will be in place for data security (including data recovery as well as secure storage/archiving and transfer of sensitive data)?

Online platforms will be protected by vulnerability scanning, two-factor authentication and daily automatic backups allowing immediate recovery. All partners holding confidential project data are required to use secure platforms with automatic backups and offsite secure copies. #if$_DATAPLANT As the DataHUB and ARCs have been generated in DataPLANT, data security will be enforced there. This comprises secure storage and username/password authentication, with credentials generally transferred via separate, secure media. #endif$_DATAPLANT
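Verifying that backup copies are intact is commonly done by comparing cryptographic checksums, as in the following sketch (standard-library only; the file layout in the usage example is invented for illustration).

```python
# Sketch: verifying backup integrity with SHA-256 checksums.
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file in chunks so large datasets do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(original: Path, backup: Path) -> bool:
    """A backup copy is intact iff its checksum matches the original's."""
    return sha256_of(original) == sha256_of(backup)

# Usage sketch with temporary files standing in for dataset and backup:
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data.csv"
    dst = Path(tmp) / "data.backup.csv"
    src.write_bytes(b"sample,value\nS1,4.2\n")
    dst.write_bytes(src.read_bytes())
    print(verify_backup(src, dst))  # -> True
```

Storing the checksum alongside the deposited file also lets downstream users confirm that a transfer from the repository was not corrupted.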

Will the data be safely stored in trusted repositories for long term preservation and curation?

Wherever certified repositories exist, these will be used as end-point repositories. #if$_RNASEQ Transcriptomics data and gene sequence data will also be made available upon publication via the standard ENA/SRA archives, #endif$_RNASEQ #if$_METABOLOMIC metabolite data in e.g. MetaboLights (and/or nationwide repositories like the German NFDI or the French INRAe), #endif$_METABOLOMIC #if$_PROTEOMIC and proteomics data in e.g. PRIDE/ProteomeXchange #endif$_PROTEOMIC. In addition, the national resources will maintain safekeeping of data after the project ends. #if$_DATAPLANT Because databases such as ProteomeXchange do not support deep plant-specific metadata, ARCs will be maintained to ensure the reusability of plant-specific metadata. #endif$_DATAPLANT

6.    Ethics

Are there, or could there be, any ethics or legal issues that can have an impact on data sharing? These can also be discussed in the context of the ethics review. If relevant, include references to ethics deliverables and ethics chapter in the Description of the Action (DoA).

At the moment, we do not anticipate ethical or legal issues with data sharing. In terms of ethics, since this is plant data, there is no need for an ethics committee; however, due diligence for plant resource benefit sharing is considered (see the Nagoya Protocol). #issuewarning You have to check and enter any due diligence here; at the moment it is still open whether Nagoya will also cover sequence information. In any case, if you use material not from your (partner) country and characterize it physically or biochemically (e.g. metabolites, proteome, RNASeq), this might represent a Nagoya-relevant action, unless the material is from e.g. the US (not a party) or Ireland (not yet signed; still contact them), though other laws might apply. #endissuewarning

Will informed consent for data sharing and long term preservation be included in questionnaires dealing with personal data?

The only personal data that will potentially be stored are the submitter name and affiliation in the metadata. In addition, personal data will be collected for dissemination and communication activities using specific methods and procedures developed by the $_PROJECT partners to adhere to data protection. #issuewarning You need to inform people, and preferably obtain WRITTEN consent, if you store emails and names or even pseudonyms such as Twitter handles. We are very sorry about these issues; we did not invent them. #endissuewarning

7.    Other issues

Do you, or will you, make use of other national/funder/sectorial/departmental procedures for data management? If yes, which ones (please list and briefly describe them)?

Yes, the $_PROJECT will use common Research Data Management (RDM) tools#if$_DATAPLANT and in particular resources developed by the NFDI of Germany#endif$_DATAPLANT.

8     Annexes

8.1     Abbreviations

#if$_DATAPLANT

ARC Annotated Research Context

#endif$_DATAPLANT

CC Creative Commons

CC REL Creative Commons Rights Expression Language

DDBJ DNA Data Bank of Japan

DMP Data Management Plan

DoA Description of Action

DOI Digital Object Identifier

EBI European Bioinformatics Institute

ENA European Nucleotide Archive

EU European Union

FAIR Findable Accessible Interoperable Reusable

GDPR General data protection regulation (of the EU)

IP Intellectual Property

ISO International Organization for Standardization

MIAMET Minimal Information about Metabolite experiment

MIAPPE Minimal Information about Plant Phenotyping Experiment

MinSEQe Minimum Information about a high-throughput Sequencing Experiment

NCBI National Center for Biotechnology Information

NFDI National Research Data Infrastructure (of Germany)

NGS Next Generation Sequencing

RDM Research Data Management

RNASeq RNA Sequencing

SOP Standard Operating Procedures

SRA Sequence Read Archive

#if$_DATAPLANT

SWATE Swate Workflow Annotation Tool for Excel

#endif$_DATAPLANT

ONP Oxford Nanopore

qRTPCR quantitative real time polymerase chain reaction

WP Work Package


Data Management Plan of the DFG Project $_PROJECT

1.    Data description

1.1    Introduction

#if$_EU

The $_PROJECT is part of the Open Data Initiative (ODI) of the EU. #endif$_EU To best profit from open data, it is necessary not only to store data but to make it Findable, Accessible, Interoperable and Reusable (FAIR). #if$_PROTECT We support open and FAIR data; however, we also consider the need to protect individual data sets. #endif$_PROTECT

The aim of this document is to outline the principles guiding data management in the $_PROJECT and to specify what data will be stored. The responses to the DFG Data Management Plan (DMP) checklist are used to generate this DMP document.

The detailed DMP describes how data will be handled during and after the project. The $_PROJECT DMP is structured according to the DFG data management checklist. #if$_UPDATE Its validity will be checked and it will be updated several times during the $_PROJECT; at the very least, this will happen at month $_UPDATEMONTH. #endif$_UPDATE

1.2    How does your project generate new data?

Data of different types or of different domains will be generated differently. For example:

    #if$_RNASEQ
  • Short-read sequencing will either be performed in-house or outsourced, and raw data will be received.

  • #endif$_RNASEQ #if$_METABOLOMIC
  • Metabolomic data will be generated mostly using chromatography coupled to mass spectrometry and enzyme platforms.

  • #endif$_METABOLOMIC #if$_PROTEOMIC
  • Proteomic data will be generated using an EU platform which is in line with community standards.

  • #endif$_PROTEOMIC #if$_IMAGE
  • Image data will be generated using equipment (cameras, scanners, and microscopes) or software. Original images, which contain metadata such as EXIF photo information, will be archived.

  • #endif$_IMAGE #if$_GENOMIC
  • Genomic data will be created from sequencing data. The sequencing data will be collected by Next Generation Sequencing (NGS) equipment#if$_PARTNERS or obtained from partners#endif$_PARTNERS. The sequencing data will then be processed to obtain the genomic data.

  • #endif$_GENOMIC #if$_GENETIC
  • Genetic data will be generated by using Next Generation Sequencing (NGS) equipment.

  • #endif$_GENETIC #if$_TARGETED
  • Targeted assays (e.g. glucose and fructose content) will be generated using specific equipment or experiments. The procedure is fully documented in the lab book.

  • #endif$_TARGETED #if$_MODELS
  • Model data will be generated by software simulations. The complete workflow, including the environment, runtime, parameters and results, will be documented and archived.

  • #endif$_MODELS #if$_CODE
  • Code will be generated by the project's programmers.

  • #endif$_CODE #if$_EXCEL
  • Excel data will be generated by experimentalists or data analysts using Office or open-source software.

  • #endif$_EXCEL #if$_CLONED-DNA
  • The cloned DNA data will be generated by using a sequencing tool.

  • #endif$_CLONED-DNA #if$_PHENOTYPIC
  • Phenotypic data will be generated using phenotyping platforms.

  • #endif$_PHENOTYPIC

The $_PROJECT has the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process are absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand principles but also to inform further analyses. Stakeholders must also be informed about the provenance of the data. It is therefore necessary to ensure that the data are well generated and well annotated with metadata using open standards, as laid out in the next section.

Public data will be extracted as described in paragraph 1.3. For the $_PROJECT, specific data sets will be generated by the consortium partners.

1.3    Is existing data reused?

The project builds on and relies on existing data sets. #if$_RNASEQ For instance, without a proper genomic reference it is very difficult to analyze NGS data sets. #endif$_RNASEQ It is also important to include existing data sets on the expression and metabolic behaviour of $_STUDYOBJECT, as well as existing characterizations and background knowledge#if$_PARTNERS of the partners#endif$_PARTNERS. Genomic references can simply be gathered from reference databases for genomes/sequences, such as the National Center for Biotechnology Information: NCBI (US); European Bioinformatics Institute: EBI (EU); and DNA Data Bank of Japan: DDBJ (JP). Furthermore, prior 'unstructured' data in the form of publications and the data contained therein will be used for decision making.

1.4    Which data types (in terms of data formats like image data, text data or measurement data) arise in your project and in what way are they further processed?

We foresee that at least the following data about $_STUDYOBJECT will be collected and generated: $_PHENOTYPIC, $_GENETIC, $_GENOMIC, $_METABOLOMIC, $_RNASEQ, $_IMAGE, $_PROTEOMIC, $_TARGETED, $_MODELS, $_CODE, $_EXCEL, $_CLONED-DNA and result data. Furthermore, data derived from the original raw data sets will also be collected. This is important, as different analytical pipelines might yield different results or include ad-hoc data analysis steps#if$_DATAPLANT, and these pipelines will be tracked in the DataPLANT ARC#endif$_DATAPLANT. Therefore, specific care will be taken to document and archive these resources (including the analytic pipelines) as well#if$_DATAPLANT, relying on the vast expertise in the DataPLANT consortium#endif$_DATAPLANT.

1.5    To what extent do these arise or what is the anticipated data volume?

We expect to generate raw data in the range of $_RAWDATA GB. The size of the derived data will be about $_DERIVEDDATA GB.

2.    Documentation and data quality

2.1.    What approaches are being taken to describe the data in a comprehensible manner (such as the use of available metadata, documentation standards or ontologies)?

As noted above, we foresee using minimal standards such as #if$_RNASEQ|$_GENOMIC #if$_MINSEQE MinSEQe for sequencing data, #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC MetaboLights-compatible forms for metabolites#if$_MIAPPE, and MIAPPE for phenotyping-like data#endif$_MIAPPE. The minimal information standards will allow the integration of data across projects and their reuse according to established and tested protocols. Specialized repositories will be used for common data types. Unstructured and less standardized data (e.g., experimental phenotypic measurements) will be annotated with metadata and, when complete, allocated a digital object identifier (DOI).#if$_DATAPLANT Whole datasets will also be wrapped into an ARC with allocated DOIs.#endif$_DATAPLANT Whenever possible, data will be stored in common and openly defined formats including all the necessary metadata to interpret and analyze data in a biological context. By default, no proprietary formats will be used. However, Microsoft Excel files (according to ISO/IEC 29500-1:2016) might be used as intermediates by the consortium#if$_DATAPLANT and by some ARC components#endif$_DATAPLANT. In addition, text files might be edited in word processors, but will be shared as PDF.

We will use the Investigation, Study, Assay (ISA) specification for metadata creation. #if$_RNASEQ|$_GENOMIC For specific data (e.g., RNASeq or genomic data), we use metadata templates from the end-point repositories. #if$_MINSEQE The Minimum Information about a high-throughput Sequencing Experiment (MinSEQe) will also be used. #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC #if$_METABOLOMIC MetaboLights-compliant submission standards will be used for metabolomic data where this is accepted by the consortium partners. #issuewarning Some metabolomics partners do not consider MetaboLights an accepted standard. #endissuewarning #endif$_METABOLOMIC As part of the plant research community, we use #if$_MIAPPE MIAPPE for phenotyping data in the broadest sense, but we will also rely on #endif$_MIAPPE specific SOPs for additional annotations#if$_DATAPLANT that consider advanced DataPLANT annotation and ontologies#endif$_DATAPLANT.
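To illustrate the ISA approach, a schematic fragment of an ISA-Tab investigation file is shown below. The section headers follow the ISA-Tab layout, values are tab-separated; all concrete values here are placeholders, and the full set of required fields is defined by the ISA-Tab specification.

```
# Schematic ISA-Tab fragment (i_investigation.txt); values are placeholders.
STUDY
Study Identifier	"$_PROJECT-study-1"
Study Title	"Metabolite profiling of $_STUDYOBJECT"
STUDY ASSAYS
Study Assay Measurement Type	"metabolite profiling"
Study Assay Technology Type	"mass spectrometry"
Study Assay File Name	"a_metabolite_profiling.txt"
```

The investigation file ties together the study-level metadata and the assay files, which in turn reference the sample-level annotations.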

#if$_OTHERSTANDARDS Other standards will also be used, such as $_OTHERSTANDARDINPUT. #endif$_OTHERSTANDARDS

Open ontologies will be used where they are mature. As stated above, some ontologies and controlled vocabularies might need to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the advanced ontologies developed in DataPLANT. #endif$_DATAPLANT Keywords about the experiment and the general consortium will be included, as well as an abstract about the data, where useful. In addition, certain keywords can be auto-generated from dense metadata and its underlying ontologies. #if$_DATAPLANT Here, DataPLANT strives to complement these with standardized DataPLANT ontologies that are supplemented where the ontology does not yet include the variables. #endif$_DATAPLANT


2.2    What measures are being adopted to ensure high data quality?

The $_PROJECT pursues the following aim: $_PROJECTAIM. Therefore, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT using the DataPLANT ARC structure are absolutely necessary #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process is absolutely necessary #endif!$_DATAPLANT because the data are used not only to understand principles but also for further analysis. Stakeholders must also be informed about the provenance of the data. It is therefore necessary to ensure that the data are properly generated and well annotated with metadata using open standards. Data variables will be allocated standard names. For example, genes, proteins and metabolites will be named according to approved nomenclature and conventions. These will also be linked to functional ontologies where possible. Datasets will be named in a meaningful way to ensure readability by humans. Plant names will include traditional names, binomials, and all strain/cultivar/subspecies/variety identifiers.
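Meaningful dataset names can be made machine-checkable. The sketch below validates file names against a hypothetical convention; the actual pattern must be defined in the project's SOPs, so both the pattern and the example names are assumptions for illustration.

```python
import re

# Hypothetical naming convention (illustration only):
#   <project>_<assay>_<sample>_<YYYY-MM-DD>_v<version>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[A-Za-z0-9]+)_"
    r"(?P<assay>[a-z]+)_"
    r"(?P<sample>[A-Za-z0-9-]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"v(?P<version>\d+)\.(?P<ext>[a-z0-9.]+)$"
)

def check_name(filename: str) -> bool:
    """Return True if the file name follows the naming convention."""
    return NAME_PATTERN.match(filename) is not None
```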

To maintain data integrity and to be able to re-analyze data, data sets will get version numbers where this is useful (e.g., raw data is considered immutable: it must not be changed and therefore does not receive a version number). #if$_DATAPLANT This is automatically supported by the ARC Git DataPLANT infrastructure. #endif$_DATAPLANT
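One common way to make the immutability of raw data verifiable is to record cryptographic checksums next to the data. The sketch below is an illustration of that practice under stated assumptions, not the project's mandated tooling: it writes a SHA-256 manifest that can later be re-computed and compared.

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Compute the SHA-256 checksum of a file in streaming fashion."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: Path, manifest: Path) -> None:
    """Record one '<checksum>  <relative path>' line per raw-data file."""
    lines = [
        f"{sha256sum(f)}  {f.relative_to(data_dir)}"
        for f in sorted(data_dir.rglob("*"))
        if f.is_file()
    ]
    manifest.write_text("\n".join(lines) + "\n")
```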

As mentioned above, we foresee using e.g. #if$_RNASEQ|$_GENOMIC #if$_MINSEQE MinSEQe for sequencing data and #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC Metabolights-compatible forms for metabolites#if$_MIAPPE as well as MIAPPE for phenotyping-like data#endif$_MIAPPE. These standards will allow the integration of data across projects and safeguard reuse according to established and tested protocols. Additionally, we will use ontology terms to enrich the data sets, relying on free and open ontologies. Further ontology terms might be created and canonized during the $_PROJECT.

2.3    Are quality controls in place and if so, how do they operate?

The data will be checked and curated throughout the project period. #if$_DATAPLANT Furthermore, data will be analyzed for quality control (QC) problems using automatic procedures as well as manual curation. #endif$_DATAPLANT PhD students and lab professionals will be responsible for first-hand quality control. Afterwards, the data will be checked and annotated by $_DATAOFFICER. #if$_RNASEQ|$_GENOMIC FastQC will be run on the base-called data. #endif$_RNASEQ|$_GENOMIC Before publication, the data will be controlled again.
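Automated QC of the kind described above can be complemented by simple in-house checks. The sketch below is illustrative only, with a hypothetical quality threshold: it flags FASTQ reads whose mean Phred score is low, mirroring the kind of per-read statistics FastQC reports.

```python
from statistics import mean

def mean_phred(qualities: str, offset: int = 33) -> float:
    """Mean Phred score of one FASTQ quality string (Phred+33 by default)."""
    return mean(ord(ch) - offset for ch in qualities)

def flag_low_quality(fastq_lines: list[str], threshold: float = 20.0) -> list[str]:
    """Return the IDs of reads whose mean quality falls below the threshold."""
    flagged = []
    # FASTQ records are groups of four lines: @id, sequence, '+', quality
    for i in range(0, len(fastq_lines), 4):
        read_id, quality = fastq_lines[i], fastq_lines[i + 3]
        if mean_phred(quality) < threshold:
            flagged.append(read_id.lstrip("@"))
    return flagged
```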

2.4    Which digital methods and tools (e.g. software) are required to use the data?

The $_PROJECT will use common Research Data Management (RDM) tools#if$_DATAPLANT and in particular resources developed by the NFDI of Germany#endif$_DATAPLANT.

#if$_PROPRIETARY The $_PROJECT relies on the tool(s) $_PROPRIETARY. #endif$_PROPRIETARY

#if!$_PROPRIETARY No specialized software will be needed to access the data, usually just a modern browser. Access will be possible through web interfaces. For data processing after obtaining raw data, typical open-source software can be used. As no proprietary software is needed, no documentation needs to be provided. #endif!$_PROPRIETARY

#if$_DATAPLANT However, DataPLANT resources are well described, and their setup is documented on their GitHub project pages. #endif$_DATAPLANT

#if$_DATAPLANT DataPLANT offers tools such as the open-source SWATE plugin for Excel, the ARC commander, and the DMP tool, which are not strictly required but make the interaction with data more convenient. #endif$_DATAPLANT

As stated above, we use publicly available, open-source and well-documented certified software #if$_PROPRIETARY except for $_PROPRIETARY #endif$_PROPRIETARY.

3.    Storage and technical archiving during the project

3.1    How is the data to be stored and archived throughout the project duration?

Wherever certified repositories exist, these will be used as end-point repositories. #if$_RNASEQ Transcriptomics and gene sequence data will be made available upon publication via the standard repositories ENA/SRA, #endif$_RNASEQ #if$_METABOLOMIC metabolite data in e.g. Metabolights (and/or nationwide repositories like the German NFDI or the French INRAe), #endif$_METABOLOMIC #if$_PROTEOMIC proteomics data in e.g. PRIDE/ProteomeXchange #endif$_PROTEOMIC. In addition, the national resource will maintain safekeeping of the data after the project ends. #if$_DATAPLANT Databases like ProteomeXchange do not support deep plant-specific metadata; hence ARCs will be maintained to ensure reusability. #endif$_DATAPLANT

Data will be made available for many years#if$_DATAPLANT and potentially indefinitely after the end of the project#endif$_DATAPLANT.

In any case, data submitted to international discipline-specific repositories that use specialized technologies (as detailed above), e.g. ENA or PRIDE, will be subject to the repositories' local data storage regulations.

3.2    What is in place to secure sensitive data throughout the project duration (access and usage rights)?

#if$_DATAPLANT In DataPLANT, data management relies on the Annotated Research Context (ARC). It is password protected, so authentication is required before any data can be obtained or samples generated. #endif$_DATAPLANT

If data is shared only within the consortium, e.g., because it is not yet finished or under IP checks, the data is hosted internally and a username and password are required (see also our GDPR rules). Once data is made public in final EU or US repositories, completely anonymous access is normally allowed. This is the case for ENA as well, and both modes are in line with GDPR requirements.

There will be no restrictions once the data is made public.

4.    Legal obligations and conditions

4.1    What are the legal specifics associated with the handling of research data in your project?

At the moment, we do not anticipate ethical or legal issues with data sharing. In terms of ethics, since this is plant data, there is no need for an ethics committee; however, due diligence for plant resource benefit sharing is considered (see the Nagoya Protocol). #issuewarning you have to check and enter any due-diligence actions here; at the moment it is still open whether Nagoya will also cover sequence information. In any case, if you use material not from your (partner) country and characterize it physically, e.g., via metabolites, proteome, or biochemically via RNASeq etc., this might represent a Nagoya-relevant action unless the material is from e.g. the US (not a party) or Ireland (not ratified, still contact them), but other laws might apply. #endissuewarning

The only personal data that will potentially be stored is the submitter name and affiliation in the metadata. In addition, personal data will be collected for dissemination and communication activities using specific methods and procedures developed by the $_PROJECT partners to adhere to data protection. #issuewarning you need to inform the persons concerned, and preferably obtain WRITTEN consent, that you store emails and names or even pseudonyms such as Twitter handles; we are very sorry about these issues, we did not invent them #endissuewarning

4.2    Do you anticipate any implications or restrictions regarding subsequent publication or accessibility?

Once data is transferred to the $_PROJECT platform#if$_DATAPLANT and ARCs have been generated in DataPLANT#endif$_DATAPLANT, data security measures will be imposed. These comprise secure storage and the use of passwords and usernames, which are generally transferred via separate secure media.

4.3    What is in place to consider aspects of use and copyright law as well as ownership issues?

Open licenses, such as Creative Commons (CC), will be used whenever possible.

4.4    Are there any significant research codes or professional standards to be taken into account?

Whenever possible, data will be stored in common and openly defined formats including all the metadata necessary to interpret and analyze the data in a biological context. By default, no proprietary formats will be used; however, Microsoft Excel files (according to ISO/IEC 29500-1:2016) might be used as intermediates by the consortium#if$_DATAPLANT and by some ARC components#endif$_DATAPLANT. In addition, text files might be edited in word-processor software, but will be shared as PDF.

5.    Data exchange and long-term data accessibility

5.1    Which data sets are especially suitable for use in other contexts?

The data will be useful for the $_PROJECT partners, the scientific community working on $_STUDYOBJECT, and the general public interested in $_STUDYOBJECT. Hence, the $_PROJECT also strives to collect the data that has been disseminated and potentially advertise it#if$_DATAPLANT, e.g. through the DataPLANT platform or other means,#endif$_DATAPLANT where it is not already included in a publication, which is the most likely form of dissemination.

5.2    Which criteria are used to select research data to make it available for subsequent use by others?

By default, all data sets from the $_PROJECT will be shared with the community and made openly available, though only after partners have had the opportunity to check for IP protection (according to agreements and background rights). #if$_INDUSTRY This applies in particular to data pertaining to industry. #endif$_INDUSTRY All partners also strive for IP protection of data sets where applicable; this will be checked and due diligence applied.

Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if relevant provisions are made in the consortium agreement and are in line with the reasons for opting out.

5.3    Are you planning to archive your data in a suitable infrastructure?

#if$_DATAPLANT As the $_PROJECT is closely aligned with DataPLANT, the ARC converter and DataHUB will be used to find the end-point repositories and upload to the repositories automatically. #endif$_DATAPLANT

#if!$_DATAPLANT Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. In addition, data that can be stored in international discipline-specific repositories using specialized technologies (sequencing at the #if$_NCBI national US center NCBI: #endif$_NCBI #if$_GEO Gene Expression Omnibus: GEO; #endif$_GEO European Bioinformatics Institute (EBI) archives: #if$_ENA European Nucleotide Archive: ENA; #endif$_ENA #if$_ARRAYEXPRESS Functional Genomics Data Archive: ArrayExpress; #endif$_ARRAYEXPRESS #if$_PRIDE proteome database: PRIDE; #endif$_PRIDE #if$_METABOLIGHTS metabolomic database: MetaboLights; #endif$_METABOLIGHTS #if$_OTHEREP and $_OTHEREP #endif$_OTHEREP) will be stored and processed in those repositories as well.#endif!$_DATAPLANT

As noted above, specialized repositories like SRA/ENA and PRIDE/ProteomeXchange are the most common ones and will be used where appropriate. Unstructured, less standardized data (e.g., experimental phenotypic measurements) will be annotated with metadata and, if complete, given a digital object identifier (DOI). #if$_DATAPLANT Whole data sets wrapped into an ARC will receive DOIs as well. #endif$_DATAPLANT

Submission is free of charge, and it is the goal (at least of ENA) to obtain as much data as possible. Therefore, special arrangements are neither necessary nor useful, and catch-all repositories are not required. #if$_DATAPLANT For DataPLANT, this has been agreed upon. #endif$_DATAPLANT #issuewarning if no data management platform such as DataPLANT is used, you need to find an appropriate repository to store or archive your data after publication. #endissuewarning

5.4    If so, how and where? Are there any retention periods?

There are no restrictions, beyond the aforementioned IP checks, which are in line with e.g. European open data policies.

The $_PARTNERS decide on the preservation of data not submitted to end-point subject-area repositories #if$_DATAPLANT or ARCs in DataPLANT#endif$_DATAPLANT after the project end. This will be in line with EU institute policies and data sharing based on EU and international standards.

5.5    When is the research data available for use by third parties?

#if$_early The data will be published as soon as possible to guarantee reusability. #endif$_early #if$_ipissue In general, IP issues will first be checked. #endif$_ipissue All consortium partners will be encouraged to make data available prior to publication openly and/or under pre-publication agreements #if$_GENOMIC such as those started in Fort Lauderdale and set forth by the Toronto International Data Release Workshop. #endif$_GENOMIC

6.    Responsibilities and resources

6.1    Who is responsible for adequate handling of the research data (description of roles and responsibilities within the project)?

The responsible person will be $_DATAOFFICER as Data Officer. The data responsible(s) (data officer#if$_PARTNERS or $_PARTNERS#endif$_PARTNERS) decide on the preservation of data not submitted to end-point subject-area repositories #if$_DATAPLANT or ARCs in DataPLANT #endif$_DATAPLANT after the project end. This will be in line with EU institute policies and data sharing based on EU and international standards.

6.2    Which resources (costs; time or other) are required to implement adequate handling of research data within the project?

The costs comprise data curation, #if$_DATAPLANT ARC consistency checks, #endif$_DATAPLANT and maintenance on the $_PROJECT´s side.

Additionally, final storage costs are incurred by the end-point repositories (e.g. ENA); these are not charged to the $_PROJECT or its members but are covered by the operating budgets of these repositories.

A large part of the cost is covered by the $_PROJECT #if$_DATAPLANT and the structures, tools and knowledge laid down in the DataPLANT consortium. #endif$_DATAPLANT

6.3    Who is responsible for curating the data once the project has ended?

As applicable, $_DATAOFFICER, who is responsible for ongoing data maintenance, will also take care of it after the end of the $_PROJECT. #if$_DATAPLANT External data archives such as DataPLANT may provide such services in some cases. #endif$_DATAPLANT

7     Annexes

7.1     Abbreviations

#if$_DATAPLANT

ARC Annotated Research Context

#endif$_DATAPLANT

CC Creative Commons

CC REL Creative Commons Rights Expression Language

DDBJ DNA Data Bank of Japan

DMP Data Management Plan

DoA Description of Action

DOI Digital Object Identifier

EBI European Bioinformatics Institute

ENA European Nucleotide Archive

EU European Union

FAIR Findable Accessible Interoperable Reusable

GDPR General data protection regulation (of the EU)

IP Intellectual Property

ISO International Organization for Standardization

MIAMET Minimal Information About a Metabolomics Experiment

MIAPPE Minimal Information about Plant Phenotyping Experiment

MinSEQe Minimum Information about a high-throughput Sequencing Experiment

NCBI National Center for Biotechnology Information

NFDI National Research Data Infrastructure (of Germany)

NGS Next Generation Sequencing

RDM Research Data Management

RNASeq RNA Sequencing

SOP Standard Operating Procedures

SRA Sequence Read Archive

#if$_DATAPLANT

SWATE Swate Workflow Annotation Tool for Excel

#endif$_DATAPLANT

ONP Oxford Nanopore

qRTPCR quantitative real time polymerase chain reaction

WP Work Package

Practical Data Management Guide of the $_PROJECT

This practical guide of data management in the $_PROJECT should be considered a minimum description, leaving flexibility to include additional actions specific to the domain or required by national or local legislation. #if$_EU The $_PROJECT will follow the EU FAIR principles. #endif$_EU


The practical guide of data management in the $_PROJECT aims at providing a complete walkthrough for the researcher. The contents are customized based on the user input in the Data Management Plan Generator (DMPG). The practices in this guide are customized to fit related legal, ethical, standardization and funding body requirements. These practices cover all steps of the data management life cycle:


  1. Data acquisition:

    1.1 Data generation

Data should be generated by devices that support open formats wherever possible. The $_STUDYOBJECT should be compliant with biodiversity protocols. The protocols used to collect $_PHENOTYPIC, $_GENETIC, $_GENOMIC, $_METABOLOMIC, $_RNASEQ data about $_STUDYOBJECT will be stored#if$_DATAPLANT in the assays folder of the ARC repositories.#endif$_DATAPLANT#if!$_DATAPLANT in a FAIR data storage.#endif!$_DATAPLANT

    1.2 Data collection

The data collection process is conducted by experimental scientists and stewarded by $_DATAOFFICER. #if$_DATAPLANT An electronic lab notebook will be used to ensure that enough metadata is recorded and to guarantee that the data can be reused. #endif$_DATAPLANT

    1.3 Data organization

The data organization process is conducted by $_DATAOFFICER. The detailed organization method and procedure are reported to the PIs. #if$_DATAPLANT The data organization will profit from the knowledge base and database of DataPLANT; Elasticsearch will be used to find better ways to organize the data. #endif$_DATAPLANT



  2. Annotation

    2.1 Workflow documentation

The data collection process is conducted by experimental scientists and stewarded by $_DATAOFFICER. #if$_DATAPLANT An electronic lab notebook will be used to ensure that enough metadata is recorded and to guarantee that the data can be reused. The workflow can be retrieved from the electronic notebook using the toolkits provided by DataPLANT, such as SWATE and the ARC commander. #endif$_DATAPLANT

    2.2 Metadata completion

Some metadata may still be missing from the documentation provided by the experimental scientists and the data officer. #if$_DATAPLANT Raw-data identifiers and parsers provided by DataPLANT will be used to obtain metadata directly from the raw data files. The metadata collected from the raw data files can also be used to validate the previously collected metadata in case there are any mistakes. #endif$_DATAPLANT We foresee using #if$_RNASEQ|$_GENOMIC e.g. #if$_MINSEQE MinSEQe for sequencing data and #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC Metabolights-compatible forms for metabolites as well as MIAPPE for phenotyping-like data. These standards will allow the integration of data across projects and safeguard reuse according to established and tested protocols. Additionally, we will use ontology terms to enrich the data sets, relying on free and open ontologies. Further ontology terms might be created and canonized during the $_PROJECT.
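As a sketch of how metadata can be recovered directly from raw data files, the function below splits an Illumina-style FASTQ header into named fields. The header layout is instrument-dependent, so this is an assumption for illustration and not the actual DataPLANT parser.

```python
def parse_illumina_header(header: str) -> dict:
    """Split an Illumina-style FASTQ header into metadata fields.

    Assumes the common '@instrument:run:flowcell:lane:tile:x:y' layout;
    other instruments use different header conventions.
    """
    fields = header.lstrip("@").split(" ")[0].split(":")
    keys = ["instrument", "run", "flowcell", "lane", "tile", "x", "y"]
    return dict(zip(keys, fields))
```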


  3. Maintenance:

    3.1 Data storage

Raw data collected in the previous steps is stored immediately#if$_DATAPLANT using the infrastructure of DataPLANT.#endif$_DATAPLANT#if!$_DATAPLANT in a secure infrastructure; an ARC (Annotated Research Context) is used as a container to store the raw data as well as the metadata and workflows.#endif!$_DATAPLANT

    3.2 Data curation

#if$_DATAPLANT Data stored in the ARC is curated regularly whenever updates or revisions are needed.#endif$_DATAPLANT #if!$_DATAPLANT Data is curated regularly whenever updates or revisions are needed.#endif!$_DATAPLANT



  4. Publication and sharing

    4.1 Data publishing

#if$_RNASEQ Transcriptomics and gene sequence data will be made available upon publication via the standard repositories ENA/SRA. #endif$_RNASEQ #if$_METABOLOMIC Metabolite data will be deposited in e.g. Metabolights (and/or nationwide repositories like the German NFDI or the French INRAe). #endif$_METABOLOMIC #if$_PROTEOMIC Proteomics data will be deposited in e.g. PRIDE/ProteomeXchange. #endif$_PROTEOMIC In addition, the national resource will maintain safekeeping of the data after the project ends. #if$_DATAPLANT Databases like ProteomeXchange do not support deep plant-specific metadata; hence ARCs will be maintained to ensure reusability. #endif$_DATAPLANT

    4.2 Data sharing

If data is shared only within the consortium, e.g., because it is not yet finished or under IP checks, the data is hosted internally and a username and password are required (see also our GDPR rules). Once data is made public in final EU or US repositories, completely anonymous access is normally allowed. This is the case for ENA as well, and both modes are in line with GDPR requirements.

Metadata focus timeline

Study initialization: The metadata of the study is created at the beginning of the project and updated continuously afterwards#if$_DATAPLANT; the input of the DMP generator created during the proposal stage can be reused#endif$_DATAPLANT.

Sample collection: The information used to identify exact samples is initiated before the experiments and updated at the assay creation stage. #if$_DATAPLANT The sample SWATE template will be used to document the sample metadata. The part of the sample metadata that can be retrieved from the raw data will be updated afterwards using the ARC parsers. #endif$_DATAPLANT

Assay creation: Assay metadata must be collected as a daily routine during the experimental phase. #if$_DATAPLANT Electronic lab notebooks will be used to guarantee the applicability and correctness of the notebook content. #endif$_DATAPLANT

Computational analysis: Workflow annotation will be conducted during the computational analysis phase. #if$_DATAPLANT The workflow metadata will be stored in the assay folder of the ARC. #endif$_DATAPLANT

Results sharing: The metadata of results is collected after all modifications and should not be changed after publication. #if$_DATAPLANT Collection of result metadata before publication and the conversion from the ARC to the repositories will be handled by the ARC2REPO converter with minimal effort. #endif$_DATAPLANT


Preferred formats for raw data

#if$_GENOMIC

.h5: Hierarchical Data Format
.bam: compressed binary version of a SAM file
.cram: compressed columnar file format for storing biological sequences aligned to a reference sequence
.fa, .faa, .fas, .fasta, .ffn, .fna, .frn: fasta
.fastq, .fq: fastq
.sff: sff-trim

#endif$_GENOMIC


#if$_RNASEQ

.bam: compressed binary version of a SAM file
.cram: compressed columnar file format for storing biological sequences aligned to a reference sequence
.fa, .faa, .fas, .fasta, .ffn, .fna, .frn: fasta
.fastq, .fq: fastq
.fast5, bas.h5: HDF5
.h5: Hierarchical Data Format
.sff: sff-trim

#endif$_RNASEQ


#if$_METABOLOMIC

.cdf: netCDF (AIA/ANDI) interchange data format
.cmp, cdf.cmp: netCDF compare file
.abf: Axon Binary File
.d: Agilent
.dat: Chromtech, Finnigan, VG
.idb: MASSLAB binary file
.jpf: Mass Center Main Mass Spectrometry Data (JEOL USA, Inc.)
.lcd: Shimadzu LC Solution / Labsolutions Data File
.mgf: Mascot Generic File
.raw: Thermo Xcalibur, Micromass (Waters), PerkinElmer, Waters
.scan: a spectrum or a Total Ion Chromatogram (TIC)
.wiff: ABI/Sciex
.xps: Thermo Fisher Scientific K-Alpha+ spectrometer file

#endif$_METABOLOMIC


#if$_PROTEOMIC

.baf: Bruker
.d: Agilent
.dat: Chromtech, Finnigan, VG
.fid: Bruker
.ita: ION-TOF
.itm: ION-TOF
.mgf: Mascot Generic File
.ms: Finnigan (Thermo)
.ms2: Sequest MS/MS peak list
.pkl: Micromass peak list
.qgd: Shimadzu
.raw: Thermo Xcalibur, Micromass (Waters), PerkinElmer, Waters
.raw: Physical Electronics/ULVAC-PHI
.sms: Bruker/Varian
.spc: Shimadzu
.splib: spectral library file
.t2d: ABI/Sciex
.tdc: Physical Electronics/ULVAC-PHI
.wiff: ABI/Sciex
.xms: Bruker/Varian
.yep: Bruker
.dta: Sequest MS/MS peak list
.msp, .nist: NIST spectral library formats

#endif$_PROTEOMIC



Data Management Plan

Project name: $_PROJECT

Research funder: Bundesministerium für Bildung und Forschung (German Federal Ministry of Education and Research)

Funding program:

FKZ (funding reference number):

Principal researcher/scientist:

ID of principal researcher/scientist: $_USERNAME

Data management contact person: $_DATAOFFICER

ID of data management contact person:

Contact: $_EMAIL

Project description:

Date created:

Date modified:

Requirements to be observed:

Data storage

File naming follows this standard:

Files are stored in formats that are as open and standardized as possible.

Data documentation

The following documents will be created:

Public data will be extracted as described in the previous paragraph. For $_PROJECT, specific data sets will be generated by the consortium partners.

#if$_RNASEQ

Short read sequencing will either be collected or outsourced and raw data will be received.

#endif$_RNASEQ#if$_METABOLOMIC

Metabolomic data will be generated mostly using chromatography coupled to mass spectrometry and from enzyme platforms.

#endif$_METABOLOMIC#if$_PROTEOMIC

Proteomic data will be generated using EU platforms that are in line with community standards.

#endif$_PROTEOMIC

#if$_PREVIOUSPROJECTS Data from previous projects such as $_PREVIOUSPROJECTS will be considered. #endif$_PREVIOUSPROJECTS

Legitimacy

Data Sharing

Data preservation

Data management plan of $_PROJECT for BBSRC

a document template