RDM: FAQ
General
Where can I go with questions about Research Data Management?
Either the Faculty Data Steward or the Library’s RDM Support Desk. The Data Steward can help you best with questions that are specific to the work we do at the faculty, and with questions about specific grants; the RDM Support Desk is best equipped to deal with questions that could be asked by any VU researcher, for example about storage or archiving options. But both work together, so you can’t ask the wrong person.
Where can I find more information about Research Data Management?
We maintain a number of resources you can consult. At the VU level, there is the Research Data Support Portal which contains links to anything you might want to know about RDM, and the library maintains a series of Libguides explaining various topics related to the management of your data. At the faculty level, the data steward maintains a page giving advice that is specific to researchers in Social Sciences, including links to content you can use in proposals.
I don’t have data. I only have observations.
As RDM experts, we would say that observations are also a type of data. But that doesn’t really matter: in any case, you want to protect your observations, make backups of your observations and make sure that they are archived securely or shared with the world so that you can demonstrate you did your research well. That is to say, many important aspects of Research Data Management apply whether you call your observations data or not.
Data Storage and Security
Where should I store my data?
Our advice is to use Yoda for storage and archiving, even for sensitive data, and to only use VU-managed devices to access the data. You can find other options in the Storage Finder, but check with the Data Steward whether they are suitable for the sensitivity level of your data.
What is the difference between data storage and archiving?
Data storage refers to where you save your data during the research. Your data storage option needs to be available to all collaborators, while still ensuring a sufficient level of security.
Data archiving refers to where you keep your data after the research is finished. Archived data may still need to be accessed, for example when there are doubts about research integrity. Your data archiving solution can be publicly accessible if there is no sensitive data, or restricted-access if there is. The solution needs to be permanent and secure, so that the data cannot be changed, and any links to it will remain functional indefinitely.
What security measures should I take?
When using personal data (see below), per GDPR you should take “appropriate organizational and technical measures” to secure your data. The specific actions you should take are not set in stone; you should consider the potential consequences of a data breach, and whether or not the actions you would have to take to prevent them are reasonable.
All VU storage solutions offer a number of security measures. For example, access is only allowed using passwords and multi-factor authentication. If your data is sensitive, there are a number of additional security measures you can take that reduce the risk of a data leak, either by reducing the chance that a leak happens, or by reducing the impact of a leak:
- Make sure people only have access to the data they need to do their task in your project. For example, with Research Drive it is possible to give each collaborator only access to the folders they need.
- Don’t sync data from your Research Drive to your personal computer if you don’t have to. For example, once you are done with your raw data, having it on your personal computer only increases the chance that your data is leaked: keep the raw data online-only, and only sync the processed (pseudonymized) data you are working on.
- Make sure everyone in your project is trained in security procedures, such as strong passwords, not clicking attachments in emails from unknown senders etc.
- Pseudonymize your personal data by removing any directly identifying information, so that any data that is leaked is less likely to be linked to your respondents. If you need to keep the directly identifying data (for example, because it’s part of your raw data which you want to keep to demonstrate the provenance of your data, or because you need to contact participants for follow-up), make sure this data will not be leaked at the same time as the pseudonymized data. You can do this either by storing it separately or by encrypting it. See “When should I pseudonymize?”, below.
- Encrypt your data, so that if someone accesses the hard drive that holds your data, they can’t read the data. Software such as Cryptomator makes encryption very convenient. Encryption does have a large downside: loss of your password means loss of the data. You can use a password manager to minimize this risk, but it is wise to think twice before deciding to use encryption.
For help on deciding what measures are appropriate for your data, and with the practical implementation of any of these, you can contact your data steward.
Personal Data
What is personal data?
Personal data is any data that can be directly or indirectly linked to a living person. You can directly link data to a person if a direct identifier like their name, phone number, email address etc. is included in the data. You can indirectly link the data if you can combine the data with another piece of data or information to find the person who the data is about. This is possible for more data than you think, so if you collected data from people, it’s safe to assume your data is personal data, even if you remove things like names, phone numbers and addresses.
What is the difference between anonymization and pseudonymization?
Both these terms mean that you make it less likely that the data that you have can be linked to your respondents, increasing the security of your data. In case of pseudonymization, you remove the possibility of directly linking the data to your respondents, by removing things like names and addresses from the data. Anonymization removes entirely the possibility of linking your data to your respondents, both directly and indirectly. This means that the data is no longer personal data, and GDPR does not apply. However, anonymization is difficult and we don’t usually recommend it (see below).
When should I pseudonymize?
There is no “one-size-fits-all” answer to this: in general, we do recommend pseudonymizing your data, but in some cases the benefits of pseudonymization may not outweigh the costs. These costs and benefits depend on the nature of your data. A tabular data set is easily pseudonymized by dropping certain variables and generating random identifiers, so it should probably be pseudonymized. On the other hand, for an audio recording it may be practically impossible to edit out all the names. Likewise, the benefits differ; for a dataset containing speeches by famous politicians, leaving out the names will not make identification appreciably more difficult, and will yield no security benefits as the data is publicly available anyway. Whether the benefits of pseudonymization outweigh the costs thus depends on the specific project. If you feel the benefits don’t outweigh the costs, feel free to contact your Data Steward to see if they agree, and make sure to write down your reasoning in your Data Management Plan.
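As an illustration of the tabular case, the following is a minimal sketch in Python of dropping directly identifying variables and generating random identifiers. The column names, the example rows and the function name `pseudonymize` are hypothetical and not part of any VU tool. The sketch returns the key table separately, on the assumption that it will be stored apart from the pseudonymized data or encrypted.

```python
import secrets


def pseudonymize(rows, direct_identifiers):
    """Split rows into pseudonymized records and a separate key table.

    rows: list of dicts, one per respondent.
    direct_identifiers: column names to strip (for example name, email).
    """
    pseudonymized = []
    key_table = []
    used_ids = set()
    for row in rows:
        # Draw a random identifier that carries no information about the person.
        pid = secrets.token_hex(4)
        while pid in used_ids:
            pid = secrets.token_hex(4)
        used_ids.add(pid)

        # Research data: every variable except the direct identifiers.
        research = {k: v for k, v in row.items() if k not in direct_identifiers}
        pseudonymized.append({"pid": pid, **research})

        # Key table: the only place where the random id is linked to the person.
        identifiers = {k: row[k] for k in direct_identifiers if k in row}
        key_table.append({"pid": pid, **identifiers})
    return pseudonymized, key_table


# Hypothetical example rows, invented for this sketch.
rows = [
    {"name": "A. Jansen", "email": "a.jansen@example.org", "age": 34, "satisfaction": 4},
    {"name": "B. de Vries", "email": "b.devries@example.org", "age": 51, "satisfaction": 2},
]
pseudo, key = pseudonymize(rows, ["name", "email"])
print(pseudo)
```

Note that the result is still personal data: variables such as age remain and may allow indirect identification.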
Should I keep my unpseudonymized data? If so, where?
If possible, directly identifying data is kept completely separate from research data. For example, your Qualtrics form should not contain fields for email addresses if that’s not needed for the research itself. If you need email addresses to send rewards, use a separate form. In this way, you can destroy any personal data as soon as possible, without editing the raw data. However, this is not always possible, since sometimes the directly identifying data is an integral part of the raw data (for example in video recordings). In these cases you should not destroy the data, because you should keep an unedited version of your data for transparency purposes. This raw copy of the data should be stored safely, and in such a way that a data breach doesn’t necessarily mean a breach of both pseudonymized and unpseudonymized data. Examples are:
- Store the raw data on a separate server (however, most research programs don’t have two servers available).
- Store both raw and pseudonymized data on the same server (or device) but encrypt the raw data. You should make sure that you can’t lose your encryption password, or else you lose your raw data.
- Keep both raw and pseudonymized data on the same server, but make sure the raw data is never synced to personal computers or other devices (for example by adding it to a Yoda Vault). This way, the raw data is protected from the most common data breaches (e.g. losing a laptop on the train).
Why is it so hard to anonymize data?
Anonymization is potentially very attractive because it removes the need to comply with GDPR. However, it is difficult to combine with the goals of researchers in practice. This is because it will almost always involve making data less detailed, which will harm your ability to draw conclusions from the data.
To see why, first consider a quantitative data set about work satisfaction, containing gender and age of all respondents. If I know my colleague is a respondent in this survey, I may be able to infer things about my colleague from the public data set. If only one person in the data set matches his age and gender, I have successfully (indirectly) identified him in the data set. If multiple people matching his age and gender are present, but none has indicated liking their colleagues, I have still inferred something about him, and may become very disappointed! To prevent me from identifying my colleague, you as the researcher should thus ensure that there are no unique combinations of age and gender (for example by using broader age bins) and that within each combination of age and gender there is sufficient variation in answers that nothing can be inferred about individuals (so there is always a mix of people who like their colleagues and those who don’t). It is easy to see how the binning of variables may lead to less precision in the analysis, and how difficult it is to ensure that proper variation exists in all (combinations of) variables. There are ways to do this, but it is usually more attractive to keep the data as personal data, even if this puts restrictions on data use due to GDPR.
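The two checks described above can be sketched in a few lines of Python. This is an illustration only, not a substitute for a dedicated tool. The variable names, the ten-year bin width, the threshold `k` and the example rows are assumptions made for this sketch. In the anonymization literature, the first check is commonly called k-anonymity and the second is related to l-diversity.

```python
from collections import defaultdict


def age_bin(age, width=10):
    """Map an exact age to a bin label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def risky_groups(rows, k=2):
    """Return the (age bin, gender) groups that are too small or too uniform."""
    answers = defaultdict(list)
    for row in rows:
        group = (age_bin(row["age"]), row["gender"])
        answers[group].append(row["likes_colleagues"])

    problems = {}
    for group, values in answers.items():
        if len(values) < k:
            # Too few people share this combination: they may be singled out.
            problems[group] = "fewer than k respondents"
        elif len(set(values)) < 2:
            # Everyone in the group answered the same: the answer can be inferred.
            problems[group] = "all respondents gave the same answer"
    return problems


# Hypothetical example rows, invented for this sketch.
rows = [
    {"age": 31, "gender": "m", "likes_colleagues": True},
    {"age": 38, "gender": "m", "likes_colleagues": False},
    {"age": 45, "gender": "f", "likes_colleagues": True},
    {"age": 47, "gender": "f", "likes_colleagues": True},
    {"age": 62, "gender": "m", "likes_colleagues": False},
]
problems = risky_groups(rows, k=2)
print(problems)
```

In this example, the men in their thirties pass both checks, the women in their forties all gave the same answer, and the man in his sixties is the only one in his group. Widening the bins would remove the last problem, at the cost of less detail in the age variable.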
Qualitative data sets are usually so rich that all observations are unique, and thus potentially identifiable by someone who knows your respondents well (or otherwise has detailed information on them). Qualitative data is therefore usually impossible to fully anonymize, though pseudonymization may be possible.
I know it’s difficult, but I would still like to anonymize my data. How do I do this?
That’s great! A good place to start is the R package sdcMicro. Your data steward may be able to help out when using it. Alternatively, there is Amnesia. Note that anonymization means modifying your data, so if you want to anonymize data for replication purposes, not all analyses that you did with your unanonymized data can be fully replicated using anonymized data. This is acceptable, if explained properly in your paper (e.g. in a footnote).
I don’t have informed consent forms for my research. Is that bad?
It’s not necessarily bad, because written informed consent is only required by law in cases of health research (where WMO applies). You can have participants give informed consent orally if you’re not doing WMO research, but make sure you record it and store it safely. You can ask advice from your data steward or privacy champion if you plan to ask for oral informed consent. There are also other legal grounds (than informed consent) on which you can do research. However, if you should have asked for informed consent, but did not do so, that could be bad and we recommend that you contact your privacy champion as soon as possible.
Data Management Plans (DMPs), metadata and FAIR Data
How can I start writing a Data Management Plan?
You can log into DMPOnline with your VU credentials to start writing a DMP. It has templates of most funders which are kept up to date by the university library. If you need any help (for example with the technical terms used in many DMP templates), feel free to contact the faculty data steward.
Where can I find examples of Data Management Plans?
DMPOnline has a large number of Data Management Plans from which you can get inspiration for your own DMP.
What DMP template should I use?
For projects involving personal data, it is recommended that you use the VU template. You can currently only access the VU Template by ticking the box “No funder associated with this plan or my funder is not listed”. This template is accepted by ZonMW and NWO. By using this template you make sure that the information of your project can be used in the “GDPR registry”, which the VU is obligated to maintain and provide to the privacy authorities on request.
What if my funder doesn’t accept the VU template?
You can make use of the template provided by your funder (most funders’ templates are on DMPOnline). To make sure your project is included in the GDPR registry, you need to fill out the VU GDPR registration form for research, which you can find as a template on DMPOnline: select “Create Plans”, and then make sure to tick the box “No funder associated with this plan or my funder is not listed”.
I never created a Data Management Plan. Is that bad?
There are some situations in which writing a DMP is mandatory. For example, if you have received a grant, you almost always have to write a DMP. The Faculty of Social Sciences also requires you to write a DMP for any new research project you start. And DMPs are sometimes necessary components of various requests, such as an ethics application (for a full procedure) and a storage application (in some cases). And obviously, if you are following the course “Writing a DMP”, you have to write a DMP to complete the course.
For research that is already underway, not having a DMP is not necessarily bad. Writing one is still good practice: it keeps you accountable, leaves less to chance and luck, and helps you avoid last-minute panic. If you would like to write a DMP for ongoing research, you can get in touch with your data steward or the RDM Support Desk.
What is metadata?
Metadata is data about your data. It is simply information such as authors, collaborators, dates, description and keywords. It is not the data itself. So even if your data itself is very sensitive, the metadata may be freely published (though in some cases metadata can be sensitive as well).
What metadata and documentation should I include with my data?
If you use Yoda, you can simply fill out the metadata form included in the portal. Otherwise, you can use this file to write a “readme file” that contains the same information.
As for documentation: this is very much dependent on your data. A useful exercise is to imagine yourself having to take a break from research for a few years. What information would you need to get back into understanding your data again? This can include questionnaires, codebooks, field manuals, topic lists, proposals, ethics applications, data management plans etc.
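If you write a readme file by hand, a small script can help keep it consistent across data sets. The following Python sketch is only an illustration: it does not reproduce the Yoda metadata form or the readme template referred to above. The field names are taken from the answer to “What is metadata?”, and the function name `build_readme` and the example values are hypothetical.

```python
def build_readme(metadata):
    """Render a plain-text readme with one 'Field: value' line per field."""
    lines = []
    for field, value in metadata.items():
        # Fields with several entries (authors, keywords) are joined on one line.
        if isinstance(value, (list, tuple)):
            value = "; ".join(value)
        lines.append(f"{field}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical example values, invented for this sketch.
metadata = {
    "Title": "Work satisfaction survey (hypothetical example)",
    "Authors": ["A. Researcher"],
    "Collaborators": ["B. Colleague", "C. Partner"],
    "Dates": "2023-01-10 to 2023-03-31",
    "Description": "Survey responses, pseudonymized.",
    "Keywords": ["work satisfaction", "survey"],
}
readme_text = build_readme(metadata)
print(readme_text)
```

The resulting text can be saved next to the data, for example as a file named README.txt in the same folder.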
What metadata standard should I use?
Some DMP templates ask this, but it’s usually not something you as a researcher should worry about. You should let this depend on the service you use to register, archive and/or publish your data set. The forms on those services will ensure that everything is stored according to a certain standard. PURE uses a standard called CERIF; Yoda uses DataCite.
What does FAIR data stand for?
FAIR stands for Findable, Accessible, Interoperable, and Reusable. Many of these sub-elements of FAIR cover a wide spectrum of possibilities: some data can be made more easily accessible than others, for example. Advice on how to make your data FAIR can be found in the FSS Archiving Guidelines.
My funder wants me to make my data FAIR, but I can’t share the data because of privacy concerns.
This is not a problem. FAIR requires data to be Accessible, which does not mean publicly available. It means that there is a well-defined procedure for accessing the data, for example in cases where there are doubts about scientific integrity. Archiving your data securely on Yoda is sufficient for this.