Provenance#
Provenance, also referred to as lineage, is the documentation of a dataset origin. It includes how the data was collected or generated, which methodologies, instruments and/or software were used for its creation. You could think of it as the data workflow from the start of a project to the point a dataset is published and you have the project final product.
Provenance can be complex, as usually data goes through a lot of steps and re-iterations. Given the nature of research itself, the objectives and methods of your analysis might change numerous times, you might keep some of the steps and modify others. Which is why having a good provenance is so important for the reproducibility of your research. There are tools available to help record some of the steps automatically, but ultimately none of them will produce a good provenance record without regular manual intervention. It is not enough to track the changes, you also need to know why they happened.
Important things to keep track of are:
what data was used as input, if any, it helps if the dataset has been published and has a DOI that can be referred to, or it is at least well documented and properly versioned. While there are situations where you do not have choice, when you do you should always prefer a well-documented dataset to one which is poorly documented.
use a version control system for your analysis code, we recommend
git
. Git and GitHub come with lots of tools and options, make the most of them: readme files, releases, issues, project plans and commit messages, they all help with not only tracking the changes but why they happened.if using someone elses code, as for input data, make sure it is properly documented and versioned.
use good coding practices and metadata conventions for your dataset. Metadata and use of standards make it easier to share data and code, and in fact might be required. It will also help in the future if you need to get back to them after a long break.
review your provenance often. Make a habit at the end of a working day to ensure your previous notes, metadata etc are all still relevant. It will only take a few minutes.
In conclusion, provenance is a progressive account of your research, part of the provenance will be directly attached to the data or the code you used, but it is good to have one document that collects all the other sources. A data management plan is a good template for such a document. If you create one at the start of your project and update it regularly you will have your work done when you want to publish the data, when you need to describe your research in a paper, or even before leaving an institution at the end of your PhD or postdoc.
The ARDC also has a lot of online resources covering data provenance.