Hey Claudia,
I know we've reached out to the OSF for the same reason (data sovereignty rules in Australia). Ping me outside of this thread and I'll connect you with the person in charge of poking our institutional data repository. It's a bit larger in scope than what you're thinking of, but I suspect much of our discussion is transferable.
Cheers,
-Brian
________________________________
From: Claudia Beleites <***@chemometrix.eu>
Sent: Sunday, 22 July 2018 11:42:49 PM
To: Tim Head via discuss
Subject: Re: [discuss] Version control and collaboration with large datasets.
Tim (and everyone here with the same questions),
OSF is definitely something I'll check out.
However, I note that the privacy document explicitly spells out that, as a US-based repo, it does not meet the requirements of EU privacy legislation (and I've been working with sensitive/patient data, so privacy and related security aspects are an important consideration). Add to this the experience that some research labs prefer to keep their data in-house. My guess is that a system that can be set up in-house* would have a much better chance of being approved by management over here, also because of legal considerations.
* or in a DMZ, giving labs the chance to expose the public parts of their projects - or to run two instances, an internal one kept very much in-house and an exposed one for the public parts.
As OSF states it is FOSS, this should be possible, but I did not immediately see instructions on how to run it on your own server, nor the technical requirements. Could you point me to such information, or is there even something like a "we run our own instance" user group?
Many thanks,
Claudia
On 21.07.2018 at 19:29, Tim Head via discuss wrote:
Hello all,
In the hopes of making it easier to use osf.io with large datasets, last summer we* had some time and funding to start building http://osfclient.readthedocs.io/en/latest/cli-usage.html, which is both a command-line program and a Python library for osf.io. The tool works well for gigabyte-sized files, and a small community of people is starting to contribute fixes and new features when something they need is missing. It would be great to grow this further.
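For example, grabbing all files of a public project with the Python library looks roughly like this (a minimal sketch; the project ID 'abcde' is made up):

    # list and download all files of a public OSF project (no login needed)
    from osfclient import OSF

    osf = OSF()                               # anonymous, read-only access
    project = osf.project('abcde')            # hypothetical project ID
    storage = project.storage('osfstorage')   # the default storage provider
    for file_ in storage.files:
        print(file_.path)
        with open(file_.name, 'wb') as fp:    # flat local copies, name clashes aside
            file_.write_to(fp)                # streams the remote file to fp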
Maybe this removes that one last hurdle that was stopping you from putting all your datasets on osf.io (when we asked about size limits, they were confident no one would ever reach them ... and I still don't know anyone who has).
T
* "we" in this case is Titus Brown and me
On Sat, Jul 21, 2018 at 6:29 PM Claudia Beleites <***@chemometrix.eu> wrote:
Hi all,
I'm also very interested in learning about solutions for this.
At the moment I distinguish two use cases:
- focus of project is coding (developing software/package/library) vs.
- focus of project is data analysis, with the sub-topic of projects
where various "slices" of the data are important.
**Code project**
I have one project where I use git-lfs on github. The project is about *code* (an R package) that, however, has some 100 MB of binary data attached to it (it was larger at some point, before I could get smaller but equally suitable example files for some formats). The binary data are example files in various file formats for the file import filters the package provides. Initially, we had them in git as well, but that horribly bloated the repo, so it became unusable after a few years. The files themselves, however, hardly need any versioning: I get them and store them as they are, and only very occasionally is one of those files replaced. The main point of the git-lfs storage is to make sure that all files are where they are supposed to be, without too much manual hassle.
At some point I was lucky to get a github promo offer for free git-lfs (test) usage and gave it a try - which is the current state.
Experiences:
- (due to the free promo, I don't have bandwidth billing trouble)
- Git is largely independent of git-lfs: you can still fork/clone the git-only part of the repo and work with that. For the project in question, the files stored in git-lfs are only needed for developing and unit-testing the file import filters; everything else does not need git-lfs. I decided I don't want to force collaborators to install git-lfs, so I set up the project in a way that e.g. the file filter unit tests check whether those files are available and, if not, skip those tests (visibly; see the sketch after this list).
This also makes sense because of the size restrictions for R package submissions to CRAN, and as I'm the maintainer in the view of CRAN, I can always make sure I properly run all tests.
- With this setup, I do not experience the collaboration trouble/broken-forking issues Peter Stéphane describes in the link in Carl's mail - at least not for the parts of the project that are stored as "normal" git. I've not yet had anyone try to directly submit files that should go into the lfs part of the repo.
- I tried to get git-lfs installed together with a private gitlab instance (thinking we might want to use it for data-type projects), but like Carl, I gave up. That was IIRC 3 years ago, so things may have improved in the meantime.
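To illustrate the "check and skip visibly" pattern: a minimal sketch in Python/pytest (the actual project is an R package, where testthat's skip functions play the same role; the directory and file names here are made up):

    # test_import_filters.py - skip the import-filter tests when the
    # git-lfs example files are not checked out, and say so in the output
    import os
    import pytest

    DATA_DIR = "tests/testdata"   # hypothetical location of the lfs files

    requires_example_files = pytest.mark.skipif(
        not os.path.isdir(DATA_DIR) or not os.listdir(DATA_DIR),
        reason="git-lfs example files not available - skipping import-filter tests",
    )

    @requires_example_files
    def test_import_vendor_format():
        path = os.path.join(DATA_DIR, "example.spc")  # made-up file name
        assert os.path.getsize(path) > 0              # placeholder check

pytest then reports these tests as skipped, with the reason, rather than failed - so collaborators without git-lfs see at a glance which tests did not run.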
For other "code-type" projects (model/algorithm development), I tend to take a two-layered approach. Data sets that are small enough to be shipped as example and unit test data, say, in an R package are kept with the code. In fact, many of them are toy data computed from code, and I just store that code. The second layer are well-known example data sets, and there I simply rely on those data sets staying available. (I'm talking about e.g. the NASA AVIRIS data sets, https://aviris.jpl.nasa.gov/data/free_data.html)
(Side note: I'm somewhat wary of papers proposing their own new algorithm solely on their own data set, and of algorithm comparisons based on only one or a few data sets.)
**Data Project**
This is where I think things could be improved :-)
The majority of projects I work on are data analysis projects, i.e. we have measurement data, do an analysis and draw conclusions, and write a report or paper.
For these projects, we tend to take a "raw data and code are real" approach, which also implies that the raw data are never changed (with the only exception of renaming files - but the files I'm thinking of store their original name, so even that can be reconstructed). So we basically have storage and distribution needs, but not really versioning needs. We sometimes produce pre-processed intermediate data, but that again is defined by the code that produces it from the raw data, and the results are considered temporary files. If I do manual curation (mostly excluding bad runs with certain artifacts), I produce code or data files that say which files were excluded and for what reason. Most of this can be, and is, done in an automated fashion, though.
Producing versions that are to be kept (such as snapshots of the state of the data for a paper) is sufficiently infrequent that we just zip those data and keep the version in the file name.
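For illustration, that snapshot step is essentially this (a minimal Python sketch; the directory and version names are made up):

    # snapshot.py - pack the read-only raw data into a zip whose file
    # name carries the version, e.g. rawdata-paper-2018.zip
    import zipfile
    from pathlib import Path

    def snapshot(raw_dir="rawdata", version="paper-2018"):
        root = Path(raw_dir)
        archive = Path(f"{root.name}-{version}.zip")
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for p in sorted(root.rglob("*")):   # stable order in the archive
                if p.is_file():
                    zf.write(p, p.relative_to(root.parent))
        return archive

    if __name__ == "__main__":
        print("wrote", snapshot())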
Recently, I tend to use Nextcloud to share such data. We did use git for a while, but with large amounts of data that becomes cumbersome, and we found that few collaborators were willing to learn even just the level of git that lets them clone and pull. ownCloud/Nextcloud is a much lower barrier in that respect.
At the moment, I think what I'd like to see would be Nextcloud with commits, ignores, and maybe a somewhat more distributed, less centralized approach ...
Versioning binary data would be far more important for colleagues who extensively use GUI software for their analyses: not all of the relevant software keeps logs/recovery data (some does, though, as it is meant to be used in fields like pharma where full audit trails are required).
**Data Projects II**
(Here I see huge possibilities for improvement)
OTOH, we also have some projects where it is clear that a large variety of subsets of the data will be requested and analysed, and we've set up databases for those purposes. Here again, I do dumps/backups, and on the rare occasion that a version should be tagged, that can be done on the backup/dump. Again, these databases are set up in a way that easily allows adding/inserting, while changing or deleting requires admin rights - and the admin should make sure of the backup before doing any such "surgery" on the database.
I may say that I'm originally from a wet-lab field (chemistry): I'm trained to work under conditions where mistakes irretrievably mess things up. Version control and being able to undo mistakes are good and important, but if these techniques (luxuries?) are not available at every point, then that's just how it is right now.
I admit that I never bothered implementing full audit trails - and the supervisors I had were already suspicious of whether it was worthwhile to set up a database at all, and very much against "wastes of time" such as (for code projects) unit testing and encapsulating code in packages/libraries/their own namespace ...
I've met one research institute, though, that runs a full LIMS (laboratory information management system), which, however, is more suited to situations where the same types of analyses are repeatedly done on new samples than to research questions where not only the samples but also the analysis methods change from project to project.
But e.g. REDCap (https://projectredcap.org/) produces databases with audit trails. (I've never tried it, though.)
Best,
Claudia
--
Claudia Beleites Chemometric Consulting
Södeler Weg 19
61200 Wölfersheim
Germany
phone: +49 (15 23) 1 83 74 18
e-mail: ***@chemometrix.eu<mailto:***@chemometrix.eu>
USt-ID: DE305606151
------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M3d0f5f1ce18277b9088f646f
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription