[discuss] Version control and collaboration with large datasets.

Discussion:

thompson.m.j via discuss

2018-07-20 16:08:01 UTC

Hello all,
I am a member of a computational biology lab that models processes in developmental biology and cell signaling and calibrates these models with microscopy data. I've recently gotten into using version control using git for our codes, and I am now trying to determine the best course of action to take for the data. These are the tools I'm aware of but have not tested:

The Dat Project https://datproject.org/
Git Large File Storage https://git-lfs.github.com/
Git Annex https://git-annex.branchable.com/
Data Version Control (DVC) https://dvc.org/

All projects seem to be aimed at researchers trying to integrate data versioning into their workflow and collaboration, and some seem to have a few other bells and whistles.

Now, the only reason I settled on using git for my work is that it seems to be the de facto standard version control just about the whole world uses. Using this same reasoning, does anyone here have a keen insight into which of the data versioning tools listed here or otherwise is (or will most likely become) the standard for data version control?
------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M26854e6b9b3500ea27de1bc9
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Dav Clark via discuss

2018-07-20 22:22:29 UTC

I highly doubt that any of them will become *the* standard, though you may
see some convergence like we saw with NetCDF4 and HDF5. For now, the most
robust solution is probably Git LFS. It's backed by a major company and
many commercial providers are competing to provide performant back ends. In
my work at Gigantum (which I'll talk about sometime soon!), we evaluated
what makes sense for most users and Git LFS was the clear answer. The data
synchronization model is simple and there are few choices beyond the level
of the repo. The standard thing is that all the files still come along with
the repo, though you can organize things so you only have copies of some
files.

Git Annex is the closest to the "right" solution for a "traditional"
workflow, IMHO. Joey Hess (who was a core Debian member for a long time)
developed it, and it's in Haskell, so the compiler is working with Joey -
who is already very smart and thoughtful. The plugin architecture, however,
means your mileage may vary and in my experience, working on Windows is not
to be taken lightly. On POSIX-y filesystems, git annex allows more
flexibility in terms of which files are where. Git Annex was chosen by the
datalad project in neuroimaging, and Joey is an advisor to that project:
http://www.datalad.org/

Dat is a whole 'nother level. Its synchronization layer is exciting, but
again probably a bit sharp-edged for an academic lab that just started
using git.

I know less about DVC. You might also throw quilt in there
(https://quiltdata.com/) - my sense is that they are trying to make it
closer to the kinds of datasets you have in R where you use the same
datasets again and again.

But to sharpen the question - it probably depends on the relationship of
the data to the code (one-to-one, one-to-many, etc.), and also the size (if
files are < 100MB you can just put them directly in regular git on
GitHub!). Also to a lesser extent the infrastructure you're using (laptops?
Shared server? network file share?), data use restrictions / privacy, etc.
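
As a first step in that reasoning, it can help to survey which files actually cross the roughly 100 MB mark mentioned above. A minimal Python sketch, assuming the data sit under a single directory; the function name `files_over_limit` and the default threshold are illustrative choices, not part of git or any of the tools discussed:

```python
from pathlib import Path

# Assumed threshold: the ~100 MB per-file size mentioned above.
LIMIT_BYTES = 100 * 1024 * 1024


def files_over_limit(root, limit=LIMIT_BYTES):
    """Return (path, size) pairs for files of at least `limit` bytes, largest first."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        # Ignore git's own object store if `root` is a repository.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size >= limit:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Running `files_over_limit("path/to/project")` gives the list of files that would need something other than plain git; an empty list suggests regular commits may be enough.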

I for one would be happy to read your reasoning "out loud" here.

Best,
Dav


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mbb70aabc93d6ea28e6776e97
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Waldman, Simon

2018-07-20 22:34:21 UTC

I don't have an answer, but I'll be really interested to hear your experiences. I'm in a slightly similar situation, in that I use git to store the inputs to hydrodynamic models - but while that worked nicely when I only had small inputs, some of the inputs are now many gigabytes and that ain't gonna work nicely without... one of the things that you mention!

So if you do investigate, please report back to the list!


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mdef0f116ac8073d883b70fd3
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Allen Lee

2018-07-20 23:30:18 UTC

If these data are an important piece of your computational workflow it
might make sense to archive & describe them in an actual data repository
(Dryad, figshare, OSF, etc.) and pull them / cache locally when you run
your calibration.
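
The "pull and cache locally" step could look roughly like the following Python sketch. It assumes the repository exposes a direct download URL for each file and that a SHA-256 checksum was recorded when the data were deposited; `fetch_cached` and `sha256_of` are hypothetical helper names, not part of any repository's client:

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_cached(url, cache_dir, filename, expected_sha256=None):
    """Download `url` to cache_dir/filename unless a copy is already cached."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    # Verify the cached copy so a corrupted or changed file is noticed.
    if expected_sha256 is not None and sha256_of(target) != expected_sha256:
        raise ValueError(f"checksum mismatch for {target}")
    return target
```

A calibration script would call `fetch_cached(...)` once per input file and then read from the returned path; repeated runs reuse the cached copy.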

--
Allen Lee
Associate Research Professional
Center for Behavior, Institutions, and the Environment <http://cbie.asu.edu>
Network for Computational Modeling in the Social and Ecological Sciences
<http://comses.net>
Arizona State University
Mail Code: 4804
Tempe, AZ 85287
*p: *480-727-4646
*email: ****@asu.edu
*web: *https://github.com/alee

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M7f55c6036bee6740add39cce
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

David Nicholson via discuss

2018-07-21 00:50:38 UTC

+1 for figshare
The site has an API, and there's various command line tools if you prefer
something git flavored, and they automatically version your data repo when
you make changes.
If you can make your data public or if you have access to an institutional
account then it's free unlimited space.

David Nicholson, Ph.D.
nickledave.github.io
https://github.com/NickleDave
Prinz lab <http://www.biology.emory.edu/research/Prinz/>, Emory University,
Atlanta, GA, USA


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M7bec8b0dbb705610c020040e
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Ben Marwick

2018-07-21 02:26:06 UTC

I've found it quite practical to host biggish data on an OSF repository, which has its own version control, and then use it in a local R session with the https://github.com/CenterForOpenScience/osfr pkg. The OSF repo also hooks into the GitHub repo with the R code, so data and code can all be archived in the one OSF repo.
------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M3d3e4bb2f0a49fdf2391282c
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Carl Boettiger via discuss

2018-07-20 23:43:29 UTC

Good question, I'd also be interested in comments on this.

I'd second Dav's comments that it depends on file size, and certainly for
files < 100 MB, simply committing these to git seems like the most reasonable
way to go.

Workflow-wise, I find Git LFS very compelling, but in practice, I found it
not to be viable for public GitHub projects in which you expect forks and
PRs. GitHub's pricing model basically means that Git LFS breaks the fork /
PR workflow (see
https://medium.com/@megastep/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91).
You can set up a different source (e.g. GitLab) to host the LFS part and
still have your repo on GitHub, see https://github.com/jimhester/test-glfs/,
but this was sufficiently cumbersome that I could not get it to work.

I have not experimented with Git Annex or Dat, but my understanding is that
while these provide a version control solution, they do not provide a
file-storage solution. Dat is a peer-to-peer model, which I believe means
you need some 'peer' server always on and running somewhere when you want
to access your data. My own need is almost the inverse of this problem --
I am primarily looking for a mechanism to easily share data associated with
a project that already lives on GitHub (possibly public, possibly private),
and I want a way to give collaborators / students access to both download
and upload the data without asking them to adopt a workflow of tools that
is any more complicated than it needs to be. e.g. sticking the data on
Amazon S3 is often good enough -- I can version data linearly with file
names, I do not need git merge capabilities -- but this does impose a
significant overhead for new users with needing to use aws cli or similar
and set up more authentication tokens. A small barrier but enough to
discourage collaborators.
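
Versioning data "linearly with file names" can be made a little less error-prone with a small helper that picks the newest file. A minimal Python sketch, assuming a naming scheme of the form `<stem>_v<number>.<extension>` (this scheme and the function name `latest_version` are illustrative assumptions, not something prescribed by S3 or any tool above):

```python
import re

# Assumed naming scheme: <stem>_v<number>.<extension>, e.g. "shapes_v3.zip".
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<version>\d+)\.(?P<ext>.+)$")


def latest_version(names, stem):
    """Return the name with the highest version number for `stem`, or None."""
    best = None
    best_version = -1
    for name in names:
        match = VERSION_PATTERN.match(name)
        if match and match.group("stem") == stem:
            # Compare versions numerically so that v10 sorts after v2.
            version = int(match.group("version"))
            if version > best_version:
                best, best_version = name, version
    return best
```

The list of names could come from a bucket listing or a directory listing; comparing the numbers as integers avoids the usual trap where `v10` sorts before `v2` alphabetically.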

My recent approach has been to piggyback > 100 MB files directly on GitHub
as 'assets', which can be up to 2 GB in size. This is not a robust
versioning solution (I believe that public, archival research data ought to
be deposited in a *data archive* and versioned there), and may not be a
good idea at all, but can be remarkably convenient for certain use cases
(like keeping your 100 MB to 2 GB spatial data shape files associated with the
repo where you're analyzing them). Not to subvert this thread, but if
you're curious about this approach using R, I have a little package to
facilitate this workflow: https://github.com/cboettig/piggyback ;
feedback/critique welcome.

Cheers,

Carl

--
http://carlboettiger.info

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Me96e4d3cbc9ff7c08c4d2d76
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Dav Clark via discuss

2018-07-21 17:30:39 UTC

Points of clarification...

On Fri, Jul 20, 2018 at 7:59 PM Carl Boettiger via discuss
We had a pretty top-notch engineer (Dean) dedicate serious effort to
our bespoke setup for this at Gigantum. Agreed - it is non-trivial to
leave behind GitHub for an inexpensive provider. I haven't checked in
with the various "enterprise" providers. If folks have specific
interest / questions, I can bug Dean about it.

Post by Carl Boettiger via discuss
I have not experimented with Git Annex or Dat, but my understanding is that
while these provide a version control solution, they do not provide a
file-storage solution. Dat is a peer-to-peer model, which I believe means
you need some 'peer' server always on and running somewhere when you want
to access your data. [...]

With specific regards to Git Annex, it does provide easy backup to a
variety of providers (s3, backblaze, rsync, ... see
http://git-annex.branchable.com/special_remotes/). It will even do
crazy things like "drop the file locally if at least 2 copies exist in
trusted repositories." You can use Git Annex also to track data that's
already backed up (e.g., at a URL that you trust) and it will still
checksum and verify it when you get a copy.
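
To make the "drop locally only if enough trusted copies exist" rule concrete, here is a conceptual Python sketch of that decision. It is not git-annex's implementation or API; the function name `safe_to_drop` and the repository names are hypothetical:

```python
def safe_to_drop(locations, trusted, here, min_copies=2):
    """Decide whether the local copy of a file may be dropped.

    locations: names of repositories known to hold a copy of the file
    trusted:   names of repositories considered trustworthy
    here:      name of the local repository
    """
    # Only copies held elsewhere, in trusted repositories, count.
    others = {repo for repo in locations if repo != here and repo in trusted}
    return len(others) >= min_copies
```

For example, with copies on `laptop`, `s3` and `labserver`, and the latter two trusted, the laptop copy could be dropped; with only one trusted copy elsewhere it could not.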

My understanding is that Dat is a bit more like bittorrent. You can
host stuff as much as you like, and drop when you want. But just like
with BitTorrent, it's not hard to set up a dedicated server that will
always host some content you care about.

These details remind me of another point, which is that no matter what
choice you make, the chances that it's a permanent solution seem slim,
even with something as flexible as Git Annex. So part of the thinking
is what your timeline for archival is, assuming that no one is finding
value in the data at the moment, and how easy it would be to
transition to something else. I'd argue that filesystem-inclusive
solutions, or super-standardized API-based systems (e.g. rsync, http,
SQL) are the best in that regard.

D

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M5c6f0a5f11ce5ae50994c6a9
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Brian Ballsun-Stanton

2018-07-21 03:33:11 UTC

So this touches on something that Shawn Ross has been working on for MQ in terms of data repositories.

Dataverse (OSS) and Figshare are good institutional data repos. I prefer dataverse because of its quality of metadata and OSS heritage. Figshare, being commercial, worries me as a long term provider. I wouldn't use either as a living version control repo though.

OSF.io is excellent as a publishing archive, and is a great frontend to whatever storage you decide to use.

Git lfs is free on github if you authenticate with education.github.com (but that's not widely advertised). It is also supported on gitlab and bitbucket.

The main question here, however, is: is your data binary or ASCII? Git doesn't really have many advantages on binary data.

It may be worth using a proper SQL database to maintain this data and to store data-dumps from the database.

But, before we can explore more deeply, we need to characterise your data and how you plan to use version control with it.

Is it:

Relational?

Text?

What size?

Sparse (lots of nulls?)

And in terms of questions you'll be asking of the data:

Do you only need the present version, with prior versions kept for recovery, or do you need to see how the data changes over time?

What software tools will you be using with the data?

Will you be using "the cloud?" or other HPC?

Will you be accessing the full dataset every time, or will you be doing lookups on subsets of the data?


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Ma7b92cfc00a5d9f102cfc2c2
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Claudia Beleites

2018-07-21 11:38:49 UTC

Hi all,

I'm also very interested in learning solutions for this.

At the moment I distinguish two use cases:

- focus of project is coding (developing software/package/library) vs.

- focus of project is data analysis, with the sub-topic of projects
where various "slices" of the data are important.

**Code project**

I have one project where I use git-lfs on github (got a promo offer for
free use). The project is about *code* (R package) that however has some
100 MB binary data attached to it (it was larger at some point before I
could get smaller but equally suitable example files for some formats).
The binary data are example files in various file formats for the file
import filters the package provides. Initially, we had them in git as
well, but that horribly bloated the repo so it got unusable after a few
years. The files themselves, however, hardly need any versioning. I get
them and store them as they are, and only very occasionally is one of
those files replaced. The main point of the git-lfs storage is to make
sure that all files are where they are supposed to be without having too
much of manual hassle.
At some point I was lucky to get a github promo offer for free git-lfs
(test) usage and gave it a try - which is the current state.

Experiences:

- (due to free promo I don't have bandwidth billing trouble)

- Git is largely independent of git-lfs: you can still fork/clone the
git-only part of the repo and work with that. For the project in
question, the files stored in git-lfs are only needed for developing and
unit testing of file import filters, everything else does not need
git-lfs. I decided I don't want to force collaborators to install
git-lfs, so I set up the project in a way that e.g. the file filter unit
tests check whether those files are available, and if not, skip those
tests (visibly).
This does also make sense because of size restrictions for the R package
submission to CRAN, and as I'm the maintainer in the view of CRAN, I can
always make sure I properly run all tests.

- With this setup, I do not experience the collaboration trouble/broken
forking issues Peter Stéphane describes in the link in Carl's mail. At
least not for the parts of the project that are stored as "normal" git.
I've not yet had anyone trying to directly submit files that should go
into the lfs part of the repo.

- I tried to get git-lfs installed together with a private gitlab
instance (thinking we may want to use it for data-type projects), but
like Carl, I gave up. That was IIRC 3 years ago, so things may have
improved meanwhile.
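
The "check whether the files are available, otherwise skip visibly" setup from the list above is for an R package, but the same idea can be sketched in Python with the standard `unittest` module. The directory, file name and test class here are hypothetical placeholders:

```python
import unittest
from pathlib import Path

# Hypothetical location of the large example files (e.g. fetched via git-lfs).
EXAMPLE_DIR = Path("tests/large-examples")


class ImportFilterTest(unittest.TestCase):
    def test_import_large_example(self):
        path = EXAMPLE_DIR / "spectrum.bin"  # hypothetical file name
        if not path.is_file():
            # skipTest reports the test as skipped instead of silently passing.
            self.skipTest(f"example data not available: {path}")
        # Placeholder check; a real test would run the import filter here.
        self.assertGreater(path.stat().st_size, 0)
```

Collaborators without the large files still get a passing test run, and the skipped tests show up in the report so nobody mistakes them for tested code.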

For other "code-type" projects (model/algorithm development), I tend to
take a two-layered approach. Data sets that are small enough to be
shipped as example and unit test data, say, in an R package are kept
with the code. In fact, many of them are toy data computed from code,
and I just store that code. The 2nd layer are well-known example data
sets, and there I simply rely on those data sets staying available. (I'm
talking e.g. the NASA AVIRIS data sets
https://aviris.jpl.nasa.gov/data/free_data.html)
(Side note: I'm somewhat wary of papers proposing their own new
algorithm solely on their own data set, and of algorithm comparisons
based on one or few data sets)

**Data Project**

This is where I think things could be improved :-)

The majority of projects I work on are data analysis projects. I.e. we
have measurement data, do an analysis and draw conclusions, write a
report or paper.

For these projects, we tend to take a "raw data and code are real"
approach that also implies that the raw data is never changed (with the
only exception of renaming files - but the files I'm thinking of store
their original name, so even that can be reconstructed). So we basically
have storage and distribution needs, but not really versioning needs. We
sometimes produce pre-processed intermediate data, but that again is
defined by the code that produces this data from the raw data, and the
results are considered temporary files. If I do manual curation (mostly
excluding bad runs with certain artifacts), I produce code or data files
that say which files were excluded and for what reason. Most of this can
be and is done in an automated fashion, though.
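
One way to keep such curation decisions as data rather than as edits to the raw files is a small exclusion list that the analysis code reads. A minimal Python sketch, assuming a CSV file with `filename` and `reason` columns; the column names and helper functions are illustrative assumptions:

```python
import csv


def load_exclusions(handle):
    """Read an exclusion list with 'filename' and 'reason' columns."""
    return {row["filename"]: row["reason"] for row in csv.DictReader(handle)}


def split_runs(filenames, exclusions):
    """Separate runs to analyse from excluded runs, keeping the stated reason."""
    kept = [name for name in filenames if name not in exclusions]
    dropped = [(name, exclusions[name]) for name in filenames if name in exclusions]
    return kept, dropped
```

The raw files stay untouched, the exclusion list is small enough to live in git next to the code, and the reasons can be printed in the report.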

Producing versions of this that are to be kept (such as making snapshots
of the state of data for a paper) is sufficiently infrequent to just zip
those data and have the version in the file name.

Recently, I tend to use nextcloud to share such data. We did use git for
a while, but with large amounts of data that does become cumbersome, and
we found that few collaborators were willing to learn even just the
level of git that lets them clone and pull. Owncloud/Nextcloud is a much
lower barrier in that respect.

At the moment I think what I'd like to see would be nextcloud with
commits, ignores and maybe a somewhat more distributed and less central
approach ...

Versioning binary data would be far more important for colleagues who
extensively use GUI software for their analyses: not all of the relevant
software does keep logs/recovery data (some do, though, as they are to
be used in fields like pharma where full audit trails are required).

**Data Projects II**

(Here I see huge possibilities for improvement)

OTOH, we also have some projects where it is clear that a large variety
of subsets of the data is to be requested and analysed, and we've set up
data bases for those purposes. Here again, I do dumps/backups, and in
the rare occasion that a version should be tagged that can be done to
the backup/dump. Again, these data bases are set up in a way that easily
allows adding/inserting, but changing or deleting requires admin rights
- and admin should make sure of the backup before doing any such
"surgery" to the data base.
I may say that I'm originally from a wet-lab field (chemistry): I'm
trained to work under conditions where mistakes irretrievably mess up
things. Version control and being able to undo mistakes is good and
important, but if these techniques (luxuries?) are not available at
every point, that's as it is right now.
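
The "adding is easy, changing or deleting is guarded" idea can be illustrated with SQLite from Python. Note the swapped-in technique: SQLite has no user roles, so this sketch uses triggers to reject UPDATE and DELETE outright rather than the admin-rights setup described above; the table and column names are hypothetical:

```python
import sqlite3

# Hypothetical schema: one table that accepts inserts only.
SCHEMA = """
CREATE TABLE measurements (
    id INTEGER PRIMARY KEY,
    sample TEXT NOT NULL,
    value REAL NOT NULL
);
CREATE TRIGGER measurements_no_update BEFORE UPDATE ON measurements
BEGIN
    SELECT RAISE(ABORT, 'measurements is append-only');
END;
CREATE TRIGGER measurements_no_delete BEFORE DELETE ON measurements
BEGIN
    SELECT RAISE(ABORT, 'measurements is append-only');
END;
"""


def open_append_only(path=":memory:"):
    """Create a new database whose measurements table rejects updates and deletes."""
    connection = sqlite3.connect(path)
    connection.executescript(SCHEMA)
    return connection
```

In a server database such as PostgreSQL the same effect would more naturally come from granting only INSERT and SELECT to ordinary users.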

I admit that I never bothered about implementing full audit trails - and
the supervisors I had were already suspicious whether it is worth while
bothering to set up a data base and very much against "waste of time"
such as (for code projects) unit testing and encapsulating code in
packages/libraries/their own namespace...

I've met one research institute, though, that runs a full LIMS
(laboratory information management system) which however, is more suited
for situations where the same types of analyses are repeatedly done for
new samples rather than research questions where not only samples but
also analysis methods change from project to project.

But e.g. RedCap https://projectredcap.org/ produces data bases with
audit trails. (Never tried it, though).

Best,

Claudia
--
Claudia Beleites Chemometric Consulting
Södeler Weg 19
61200 Wölfersheim
Germany

phone: +49 (15 23) 1 83 74 18
e-mail: ***@chemometrix.eu
USt-ID: DE305606151

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M278751611e28840c648e49a9
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Tim Head via discuss

2018-07-21 17:29:31 UTC

Hello all,

In the hopes of making it easier to use osf.io with large datasets, last
summer we* had some time and funding to start building
http://osfclient.readthedocs.io/en/latest/cli-usage.html which is both a
command-line program and a Python library for osf.io. The tool works well
for gigabyte sized files and there is starting to be a small community of
people who contribute fixes and new features when something they need is
missing. It would be great to grow this further.

Maybe this removes that one last hurdle that was stopping you from putting
all your datasets on osf.io (when we asked about size limits they were
confident no one would ever reach them ... and I still don't know anyone
who has found one).

T

* we in this case is Titus Brown and me

On Sat, Jul 21, 2018 at 6:29 PM Claudia Beleites <

Post by Claudia Beleites
Hi all,
I'm also very interested in learning solutions for this.
- focus of project is coding (developing software/package/library) vs.
- focus of project is data analysis, with the sub-topic of projects
where various "slices" of the data are important.
**Code project**
I have one project where I use git-lfs on github (got a promo offer for
free use). The project is about *code* (R package) that however has some
100 MB binary data attached to it (it was larger at some point before I
could get smaller but equally suitable example files for some formats).
The binary data are example files in various file formats for the file
import filters the package provides. Initially, we had them in git as
well, but that horribly bloated the repo so it got unusable after a few
years. The files themselves, however, hardly need any versioning. I get
them and store them as they are, and only very occasionally is one of
those files replaced. The main point of the git-lfs storage is to make
sure that all files are where they are supposed to be without having too
much of manual hassle.
At some point I was lucky to get a github promo offer for free git-lfs
(test) usage and gave it a try - which is the current state.
- (due to free promo I don't have bandwidth billing trouble)
- Git is largely independent of git-lfs: you can still fork/clone the
git-only part of the repo and work with that. For the project in
question, the files stored in git-lfs are only needed for developing and
unit testing of file import filters, everything else does not need
git-lfs. I decided I don't want to force collaborators to install
git-lfs, so I set up the project in a way that e.g. the file filter unit
tests check whether those files are available and, if not, skip those
tests (visibly).
This does also make sense because of size restrictions for the R package
submission to CRAN, and as I'm the maintainer in the view of CRAN, I can
always make sure I properly run all tests.
- With this setup, I do not experience the collaboration trouble/broken
forking issues Peter Stéphane describes in the link in Carl's mail. At
least not for the parts of the project that are stored as "normal" git.
I've not yet had anyone trying to directly submit files that should go
into the lfs part of the repo.
- I tried to get git-lfs installed together with a private gitlab
instance (thinking we may want to use it for data-type projects), but
like Carl, I gave up. That was IIRC 3 years ago, so things may have
improved meanwhile.
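
The "skip visibly when the git-lfs files are missing" setup described above could be sketched roughly as follows. The project in question is an R package; this is only a Python analogue using the standard `unittest` module, and the path `testdata/example.spc` and the helper names are hypothetical. The one detail taken from git-lfs itself is that an un-fetched file is a small text pointer starting with `version https://git-lfs.github.com/spec/`.

```python
import os
import unittest

# Hypothetical location of one example file that lives in git-lfs.
EXAMPLE_FILE = os.path.join("testdata", "example.spc")

# Un-fetched git-lfs files are small text stubs that start with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/"


def looks_like_lfs_pointer(path):
    """Return True if the file is a git-lfs pointer stub, not real data."""
    with open(path, "rb") as fh:
        return fh.read(64).startswith(LFS_POINTER_PREFIX)


def example_file_available(path):
    """True only if the file exists and holds real content."""
    return os.path.isfile(path) and not looks_like_lfs_pointer(path)


class ImportFilterTest(unittest.TestCase):
    # The skip reason is printed by the test runner, so a missing data
    # file is reported visibly instead of being silently ignored.
    @unittest.skipUnless(
        example_file_available(EXAMPLE_FILE),
        "example file not available (git-lfs content not fetched?)",
    )
    def test_example_file_is_readable(self):
        with open(EXAMPLE_FILE, "rb") as fh:
            self.assertGreater(len(fh.read()), 0)
```

Run with `python -m unittest -v`, a clone without the git-lfs content reports the test as skipped together with the reason, while the maintainer's full checkout runs it.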
For other "code-type" projects (model/algorithm development), I tend to
take a two-layered approach. Data sets that are small enough to be
shipped as example and unit test data, say, in an R package are kept
with the code. In fact, many of them are toy data computed from code,
and I just store that code. The 2nd layer are well-known example data
sets, and there I simply rely on those data sets staying available. (I'm
talking e.g. the NASA AVIRIS data sets
https://aviris.jpl.nasa.gov/data/free_data.html)
(Side note: I'm somewhat wary of papers proposing their own new
algorithm solely on their own data set, and of algorithm comparisons
based on one or few data sets)
**Data Project**
This is where I think things could be improved :-)
The majority of projects I work on are data analysis projects. I.e. we
have measurement data, do an analysis and draw conclusions, write a
report or paper.
For these projects, we tend to take a "raw data and code are real"
approach that also implies that the raw data is never changed (with the
only exception of renaming files - but the files I'm thinking of store
their original name, so even that can be reconstructed). So we basically
have storage and distribution needs, but not really versioning needs. We
sometimes produce pre-processed intermediate data, but that again is
defined by the code that produces this data from the raw data, and the
results are considered temporary files. If I do manual curation (mostly
excluding bad runs with certain artifacts), I produce code or data files
that say which files were excluded and for what reason. Most of this can
be and is done in an automated fashion, though.
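
A minimal sketch of the "record exclusions as data, never touch the raw files" idea above. The file names, the reasons and the CSV layout are all made up for illustration; in practice the exclusion list would be a small text file kept next to the analysis code.

```python
import csv
import io

# Hypothetical exclusion list: one row per raw file that is left out,
# with the reason written down next to it.
EXCLUSIONS_CSV = """file,reason
run_007.dat,detector saturated
run_012.dat,cosmic ray spike
"""


def load_exclusions(text):
    """Parse the exclusion list into a {file name: reason} mapping."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["file"]: row["reason"] for row in reader}


def select_files(all_files, exclusions):
    """Return the raw files to analyse, keeping their original order."""
    return [name for name in all_files if name not in exclusions]


raw_files = ["run_006.dat", "run_007.dat", "run_008.dat", "run_012.dat"]
exclusions = load_exclusions(EXCLUSIONS_CSV)
kept = select_files(raw_files, exclusions)
```

The raw data stay untouched; what changes over time is only the small exclusion list, which is ordinary text and versions well in git.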
Producing versions of this that are to be kept (such as making snapshots
of the state of data for a paper) is sufficiently infrequent to just zip
those data and have the version in the file name.
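
The "zip it and put the version in the file name" snapshot could look roughly like this. It is a sketch only: the function name, the naming scheme `<name>_<version>.zip` and the added SHA-256 checksum are assumptions, not something described in the post.

```python
import hashlib
import os
import zipfile


def snapshot(data_dir, out_dir, name, version):
    """Zip data_dir into out_dir/<name>_<version>.zip.

    Returns the archive path and its SHA-256 hex digest. out_dir should
    lie outside data_dir, otherwise the archive would include itself.
    """
    archive = os.path.join(out_dir, f"{name}_{version}.zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(data_dir):
            dirs.sort()  # walk subdirectories in a stable order
            for fn in sorted(files):
                full = os.path.join(root, fn)
                # Store paths relative to data_dir inside the archive.
                zf.write(full, os.path.relpath(full, data_dir))
    digest = hashlib.sha256()
    with open(archive, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return archive, digest.hexdigest()
```

A call such as `snapshot("rawdata", "snapshots", "rawdata", "2018-07-paper")` (hypothetical names) would produce `snapshots/rawdata_2018-07-paper.zip`; noting the digest in the paper's lab notes makes it possible to check later that the archive is still the same one.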
Recently, I tend to use nextcloud to share such data. We did use git for
a while, but with large amounts of data that does become cumbersome, and
we found that few collaborators were willing to learn even just the
level of git that lets them clone and pull. Owncloud/Nextcloud is a much
lower barrier in that respect.
At the moment I think what I'd like to see would be nextcloud with
commits, ignores and maybe a somewhat more distributed and less central
approach ...
Versioning binary data would be far more important for colleagues who
extensively use GUI software for their analyses: not all of the relevant
software does keep logs/recovery data (some do, though, as they are to
be used in fields like pharma where full audit trails are required).
**Data Projects II**
(Here I see huge possibilities for improvement)
OTOH, we also have some projects where it is clear that a large variety
of subsets of the data is to be requested and analysed, and we've set up
data bases for those purposes. Here again, I do dumps/backups, and in
the rare occasion that a version should be tagged that can be done to
the backup/dump. Again, these data bases are set up in a way that easily
allows adding/inserting, but changing or deleting requires admin rights
- and admin should make sure of the backup before doing any such
"surgery" to the data base.
I may say that I'm originally from a wet-lab field (chemistry): I'm
trained to work under conditions where mistakes irretrievably mess up
things. Version control and being able to undo mistakes is good and
important, but if these techniques (luxuries?) are not available at
every point, that's as it is right now.
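
As an illustration of the insert-only data base setup mentioned above (adding is easy, changing or deleting is blocked), here is a small sketch using SQLite triggers. This is an assumption-laden stand-in: SQLite has no user accounts, so the rule is enforced for everyone by triggers, whereas a server data base would more likely use per-user privileges. The table and column names are hypothetical.

```python
import sqlite3


def open_append_only(path=":memory:"):
    """Open a data base whose measurements table accepts inserts only."""
    con = sqlite3.connect(path)
    con.executescript(
        """
        CREATE TABLE IF NOT EXISTS measurements (
            id     INTEGER PRIMARY KEY,
            sample TEXT NOT NULL,
            value  REAL NOT NULL
        );
        -- Any UPDATE or DELETE is aborted; an admin would have to drop
        -- these triggers deliberately (after a backup) to do "surgery".
        CREATE TRIGGER IF NOT EXISTS measurements_no_update
        BEFORE UPDATE ON measurements
        BEGIN
            SELECT RAISE(ABORT, 'measurements is append-only');
        END;
        CREATE TRIGGER IF NOT EXISTS measurements_no_delete
        BEFORE DELETE ON measurements
        BEGIN
            SELECT RAISE(ABORT, 'measurements is append-only');
        END;
        """
    )
    return con
```

Inserting rows works as usual, while an `UPDATE` or `DELETE` on the table raises an `sqlite3` error and leaves the data unchanged.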
I admit that I never bothered about implementing full audit trails - and
the supervisors I had were already suspicious whether it is worth while
bothering to set up a data base and very much against "waste of time"
such as (for code projects) unit testing and encapsulating code in
packages/libraries/their own namespace...
I've met one research institute, though, that runs a full LIMS
(laboratory information management system), which, however, is more suited
for situations where the same types of analyses are repeatedly done for
new samples rather than research questions where not only samples but
also analysis methods change from project to project.
But e.g. RedCap https://projectredcap.org/ produces data bases with
audit trails. (Never tried it, though).
Best,
Claudia
--
Claudia Beleites Chemometric Consulting
Södeler Weg 19
61200 Wölfersheim
Germany
phone: +49 (15 23) 1 83 74 18
USt-ID: DE305606151

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M4bc60498dcb8d4d88fce6cb6
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Claudia Beleites

2018-07-22 13:42:49 UTC

Permalink

Tim, (and everyone who just has the same questions here)

OSF definitely is something I'll check out.

However, I note the privacy document explicitly spells out that as a
US-based repo, it does not meet the requirements of EU privacy
legislation (and I've been working with sensitive/patient data, so
privacy and related security aspects are an important consideration).
Add to this the experience that some research labs prefer to
keep their data in-house. My guess is that a system that can be set up
in-house* would have much better chances to be approved by management
over here, also because of legal considerations.

* or in a DMZ, giving the chance to expose their public project parts or
running two instances, an internal one very much in-house and one for
public parts that is exposed.

As OSF states it is FOSS, this should be possible, but I did not
immediately see instructions "how to run on your own server" nor
technical requirements. Could you point me to such information, or is
there even something like a "we run our own instances" user group?

Many thanks,

Claudia


--
Claudia Beleites Chemometric Consulting
Södeler Weg 19
61200 Wölfersheim
Germany

phone: +49 (15 23) 1 83 74 18
e-mail: ***@chemometrix.eu
USt-ID: DE305606151

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mc4ed5415923925413699397a
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Tim Head via discuss

2018-07-22 18:56:50 UTC

Permalink

Hi Claudia,

On Sun, Jul 22, 2018 at 8:48 PM Claudia Beleites <

Post by Claudia Beleites
As OSF states it is FOSS, this should be possible, but I did not
immediately see instructions "how to run on your own server" nor technical
requirements. Could you point me to such information, or is there even
something like a "we run our own instances" user group?
I've not attempted to run a production grade instance of OSF.io myself.

Claudia Beleites

2018-07-23 06:14:45 UTC

Permalink

Hi Tim,

many thanks. I'll probably anyways start with a toy-level setup :-)

Best,

Claudia

Post by Tim Head via discuss
Hi Claudia,
On Sun, Jul 22, 2018 at 8:48 PM Claudia Beleites
As OSF states it is FOSS, this should be possible, but I did not
immediately see instructions "how to run on your own server" nor
technical requirements. Could you point me to such information, or
is there even something like a "we run our own instances" user group?
I've not attempted to run a production grade instance of OSF.io
myself. Last year when I did the work on osfclient I set up a
development instance on my laptop following
https://github.com/CenterForOpenScience/osf.io/blob/develop/README-docker-compose.md
which worked pretty well, but I think there is still some way to go to
get to a production level setup. Probably best to ask in an issue on
their repository.
T

--
Claudia Beleites Chemometric Consulting
Södeler Weg 19
61200 Wölfersheim
Germany

phone: +49 (15 23) 1 83 74 18
e-mail: ***@chemometrix.eu
USt-ID: DE305606151

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mdad31fde145a58dbe70142b2
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Terri Yu

2018-07-23 19:31:15 UTC

Permalink

I use Git LFS, because I have open source projects on GitHub and run
continuous integration testing on them. Git LFS is supported on GitHub and
the major continuous integration services (Travis, Appveyor). It seems to
be ok, except for having to pay for Git LFS bandwidth on GitHub when I'm
doing a lot of testing. Currently, I'm the only developer on my projects,
so I don't know how easy it would be for collaborators to use Git LFS.
Sometimes I have trouble with Git LFS not downloading the files when I
clone a Git repository and it is really annoying.

I haven't tried other large file versioning systems besides Git LFS.

Terri


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mce533b55e63c125a13f836fb
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Rémi Rampin

2018-07-23 19:49:52 UTC

Permalink


Please be wary of Git-LFS on GitHub. Their pricing model is kind of
twisted, and it is very possible that by using it, you are preventing
people from forking it altogether (everything gets counted towards the
upstream owners' quota, which might run out quickly). Also note that they
charge for bandwidth, so you might lock your project for everyone if you
reach your quota (or if someone in a fork makes you reach your quota).

Until GitHub fixes their pricing
<https://help.github.com/articles/about-storage-and-bandwidth-usage/>, it
is a pretty terrible option for all but private projects.

My own experience is that Git-LFS is a pretty good option (when used with
anything but GitHub, e.g. GitLab), much faster than alternatives if you
have to deal with a huge number of medium-sized files (ipfs, dat, dvc don't
do very well in that situation). It integrates with diff and merge. However
it does have some gotchas e.g. impossible to purge specific files from the
local cache.

It also doesn't have a good stand-alone open-source server implementation
at the moment, which might matter to you.

Cheers

--
Rémi

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mdfa6be610db0b6e0c8882c44
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Terri Yu

2018-07-23 20:31:51 UTC

Permalink


I just realized that I don't really have many large files.

I'm only using Git LFS on about 50 MB worth of files, and most of them are
about 1 MB in size except for one 29 MB file. I don't know if Git LFS is
the best option for my use case, but I was thinking ahead to when I might
have more of those ~30 MB json data files. Having only 50 MB of files makes
paying for bandwidth ok for now.
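
To get a quick picture of whether a repository really contains files large enough to justify Git LFS, a small survey script can help. This is a sketch under assumptions: the function name and the threshold are arbitrary choices for illustration, not a recommendation from the thread or from the Git LFS documentation.

```python
import os


def large_files(root, threshold_bytes):
    """List (size, relative path) for files at or above the threshold.

    The .git directory is skipped; results are sorted largest first.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # do not descend into git's own data
        for fn in filenames:
            full = os.path.join(dirpath, fn)
            size = os.path.getsize(full)
            if size >= threshold_bytes:
                hits.append((size, os.path.relpath(full, root)))
    return sorted(hits, reverse=True)
```

For example, `large_files(".", 10 * 1024 * 1024)` lists everything of 10 MB or more in the current working tree; if only one or two files show up and they rarely change, plain git may be enough.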

Terri

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M3e2baf52a414b62bbad0629c
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Waldman, Simon

2018-07-23 21:26:49 UTC

Permalink

If they're not changing very often, you could just use Git for that.


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M19c77a6c6d95c44f9b2eb888
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Belinda Weaver

2018-07-23 22:01:26 UTC

Permalink

Hi all
Australia are looking at this issue already. There is an RDA working group on data versioning: https://www.rd-alliance.org/groups/data-versioning-wg and they have developed use cases and are seeking new ones not already covered: https://docs.google.com/document/d/1TfBPlfjTVg0YcFxuw0UszAXPYrRmyZ6PCxtxKx8-uGg/edit#heading=h.41h61n5qswqc
Thanks to contributions from the RSE-AU-NZ mailing list (https://groups.google.com/forum/#!forum/rse-nz-au) for these ideas.

regards
Belinda
Belinda Weaver
Community and Communications Lead
The Carpentries
e: ***@carpentries.org | p: +61 408 841 882 | t: @cloudaus
------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Me8705f78e3fcf64c2b9a42b7
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Bruce Becker via discuss

2018-07-23 22:14:29 UTC

Permalink

Hi all!

There are some good ideas here. My personal experience leads me to believe
that there are many options for communities, but it really depends on what
they try to optimise for.

I would start with CVMFS - this is a highly efficient way of delivering
data, and has very good versioning capabilities. You do however need some
infrastructure to use it - but you could also host one yourself as a
project.

The promise of versioning for data in containers died prematurely with the
halt of Flocker https://github.com/ClusterHQ/flocker - This had some great
promise, but for "reasons" the project died.

Pachyderm comes in close in this space I think - https://www.pachyderm.io/ -
but it's a domain-specific tool. I don't know if it can be re-used for
other purposes, I'd love to see someone try.

Finally, I've had some fun with https://data.world/ over the last few
months. I've heard some very good things about Azure's Machine Learning
Dashboard (I think it's called that?) which has some good versioning
functionality as well.

All in all, this is a really good thing to be discussing. Ideally,
researchers should have infrastructure available to them to manage the
versioning of their data. For those who aren't physicists (who have all the
money and hence all the nice things), there is EUDAT which provides a
handle service to research data. One of EGI's data offerings
https://datahub.egi.eu will soon be able to assign and manage PIDs
associated with research data too.

You can play around with these at the moment - just order them from the EGI
or EOSC catalogue - marketplace.egi.eu or marketplace.eosc-hub.eu

Cheers!
Bruce


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M579b5f99185ab6b5282f7e45
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

James Tocknell via discuss

2018-07-24 01:29:09 UTC

Permalink

For my fluid dynamics simulations, I've been using git annex with cloudstor
(https://www.aarnet.edu.au/network-and-services/cloud-services-applications/cloudstor;
I've used the webdav interface, as that was easier than requesting S3
access to the files), which can be thought of as an Australia-wide owncloud.
The outputs I produce are somewhere between a few hundred megabytes and a
few tens of gigabytes (size variation is due to length of time run and how
much debugging information is stored), stored in HDF5 files. I've found
that, when dealing with laptops which may not have large amounts of space
for storing data, it is very useful to be able to grab only the files I
need and have the rest stored on cloudstor (and on desktops and NASs, as
you can instruct git annex to ensure there are at least N copies of files
stored).

One thing you do want to ensure is that backups are being made of any data
you produce. I've had friends lose terabytes of simulation runs because
there was confusion over who was responsible for backups.

James


--
Don't send me files in proprietary formats (.doc(x), .xls, .ppt etc.). It
isn't good enough for Tim Berners-Lee
<http://opendotdotdot.blogspot.com/2010/04/rms-and-tim-berners-lee-separated-at.html>,
and it isn't good enough for me either. For more information visit
http://www.gnu.org/philosophy/no-word-attachments.html.

Truly great madness cannot be achieved without significant intelligence.
- Henrik Tikkanen

If you're not messing with your sanity, you're not having fun.
- James Tocknell

In theory, there is no difference between theory and practice; In practice,
there is.

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M106181f16a0ef0dcaa02da41
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Brian Ballsun-Stanton

2018-07-22 23:12:23 UTC

Permalink

Hey Claudia,

I know we've reached out to the OSF for the same reason (data sovereignty rules in Australia). Ping me outside of this thread and I'll connect you with the person in charge of poking our institutional data repository. It's a bit larger in scope than what you're thinking of, but I suspect much of our discussion is commutable.

Cheers,

-Brian

________________________________
From: Claudia Beleites <***@chemometrix.eu>
Sent: Sunday, 22 July 2018 11:42:49 PM
To: Tim Head via discuss
Subject: Re: [discuss] Version control and collaboration with large datasets.

Tim, (and everyone who just has the same questions here)
OSF definitely is something I'll check out.

However, I note the privacy document explicitly spells out that, as a US-based repo, it does not meet the requirements of EU privacy legislation (and I've been working with sensitive/patient data, so privacy and related security aspects are an important consideration). Add to this the experience that some research labs prefer to keep their data in-house. My guess is that a system that can be set up in-house* would have a much better chance of being approved by management over here, also because of legal considerations.

* or in a DMZ, giving the chance to expose their public project parts or running two instances, an internal one very much in-house and one for public parts that is exposed.

As OSF states it is FOSS, this should be possible, but I did not immediately see instructions on "how to run on your own server" nor technical requirements. Could you point me to such information, or is there even something like a "we run our own instances" user group?

Many thanks,

Claudia

On 21.07.2018 at 19:29, Tim Head via discuss wrote:
Hello all,

in the hopes of making it easier to use osf.io with large datasets, last summer we* had some time and funding to start building http://osfclient.readthedocs.io/en/latest/cli-usage.html which is both a command-line program and a Python library for osf.io. The tool works well for gigabyte-sized files and there is starting to be a small community of people who contribute fixes and new features when something they need is missing. It would be great to grow this further.

Maybe this removes that one last hurdle that was stopping you from putting all your datasets on osf.io (when we asked about size limits they were confident no one would ever reach them ... and I still don't know anyone who has found it)

T

* we in this case is Titus Brown and me

On Sat, Jul 21, 2018 at 6:29 PM Claudia Beleites <***@chemometrix.eu> wrote:
Hi all,

I'm also very interested in learning solutions for this.

At the moment I distinguish two use cases:

- focus of project is coding (developing software/package/library) vs.

- focus of project is data analysis, with the sub-topic of projects
where various "slices" of the data are important.

**Code project**

I have one project where I use git-lfs on github (got a promo offer for
free use). The project is about *code* (R package) that however has some
100 MB binary data attached to it (it was larger at some point before I
could get smaller but equally suitable example files for some formats).
The binary data are example files in various file formats for the file
import filters the package provides. Initially, we had them in git as
well, but that horribly bloated the repo so it got unusable after a few
years. The files themselves, however, hardly need any versioning. I get
them and store them as they are, and only very occasionally is one of
those files replaced. The main point of the git-lfs storage is to make
sure that all files are where they are supposed to be without too
much manual hassle.
At some point I was lucky to get a github promo offer for free git-lfs
(test) usage and gave it a try - which is the current state.

Experiences:

- (due to free promo I don't have bandwidth billing trouble)

- Git is largely independent of git-lfs: you can still fork/clone the
git-only part of the repo and work with that. For the project in
question, the files stored in git-lfs are only needed for developing and
unit testing of file import filters, everything else does not need
git-lfs. I decided I don't want to force collaborators to install
git-lfs, so I set up the project in a way that e.g. the file filter unit
tests check whether those files are available and, if not, skip those
tests (visibly).
This does also make sense because of size restrictions for the R package
submission to CRAN, and as I'm the maintainer in the view of CRAN, I can
always make sure I properly run all tests.
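
The "skip visibly when the example files are missing" setup described above might look roughly like the following. This is only a sketch in Python rather than R; the directory name, the file name and the test names are hypothetical, and availability is assumed to mean nothing more than "the file exists on disk".

```python
import unittest
from pathlib import Path

# Hypothetical location of the large example files kept in git-lfs.
EXAMPLE_DIR = Path("tests/example-files")


def have_example(name):
    """Return True if the named example file is present on disk."""
    return (EXAMPLE_DIR / name).is_file()


class ImportFilterTests(unittest.TestCase):
    # Skipped (and reported as skipped) when the example file was not fetched.
    @unittest.skipUnless(
        have_example("spectrum.spc"),
        "example file spectrum.spc not available (git-lfs data not fetched)",
    )
    def test_import_filter_reads_example_file(self):
        data = (EXAMPLE_DIR / "spectrum.spc").read_bytes()
        self.assertGreater(len(data), 0)

    # Tests that need no example files always run.
    def test_code_only_path(self):
        self.assertEqual(sorted([3, 1, 2]), [1, 2, 3])
```

Running this with "python -m unittest -v" lists the skipped test together with its reason, so a collaborator without git-lfs can see what was not checked.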

- With this setup, I do not experience the collaboration trouble/broken
forking issues Peter Stéphane describes in the link in Carl's mail. At
least not for the parts of the project that are stored as "normal" git.
I've not yet had anyone trying to directly submit files that should go
into the lfs part of the repo.

- I tried to get git-lfs installed together with a private gitlab
instance (thinking we may want to use it for data-type projects), but
like Carl, I gave up. That was IIRC 3 years ago, so things may have
improved meanwhile.

For other "code-type" projects (model/algorithm development), I tend to
take a two-layered approach. Data sets that are small enough to be
shipped as example and unit test data, say, in an R package are kept
with the code. In fact, many of them are toy data computed from code,
and I just store that code. The 2nd layer are well-known example data
sets, and there I simply rely on those data sets staying available. (I'm
talking e.g. the NASA AVIRIS data sets
https://aviris.jpl.nasa.gov/data/free_data.html)
(Side note: I'm somewhat wary of papers proposing their own new
algorithm solely on their own data set, and of algorithm comparisons
based on one or few data sets)

**Data Project**

This is where I think things could be improved :-)

The majority of projects I work on are data analysis projects. I.e. we
have measurement data, do an analysis and draw conclusions, write a
report or paper.

For these projects, we tend to take a "raw data and code are real"
approach that also implies that the raw data is never changed (with the
only exception of renaming files - but the files I'm thinking of store
their original name, so even that can be reconstructed). So we basically
have storage and distribution needs, but not really versioning needs. We
sometimes produce pre-processed intermediate data, but that again is
defined by the code that produces this data from the raw data, and the
results are considered temporary files. If I do manual curation (mostly
excluding bad runs with certain artifacts), I produce code or data files
that say which files were excluded and for what reason. Most of this can
be and is done in an automated fashion, though.
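
As an illustration of keeping the curation decisions in a data file while leaving the raw data untouched, here is a minimal Python sketch. The CSV layout (columns "file" and "reason") and the function names are assumptions made for this example, not a description of the actual projects.

```python
import csv
from pathlib import Path


def write_exclusions(log_path, exclusions):
    """Write a {file name: reason} mapping to a two-column CSV file."""
    with open(log_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "reason"])
        for name in sorted(exclusions):
            writer.writerow([name, exclusions[name]])


def read_exclusions(log_path):
    """Read the exclusion log back into a {file name: reason} mapping."""
    with open(log_path, newline="") as fh:
        return {row["file"]: row["reason"] for row in csv.DictReader(fh)}


def files_to_analyse(raw_dir, exclusions):
    """List raw file names that are not excluded; raw files stay unchanged."""
    return sorted(
        p.name
        for p in Path(raw_dir).iterdir()
        if p.is_file() and p.name not in exclusions
    )
```

The exclusion log is small and text-based, so it can live in git next to the analysis code even when the raw data cannot.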

Producing versions of this that are to be kept (such as making snapshots
of the state of data for a paper) is sufficiently infrequent to just zip
those data and have the version in the file name.
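
A snapshot of this kind could be scripted along the following lines. This is a sketch only: the naming pattern "<label>_v<version>.zip" and the function name are made up for illustration, and the output directory is assumed to lie outside the data directory.

```python
import zipfile
from pathlib import Path


def snapshot(data_dir, out_dir, label, version):
    """Zip everything under data_dir into out_dir/<label>_v<version>.zip."""
    data_dir = Path(data_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{label}_v{version}.zip"
    # Never overwrite an existing snapshot: a version name is used only once.
    if archive.exists():
        raise FileExistsError(f"snapshot already exists: {archive}")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(data_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to data_dir so the archive is relocatable.
                zf.write(path, path.relative_to(data_dir).as_posix())
    return archive
```

Refusing to overwrite is the one design choice worth keeping whatever the naming pattern: once a version has gone into a paper, the file behind that name should never change.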

Recently, I tend to use nextcloud to share such data. We did use git for
a while, but with large amounts of data that does become cumbersome, and
we found that few collaborators were willing to learn even just the
level of git that lets them clone and pull. Owncloud/Nextcloud is a much
lower barrier in that respect.

At the moment I think what I'd like to see would be nextcloud with
commits, ignores and maybe a somewhat more distributed and less central
approach ...

Versioning binary data would be far more important for colleagues who
extensively use GUI software for their analyses: not all of the relevant
software does keep logs/recovery data (some do, though, as they are to
be used in fields like pharma where full audit trails are required).

**Data Projects II**

(Here I see huge possibilities for improvement)

OTOH, we also have some projects where it is clear that a large variety
of subsets of the data is to be requested and analysed, and we've set up
data bases for those purposes. Here again, I do dumps/backups, and in
the rare occasion that a version should be tagged that can be done to
the backup/dump. Again, these data bases are set up in a way that easily
allows adding/inserting, but changing or deleting requires admin rights
- and admin should make sure of the backup before doing any such
"surgery" to the data base.
I may say that I'm originally from a wet-lab field (chemistry): I'm
trained to work under conditions where mistakes irretrievably mess up
things. Version control and being able to undo mistakes is good and
important, but if these techniques (luxuries?) are not available at
every point, that's as it is right now.

I admit that I never bothered about implementing full audit trails - and
the supervisors I had were already suspicious whether it is worth while
bothering to set up a data base and very much against "waste of time"
such as (for code projects) unit testing and encapsulating code in
packages/libraries/their own namespace...

I've met one research institute, though, that runs a full LIMS
(laboratory information management system) which however, is more suited
for situations where the same types of analyses are repeatedly done for
new samples rather than research questions where not only samples but
also analysis methods change from project to project.

But e.g. RedCap https://projectredcap.org/ produces data bases with
audit trails. (Never tried it, though).

Best,

Claudia

--

Claudia Beleites Chemometric Consulting
Södeler Weg 19
61200 Wölfersheim
Germany

phone: +49 (15 23) 1 83 74 18
e-mail: ***@chemometrix.eu
USt-ID: DE305606151


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M3d0f5f1ce18277b9088f646f
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

l***@cern.ch

2018-07-23 13:13:59 UTC

Permalink

Hi,

My five cents is, that it really depends on the characteristics of your data (e.g. size) and the goal you try to achieve by versioning your data.

Examples:

Size: If e.g. the datasets are "small", they can easily be handled by git. For larger datasets, it depends on what is important to you. E.g. a shared network file system with proper backup and well-defined naming scheme can be totally fine in some cases, while a proper data repository issuing DOIs or similar is needed in other cases. If synchronization speed, as well as optimized storage, is important, something like dat or IPFS is advisable.

Purpose: Similarly, if your goal is to share data with collaborators, then a simple HTTPS link is the easiest (hosted on e.g. GitHub, AWS, or a data repository).

Cheers,
Lars

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mf400c15b6dcfe37cf5d5379d
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Jon Pipitone

2018-07-25 15:19:45 UTC

Permalink

Post by l***@cern.ch
My five cents is, that it really depends on the characteristics of your
data (e.g. size) and the goal you try to achieve by versioning your
data.

+1 to thinking carefully about what your goals are here before jumping
to any particular tool.

My experience: I found myself re-organizing all my lab's neuroimaging
data starting from data collected when it was a single grad student up
to when it was housing data from multiple studies and multiple sites of
data collection. We opted to begin by first organizing the data with a
sensible naming scheme on a shared drive, as Lars describes, because it
was immediately accessible to everyone in the lab regardless of their
tech know-how, and was also a necessary starting point regardless of
whether we later adopted a fancier data versioning/sharing technology. We
did later use a neuroimaging-specific system for sharing our data with
others, but retained the filesystem organization in addition because it
was familiar, and so darn convenient for scripting, documentation, etc.
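
To illustrate why a consistent naming scheme is convenient for scripting, here is a small Python sketch. The scheme itself ("<study>_<site>_sub<NNN>_<YYYYMMDD>.<ext>") is invented for this example and is not the one used in the lab described above.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical scheme: <study>_<site>_sub<NNN>_<YYYYMMDD>.<ext>
PATTERN = re.compile(
    r"^(?P<study>[a-z0-9]+)_(?P<site>[a-z0-9]+)"
    r"_sub(?P<subject>\d{3})_(?P<date>\d{8})\.(?P<ext>[a-z0-9.]+)$"
)


class ScanName(NamedTuple):
    study: str
    site: str
    subject: int
    date: str
    ext: str


def parse_name(filename: str) -> Optional[ScanName]:
    """Split a conforming file name into its parts, or return None."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    return ScanName(m["study"], m["site"], int(m["subject"]), m["date"], m["ext"])
```

With something like this, a script can select "all scans from one site" or flag files that do not follow the scheme without any extra metadata store.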

Jon.


Brian Ballsun-Stanton

2018-07-30 23:09:05 UTC

Permalink

Since this directly applies, I saw this note in my RSS reader (newsblur) today: https://blog.github.com/2018-07-30-git-lfs-2.5.0-now-available/

Specifically, their "git lfs migrate import --fixup", which allows for dealing with that dreaded "You cannot push to (whatever) because your repository is too big." Also, because this isn't advertised anywhere: if you authenticate with education.github.com, you can have unlimited private repos and space.

(And, since it trips me up every time, remember that after "installing" git-lfs from packagecloud on Linux, you still have to apt install git-lfs, since packagecloud only adds the repo)


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M37515a5553a4c80373ac40d0
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

i***@sesync.org

2018-08-02 14:54:36 UTC

Permalink

Since this thread was highlighted in yesterday's Carpentry Clippings, I'll bet I'm not the last to jump in today, so I'll be brief.

DVC <http://github.com/iterative/dvc.git> was mentioned at the beginning, but I gather few here have given it a try. I encourage you to take a look. The tool is still in alpha, but developing quickly with a lot of potential. What I like about DVC:
* Works in parallel to git and is similar to git LFS in cloning/pushing/pulling references to data files
* Data files are not tracked by git; your code repository remains just that
* Supports external data sources (since 0.10.0 <https://github.com/iterative/dvc/releases/tag/0.10.0>); do you really want a copy of your data *within* every repo that reads it?
* Supports multiple cloud data sources (e.g. Amazon S3)
* Does not default to "publishing" data on GitHub. GitHub is no Dataverse or Figshare (... data discoverability, yada yada)
* It's a makefile alternative too!

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Ma216656f062405087a5f69ae
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Dav Clark via discuss

2018-08-02 16:04:44 UTC

Permalink

Thank you for re-raising DVC. I hadn't looked seriously at it, but it
seems a nice balance between git LFS and git annex. I'm recommending
that our team look at it for potential inclusion in Gigantum.

While I'm at it, I think now is also a good time to pitch the
graphical tool we're building: https://gigantum.com

I'm a little hesitant, but I've checked in-person with a few deeply
invested folks in the community and it seems reasonable to "advertise"
this on the list - our client is open source (and always will be) and
we are hoping we can build a sustainable model where scientists and
educators can use the thing for free (much like GitHub has done -
though I understand not everyone loves GitHub either).

We are still in beta, but we've put together something that we believe
does a good job of managing git and docker, along with a cloud
synchronization back-end. While this may rankle more experienced
coders, I think it can potentially empower folks who won't (or can't)
learn the whole set of data / software management skills. Part of
being in Beta is that the feature set is still somewhat open and I'd
truly love to get input from other carpentry instructors on this. A
concrete idea I have is that you could save 1/4 of the time if you
don't need to teach command line git, and then perhaps you could get
more students to the point that they wrote good functions.

Triggered by this thread, the data management piece is of particular
interest. Currently, we have a relatively coarse set of options - each
project is scoped to a docker-bind-mounted folder on your host OS.
There are "input" and "output" directories there, managed via git LFS
by default. However, you can also disable tracking of these
directories and then manage them however you like (you shouldn't
currently use a strategy that extends the existing git repo, like doing
LFS yourself - though now that I think of it, manual git annex MIGHT
work... I'll have to check when I get some time). One of the major
hobgoblins is the Windows filesystem, of course... and we could
potentially eliminate that by shifting to docker volumes instead of
bind-mounts (but then you lose host OS access).

Kicking the tires is super easy via the demo server link, and all you
need to use it locally is download the electron GUI or alternatively
install a pip package. It would be great to get input on what would be
valuable and if folks would be interested in talking about using this
in workshops (or just developing resources around transparent and open
science strategies and tools), please let me know! I will of course
support anyone who is interested in working with Gigantum and would
love to run some workshops in partnership with some other folks (we're
currently working on an open/reproducible neuro workshop with folks at
Stanford and Columbia - so if anyone is interested in that
specifically, please let me know soon!).

Best,
Dav

ps - To be clear, it's super-easy to walk away from Gigantum.
Everything is in a git repo, and the Dockerfile is usable (with a bit
of work - which I'd be happy to walk people through) outside of the
platform.


------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M6ec28f85bc59ba4ad6e66d6b
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

Rémi Rampin

2018-08-02 18:33:23 UTC

Permalink

DVC looks nice and easy to grasp, and is not that far from Git-LFS. Being
able to use S3 or whatever else for storage is huge, because there are very
few options for LFS servers (the only open-source option I know of is built
into GitLab). Adding new backends looks very straightforward (easier than
patching git-annex).

Its tracking of workflows might or might not matter to you. It's nice
to have, and definitely useful if your project happens to be a data science
kind of workflow, but if you just need to share data files you won't use
it. But it won't get in the way.

Likewise, absence of integration with Git might or might not be a good
thing. It is nice to be able to see changes to CSVs right from git-diff
when using LFS. If your files are not diffable, you won't miss it. Git
operations are certainly faster without this machinery.

I personally like that the pointer files (.dvc) have a different filename
than the data files. Having both share one name causes me constant headaches
when using Git-LFS (do I need to "lfs checkout"? Do I need to "git reset" the
data out of my Git index?).
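
One way to take some guesswork out of the Git-LFS case is to check whether a file in the working tree is still a pointer stub. The Python sketch below assumes the pointer layout as I understand it from the Git LFS specification (a small text file whose first line is "version https://git-lfs.github.com/spec/v1", followed by "oid" and "size" lines); the function name is made up for this example.

```python
from pathlib import Path

# First line of a Git LFS pointer file, per the LFS pointer spec (assumption).
LFS_SPEC_LINE = b"version https://git-lfs.github.com/spec/v1"
# Pointer files are small text stubs; anything this large is real data.
MAX_POINTER_SIZE = 1024


def parse_lfs_pointer(path):
    """Return {'oid': ..., 'size': ...} if path is an LFS pointer, else None."""
    path = Path(path)
    if path.stat().st_size >= MAX_POINTER_SIZE:
        return None
    lines = path.read_bytes().splitlines()
    if not lines or lines[0] != LFS_SPEC_LINE:
        return None
    try:
        # Remaining lines are "key value" pairs separated by one space.
        fields = dict(
            line.decode("utf-8").split(" ", 1) for line in lines[1:] if line
        )
        return {"oid": fields["oid"], "size": int(fields["size"])}
    except (UnicodeDecodeError, ValueError, KeyError):
        return None
```

If this returns a dictionary for a file you expected to contain data, the real content has not been downloaded or checked out yet.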

DVC seems close to Datalad, which has been mentioned once in this thread.
Has anyone here used that in practice? It seems to be a more complex
option, though it might be more powerful. It seems more deeply integrated
with Git, in that runs will create commits and branches directly, more than
just updating .dvc files that it is your responsibility to check in.

Best

--
Rémi

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M096cd2663242ccc1a93693ca
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

James Tocknell via discuss

2018-08-03 01:03:15 UTC

Permalink

I've used datalad (as it's a fairly thin wrapper around git annex),
though I haven't really pushed the reproducibility parts of it (my
main use for it is connecting different repositories together when
pushing). "datalad run $script" is, for me, just a convenience over
running the script and then adding the output file (though I can see that
if you were creating a large number of files in each run, datalad run would
be a major improvement). The unique feature of datalad seems
to be the ease with which subdatasets can be managed (e.g.
http://datasets.datalad.org/?dir=/openfmri is a single dataset, with
quite a number of subdatasets), which means it's likely something I'm
going to use more often.

James

Post by i***@sesync.org

Post by i***@sesync.org
Since this thread was highlighted in yesterday's Carpentry Clippings, I'll
bet I'm not the last to jump in today, so I'll be brief.
DVC was mentioned at the beginning, but I gather few here have given it a
try. I encourage you to take a look. The tool is still in alpha, but
Works in parallel to git and is similar to git LFS in
cloning/pushing/pulling references to data files
Data files are not tracked by git; your code repository remains just that
Supports external data sources (since 0.10.0); do you really want a copy
of your data *within* every repo that reads it?
Supports multiple cloud data sources (e.g. Amazon S3)
Does not default to "publishing" data on GitHub. GitHub is no Dataverse or
Figshare (... data discoverability, yada yada)
It's a makefile alternative too!

DVC looks nice and easy to grasp, and is not that far from Git-LFS. Being
able to use S3 or whatever else for storage is huge, because there are very
few options for LFS servers (only opensource option I know of is built into
GitLab). Adding new backends looks very straightforward (easier than
patching git-annex).
It keeping track of workflows might or might not matter to you. It's nice to
have, and definitely useful if your project happens to be a data science
kind of workflow, but if you just need to share data files you won't use it.
But it won't get in the way.
Likewise, absence of integration with Git might or might not be a good
thing. It is nice to be able to see changes to CSVs right from git-diff when
using LFS. If your files are not diffable, you won't miss it. Git operations
are certainly faster without this machinery.
I personally like that the pointer files (.dvc) have a different filename
than the data files. This causes me constant headaches when using Git-LFS
(do I need to "lfs checkout"? Do I need to "git reset" the data out of my
Git index?).
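
To make the pointer-file idea concrete, here is a minimal Python sketch. It is not DVC's or Git-LFS's actual format: the data file's content is copied into a hash-addressed cache, and a small sidecar file with a different name records the hash. The function names `track` and `restore`, the `.ptr` suffix, and the JSON layout are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a hash-addressed cache and write a sidecar pointer."""
    # Reads the whole file at once for brevity; chunk the read for very large files.
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cache_dir / digest)
    # Hypothetical ".ptr" suffix: the pointer has its own name, so it can
    # never be mistaken for the data file it describes.
    pointer = data_file.with_name(data_file.name + ".ptr")
    pointer.write_text(json.dumps({"sha256": digest, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file next to its pointer from the cache."""
    info = json.loads(pointer.read_text())
    target = pointer.with_name(info["path"])
    shutil.copyfile(cache_dir / info["sha256"], target)
    return target
```

Only the small pointer file would be committed to Git; the data file itself would be ignored by Git and fetched from the cache on demand.
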
DVC seems close to Datalad, which has been mentioned once in this thread.
Has anyone here used that in practice? It seems to be a more complex option,
though it might be more powerful. It seems more deeply integrated with Git,
in that runs will create commits and branches directly, more than just
updating .dvc files that it is your responsibility to check in.
Best
--
Rémi

--
Don't send me files in proprietary formats (.doc(x), .xls, .ppt etc.).
It isn't good enough for Tim Berners-Lee, and it isn't good enough for
me either. For more information visit
http://www.gnu.org/philosophy/no-word-attachments.html.

Truly great madness cannot be achieved without significant intelligence.
- Henrik Tikkanen

If you're not messing with your sanity, you're not having fun.
- James Tocknell

In theory, there is no difference between theory and practice; In
practice, there is.

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mc391b14e70952e72cff01775

thompson.m.j via discuss

2018-08-10 21:22:43 UTC

Permalink

There is a lot of information here. Thanks to everyone offering insights! I'll offer some more context and detail for those who asked for my motivation for posting.

As a student in an academic science lab that uses computers and code to science, I am interested in learning and adopting the tools and practices software developers use for their work to make *good* code that's easy to share, easy for other lab members and collaborators to read and pick up, and that's resilient to my screw-ups. I.e. I'd like to do things *right* even though I don't know about anyone else on campus who does. I am very motivated by the open science phenomenon, and want the tools that are necessary to be a part of that as well.

I learned git (to a point), so that's cool. Now I'm trying to prod my lab mates and advisor to pick it up too. I also started thinking, "Well what about the data? I could just gitignore it all, but sometimes it changes, branches, and needs to be reset too. And it'd be great if I didn't have to track that all by file names." In my current case, I'm using large (>100MB) image stacks. Versioning in this sense would ideally look something like recording a macro to track the operations done (basically diffs) between one version and the next. Probably technically impossible actually... Other data includes analysis and simulation data (.csv, .mat, etc.). This was when I posted this question.

Currently, I am foraying into transitioning from having all data organized next to everything else in a file system to integrating it into databases. I am new to the database universe, so forgive me for any improper understandings here. I'm averse to SQL because I am certain that a single table would have tons of blanks, and I don't like the idea of complicated joins. I am a believer that all data should be dynamic, and by that I mean I have a vague notion that any new (or really old) data should be able to be integrated into a data model to further inform the analysis. MongoDB strikes me as a useful tool for just about all scenarios in this respect.

The data repositories suggested here are certainly useful (particularly OSF), but that brings up another issue I've been thinking about, which is discoverability. As an exemplar of the kind of solution to this problem I'm interested in, take a startup company I recently learned about called BenchSci <https://www.benchsci.com/>. Though they still have errors in the reported data, they are trying to solve a big problem in data discoverability regarding the use of antibodies in research. They're making a one-stop-shop where you can see vendor data and publication data for antibodies and targets, seriously reducing the leg work needed to hunt for all this information manually, and making it less likely that a good option will go undiscovered. Back to the more general data question, with so many repository options and so many formats, they all need to be tied together somehow. There should also be a way to incorporate 'legacy' data to get data that's currently only available behind a paywall as a crappy jpg in supplemental figure 17 ... but the high res raw data might still exist on a hard drive somewhere and might be useful for some other analysis not done in the original paper.

Obviously I'm starting to get a bit ahead of myself. I have a hard time not getting carried away.
------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Ma044d0880bb7896449f24aed

s***@onerussian.com

2018-09-06 16:03:10 UTC

Permalink

Really nice to see DataLad being mentioned!
Disclaimer: I am one of the DataLad founders/developers, so my opinions are obviously biased.

First of all, I want to say that the development of DataLad was largely inspired by both software distributions (and our experience in the Debian and NeuroDebian projects), which make software tools readily available with clean versioning and through a unified interface, and version control systems, which we have been using for all daily work for probably almost two decades by now (started to use Git in 2007, before that CVS and then SVN, a little bzr). With DataLad we first of all wanted to establish a data distribution, but because we use and love Git, while working on DataLad we realized that we now had a full-blown data version control/management system, rather than just a "data distribution".

Let me go through some bullet points, which relate to other posts in the thread, in the hope that they might be informative and help you make an informed decision on the choice of a version control system for data. This is by no means a complete presentation of git-annex and/or DataLad. I would refer you to our (possibly incomplete) documentation and examples at http://datalad.org .

But whatever system you choose, my main message would be
*"PLEASE DO VERSION CONTROL YOUR DATA!"*

and do not leave data a "second-class citizen" among the digital objects of your research/work.

* "Distributed data" - the main reason for choosing git-annex was to be able to provide access to data that is already publicly available.

When we started working on DataLad, there was no other general solution which would make it possible to "hook" into existing data portals. We simply could not duplicate all the data on our server for re-distribution, as we do with software in e.g. Debian. ATM almost all data for datasets provided from http://datasets.datalad.org come from a wide variety of original locations (S3 buckets, http websites, custom data portals, ...) through the unified git-annex/DataLad interfaces. The datalad crawl command can be used to quickly establish a git/git-annex repository from a standard S3 bucket or a simple website, or you can provide custom crawling pipelines (https://github.com/datalad/datalad-crawler). The majority of the datasets on http://datasets.datalad.org are created and updated via the datalad crawl command, and later published to that website via ssh using datalad publish. So you could get yourself a similar "data portal" within minutes if you have something to share.

* *Experimentation*

We often love git for code development since it allows us to experiment easily: create a branch, throw new functionality against the wall, see if it sticks, and if it does, merge. The same practice often applies to data analyses: we want to try different processing parameters, a new algorithm, etc. Keeping incoming data, code, and output results under the same VCS makes it possible to establish a clear track of how any specific results were obtained. The beauty of git-annex (and DataLad ;-)) is that you still use Git while working with your data. Some git functions listed below become a godsend for experimentation:

** git checkout BRANCH

I guess I need not describe the utility of this functionality in git. But what is great when working with git-annex is that checkouts of alternative branches are SUPER FAST regardless of how big your data files (under git-annex control) are, because they are just symlinks. You can also literally re-layout the entire tree within seconds if data files are annotated with metadata in git-annex. If you would like to see/try it, just do

   git clone http://datasets.datalad.org/labs/openneurolab/metasearch/.git
   cd metasearch; ls # or tree
   git annex view species=* sex=* handedness=*
   ls # or tree
   git checkout master # to appreciate the speed
   git checkout -

So you could "explore" and manipulate the dataset even without fetching any data yet (use datalad get or git-annex get to get files of interest).
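
The idea behind a metadata-driven view can be sketched in a few lines of Python. This is not how git-annex implements `git annex view`; it only illustrates why re-laying out a tree is cheap when the working tree holds links rather than file content. The function `build_view`, the metadata dictionary, and the choice to leave out files that lack a requested field are assumptions made for this example.

```python
import os
from pathlib import Path


def build_view(metadata: dict, store: Path, view: Path, fields: list) -> list:
    """Lay out symlinks under view/<value of field 1>/<value of field 2>/...

    metadata maps a file name in `store` to a dict of field -> value.
    Files missing any requested field are left out of the view.
    """
    created = []
    for name, meta in sorted(metadata.items()):
        if not all(field in meta for field in fields):
            continue
        folder = view.joinpath(*(meta[field] for field in fields))
        folder.mkdir(parents=True, exist_ok=True)
        link = folder / name
        # Only a tiny link is written; the data itself is never copied.
        os.symlink((store / name).resolve(), link)
        created.append(link)
    return created
```

Calling `build_view(meta, store, view, ["species", "sex"])` and then again with a different field order on a fresh `view` directory produces two different layouts of the same files at the cost of a few links each.
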

** git reset --hard SOMEWHEREINTHEPAST is great!

So many times I do something, possibly still in master, and then want to get rid of it, or rerun it.
git reset --hard is my friend, and it works just wonderfully with git-annexed files -- super fast, etc. git clean -dfx helps to keep everything tidy.

** datalad run (mentioned above) and datalad rerun commands

I use datalad run more and more now, whenever I get any outputs produced by running a command. It just makes it so easy to make a clear record in the git history identifying what command led to that change. datalad rerun could then be used to reconstruct the entire history (merges support is WiP), or to rerun commands on top of the current point in git, in case my tool changed or I am in a different environment.
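
As a rough illustration of what recording a command alongside its outputs buys you, the Python sketch below runs a command, hashes the declared output files, and returns a record that can later be replayed and compared. It is not DataLad's implementation or record format; `run_and_record`, `rerun_matches`, and the record fields are names invented for this example.

```python
import hashlib
import subprocess
from pathlib import Path


def _digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_and_record(cmd: list, outputs: list, cwd: Path) -> dict:
    """Run cmd in cwd and record the command plus a hash of each output file."""
    subprocess.run(cmd, cwd=cwd, check=True)
    return {
        "cmd": cmd,
        "outputs": {name: _digest(cwd / name) for name in outputs},
    }


def rerun_matches(record: dict, cwd: Path) -> bool:
    """Replay the recorded command and report whether the outputs are unchanged."""
    fresh = run_and_record(record["cmd"], list(record["outputs"]), cwd)
    return fresh["outputs"] == record["outputs"]
```

A real tool would keep such a record in version control next to the outputs, so that a mismatch after a rerun points at a changed tool or environment.
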

Related is the https://github.com/datalad/datalad-container extension. The idea is that if reproducibility is the goal, we should also keep entire computing environments (singularity or docker images) under VCS! And since git-annex does not care what kind of file you keep there, everything works smoothly. Now we can be sure that we use the same environment locally and on HPC, that all changes are recorded in git history, and that we have clean ways to transfer between our computing infrastructure.

* *Integrations/Collaborations*

The power and the curse (somewhat) of git-annex is its breadth of coverage of external storage solutions. You could manage data content spread across a variety of "remotes" -- regular ssh-reachable hosts, S3 buckets, google drive (see https://github.com/DanielDent/git-annex-remote-rclone), etc. And you could provide custom additions, like we did for accessing data provided online in tar/zip-balls by many portals. Literally any dataset available online could be made accessible via git-annex, and it could support any available online/cloud storage portal. The main beauty is that the repository remains a Git repository, so you could publish it on github. Try e.g.

   datalad install -g https://github.com/psychoinformatics-de/studyforrest-data-phase2

to get yourself all 14.5GB of that dataset hosted on github, with data flowing from some regular http servers. Another example is the OpenNeuro project, which is switching to DataLad as its data management backend, with data offloaded to a versioned S3 bucket (s3://openneuro.org) while git repos are shared on github (https://github.com/OpenNeuroDatasets; still WiP, so some rough corners to polish).

** figshare - "publish" your datasets using datalad

http://docs.datalad.org/en/latest/generated/man/datalad-export-to-figshare.html

So you can publish your dataset as a tarball to figshare, and then your content locally would be linked to that tarball (so you can publish your git repo to github etc). Figshare (as well as zenodo) is suboptimal for data which is actively changing, since a dataset published there cannot be changed. Also, they do not support directory structure, which is why we publish tarballs. Our export to figshare could be improved to provide a flattened collection of files instead of a tarball though (contributions are welcome).

** OSF - someone needs to find time to provide support for it I guess

** Internal - students use DataLad to obtain/transfer/sync data between incoming data server, lab cluster, institutional HPC cluster.

The beauty of git-annex is that it allows them to keep the entire git repository on
HPC pretty much indefinitely while just dropping the data from HPC so as not
to consume precious space there, while being able to get it there again
should they need to rerun the analysis. All the changes are strictly
version controlled, so they never lose track of "which version of
data I need" or "on what version of data I ran the analysis".
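
The get/drop cycle described above can be sketched as follows. This is a toy model, not git-annex code: the class `Clone`, its methods, and the rule that a drop is refused unless another location still holds the content are simplifications chosen for illustration (git-annex applies its own, configurable checks before dropping).

```python
class Clone:
    """A repository clone that may or may not hold the content of each file."""

    def __init__(self, name: str):
        self.name = name
        self.content = {}  # file name -> bytes actually present in this clone

    def add(self, filename: str, data: bytes) -> None:
        """Store content locally, as when data first arrives."""
        self.content[filename] = data

    def get(self, filename: str, source: "Clone") -> None:
        """Fetch content from another clone that holds it."""
        self.content[filename] = source.content[filename]

    def drop(self, filename: str, others: list) -> bool:
        """Free local space, but only if some other clone still has the content."""
        if not any(filename in other.content for other in others):
            return False  # refuse: this would remove the last known copy
        self.content.pop(filename, None)
        return True
```

In this model the HPC clone can drop and re-get a file as often as needed, while the incoming data server is never allowed to drop the last copy.
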

** https://web.gin.g-node.org/

have a look at this "github clone" which was extended with git-annex support, in case you want a git-annex-aware github-like instance of your own.

** Caching - only recently exercised as an opportunity. See https://github.com/datalad/datalad/issues/2763 and references therein.

Relates also to experimentation. It is very quick to clone a dataset locally or from the web, but then you might be duplicating data which is already available on the filesystem. With that "caching remote" it would be possible to take advantage of hardlinks/CoW filesystems and experiment with datasets as quickly as experimenting with the code, without fear of ruining some centralized original version of the dataset.

* *Modularity (full study history/objects)*

** standard git submodules - similar to per-file (get/drop), you can install/uninstall subdatasets

In DataLad, after trying a few other ideas, we decided just to use a standard git submodule mechanism for "composition" of subdatasets in something bigger (we call them super-datasets):

- as pointed out in above comments, the entire http://datasets.datalad.org is just a git repository with submodules, establishing the full hierarchy of datasets with clear versioned associations
- if you have a study-dedicated VCS, you can then install (as a submodule) your input datasets (from other places), provide new sub-datasets for derived data (e.g. preprocessed) and results, possibly reusing those as independent pieces in follow up studies. Everything is version controlled, clear versioning association, etc
- include as submodule a dataset with your favorite singularity/docker images! ;)

* *"Open standard"*

all git-annex'ed information is within the git-annex branch, and is easy to understand should someone want to reimplement some git-annex functionality

Sorry that it came out a bit too long, but hopefully some people might find it useful.
------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M015d6539e85b09244e739bff
