[FR] supporting submodules with alternate version control systems (new contributor)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello all,

I'm familiar with opensource software development through Github, but
have not contributed to git before so apologies if I'm using the wrong
avenues. Please point me in the right direction if that is the case. I
saw this mailing list mentioned on the
[mirror](https://github.com/git/git) repository, so it seemed like the
right place to start.

I have a feature request I'd like some feedback on. The core idea is
to support submodules with alternate (i.e. non-git based) version
control systems.

* **Why:** Git is excellent for versioning code and I don't need
another VCS for that purpose. However, in machine learning (ML)
workflows it has become more
[standard](https://opendatascience.com/how-data-versioning-can-be-used-in-machine-learning/)
to version your datasets, and for this purpose many git-like tools
have been developed. See [Dolt](https://www.dolthub.com/),
[LakeFS](https://lakefs.io/), and [DVC](https://dvc.org/) for a few
examples. Currently, ML practitioners have to bifurcate their
development process - code is committed/managed with git and datasets
are committed/managed with a 3rd party VCS (and often cloned in a
different folder outside the git repository). My proposal is to unify
the data versioning tools with git submodules so that they can act as
any other 3rd party library inside a parent repository

* **How:** Most data versioning tools already define a git-like CLI.
For instance, you have "dolt commit", "dvc push", "lakectl diff", etc.
The set of commands and options is usually a subset of the full list
available in git, but the important ones are there. My approach would
require a few steps

1. Git defines an API for configuring 3rd party VCS tools. It's
essentially a mapping from git command to the equivalent in the 3rd
party library. This should also account for which options/flags are
supported
2. Developers from the 3rd party library integrate with this git API
by maintaining a config file for the mapping that gets installed
alongside their binaries
3. The .gitmodules syntax is extended to include a "type" field which
defaults to git but can be set to other supported values
4. Then end-users can add submodules with an alternate VCS. Once
added, the CLI interaction would appear like normal git but under the
hood it would be using a different engine (and remote storage)

Is something along these lines feasible? If so, could someone who is
more familiar with the code base give me a rough idea how one might go
about this? I would like to author the PR to implement this - just
looking for some help getting started.

Thank you for the help,

Addison



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux