On June 3, 2022 7:07 PM, Philip Oakley wrote: >On 01/06/2022 13:44, Addison Klinke wrote: >>> rsbecker: move code into a submodule from your own VCS system >> into a git repository and the work with the submodule without the git >> code-base knowing about this >> >>> Philip: uses a proper sub-module that within it then has >> the single 'large' file git-lfs style that hosts the hash reference >> for the data VCS >> >> The downside I see with both of these approaches is that translating >> the native data VCS to git (or LFS) negates all the benefits of having >> a VCS purpose-built for data. That's why the majority of data >> versioning tools exist - because git (or LFS) are not ideal for >> handling machine learning datasets > >The key aspect is deciding which of the two storage systems (the Data & the Code) >will be the overall lead system that contains the linked reference to the other >storage system to ensure the needed integrity. >That is not really a technical question. Rather its somewhat of a social discussion >(workflows, trust, style of integration, etc). > >It maybe that one of the systems does have less long-term integrity, as has been >seen in many versioning systems over the last century (both manual and >computer), but the UI is also important. > >IIRC Junio did note that having a suitable API to access the other storage system >(to know its status, etc.) is likely to be core to the ability to combine the two. It >may be that a top level 'gui' is used control both systems and ensure >synchronisation to hide the complexities of both systems. > >I'm still thinking that the "git-lfs like" style could be the one to use, but that is very >dependant on the API that is available for capturing the Data state into the git >entry that records that state, whether that is a file (git-lfs like) or a 'sub-module' >(directory as state ) style. Either way it still need reifying (i.e. coded to make the >abstract concept into a concrete implementation). > >Which ever route is chosen, it still sounds to me like a worthwhile enterprise. It's >all still very abstract. >> >> On Tue, May 10, 2022 at 2:54 PM Philip Oakley <philipoakley@iee.email> wrote: >>> On 10/05/2022 18:20, Jason Pyeron wrote: >>>>> -----Original Message----- >>>>> From: Junio C Hamano >>>>> Sent: Tuesday, May 10, 2022 1:01 PM >>>>> To: Addison Klinke <addison@xxxxxxxxx> >>>>> >>>>> Addison Klinke <addison@xxxxxxxxx> writes: >>>>> >>>>>> Is something along these lines feasible? >>>>> Offhand, I only think of one thing that could make it fundamentally >>>>> infeasible. >>>>> >>>>> When you bind an external repository (be it stored in Git or >>>>> somebody else's system) as a submodule, each commit in the >>>>> superproject records which exact commit in the submodule is used >>>>> with the rest of the superproject tree. And that is done by >>>>> recording the object name of the commit in the submodule. >>>>> >>>>> What it means for the foreign system that wants to "plug into" a >>>>> superproject in Git as a submodule? It is required to do two >>>>> things: >>>>> >>>>> * At the time "git commit" is run at the superproject level, the >>>>> foreign system has to be able to say "the version I have to be >>>>> used in the context of this superproject commit is X", with X >>>>> that somehow can be stored in the superproject's tree object >>>>> (which is sized 20-byte for SHA-1 repositories; in SHA-256 >>>>> repositories, it is a bit wider). >>>>> >>>>> * At the time "git chekcout" is run at the superproject level, the >>>>> superproject will learn the above X (i.e. the version of the >>>>> submodule that goes with the version of the superproject being >>>>> checked out). The foreign system has to be able to perform a >>>>> "checkout" given that X. >>>>> >>>>> If a foreign system cannot do the above two, then it fundamentally >>>>> would be incapable of participating in such a "superproject and >>>>> submodule" relationship. >>> The sub-modules already have that problem if the user forgets publish >>> their sub-module (see notes in the docs ;-). >>>> The submodule "type" could create an object (hashed and stored) that >contains the needed "translation" details. The object would be hashed using SHA1 >or SHA256 depending on the git config. The format of the object's contents would >be defined by the submodule's "code". >>>> >>> Another way of looking at the issue is via a variant of Git-LFS with >>> a smudge/clean style filter. I.e. the DataVCS would be treated as a 'file'. >>> >>> The LFS already uses the .gitattributes to define a 'type', while the >>> submodules don't yet have that capability. There is just a single >>> special type within a tree object of "sub-module" being a mode 16000 >>> commit (see https://longair.net/blog/2010/06/02/git-submodules-explained/). >>> >>> One thought is that one uses a proper sub-module that within it then >>> has the single 'large' file git-lfs style that hosts the hash >>> reference for the data VCS >>> (https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md). It would >>> be the regular sub-modules .gitattributes file that handles the data >>> conversion. >>> >>> It may be converting an X-Y problem into an X-Y-Z solution, or just >>> extending the problem. The most salient issue I have with this is that signatures cannot be validated across VCS systems. Within git, a submodule commit can be signed. This ensures that the contents of the commit in the super-project can also be signed. If someone hacks an underlying VCS that is not git, either: a) git can never sign a commit from an underlying VCS, or b) git can never trust a commit from an underlying VCS. This pollutes a fundamental capability of git, being multiple signers the contents of a commit, and invalidates the integrity of the Merkel tree that underlies git contents. I do not see that this concept contributes positively to the ecosystem. I do feel strongly about this and hope my points are understood. Sincerely, Randall