Re: git, monorepos, and access control

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 06 Dec 2018 10:17:24 +0100

On Thu, Dec 06 2018, Jeff King wrote:

> On Thu, Dec 06, 2018 at 10:08:57AM +0900, Junio C Hamano wrote:
>
>> Jeff King <peff@xxxxxxxx> writes:
>>
>> > In my opinion this feature is so contrary to Git's general assumptions
>> > that it's likely to create a ton of information leaks of the supposedly
>> > protected data.
>> > ...
>>
>> Yup, with s/implemented/designed/, I agree all you said here
>> (snipped).
>
> Heh, yeah, I actually scratched my head over what word to use. I think
> Git _could_ be written in a way that is both compatible with existing
> repositories (i.e., is still recognizably Git) and is careful about
> object access control. But either way, what we have now is not close to
> that.
>
>> > Sorry I don't have a more positive response. What you want to do is
>> > perfectly reasonable, but I just think it's a mismatch with how Git
>> > works (and because of the security impact, one missed corner case
>> > renders the whole thing useless).
>>
>> Yup, again.
>>
>> Storing source files encrypted and decrypting with smudge filter
>> upon checkout (and those without the access won't get keys and will
>> likely to use sparse checkout to exclude these priviledged sources)
>> is probably the only workaround that does not involve submodules.
>> Viewing "diff" and "log -p" would still be a challenge, which
>> probably could use the same filter as smudge for textconv.
>
> I suspect there are going to be some funny corner cases there. I use:
>
>   [diff "gpg"]
>   textconv = gpg -qd --no-tty
>
> which works pretty well, but it's for files which are _never_ decrypted
> by Git. So they're encrypted in the working tree too, and I don't use
> clean/smudge filters.
>
> If the files are already decrypted in the working tree, then running
> them through gpg again would be the wrong thing. I guess for a diff
> against the working tree, we would always do a "clean" operation to
> produce the encrypted text, and then decrypt the result using textconv.
> Which would work, but is rather slow.
>
>> I wonder (and this is the primary reason why I am responding to you)
>> if it is common enough wish to use the same filter for smudge and
>> textconv?  So far, our stance (which can be judged from the way the
>> clean/smudge filters are named) has been that the in-repo
>> representation is the canonical, and the representation used in the
>> checkout is ephemeral, and that is why we run "diff", "grep",
>> etc. over the in-repo representation, but the "encrypted in repo,
>> decrypted in checkout" abuse would be helped by an option to do the
>> reverse---find changes and look substrings in the representation
>> used in the checkout.  I am not sure if there are other use cases
>> that is helped by such an option.
>
> Hmm. Yeah, I agree with your line of reasoning here. I'm not sure how
> common it is. This is the first I can recall it. And personally, I have
> never really used clean/smudge filters myself, beyond some toy
> experiments.
>
> The other major user of that feature I can think of is LFS. There Git
> ends up diffing the LFS pointers, not the big files. Which arguably is
> the wrong thing (you'd prefer to see the actual file contents diffed),
> but I think nobody cares in practice because large files generally don't
> have readable diffs anyway.

I don't use this either, but I can imagine people who use binary files
via clean/smudge would be well served by dumping out textual metadata of
the file for diffing instead of showing nothing.

E.g. for a video file I might imagine having lines like:

    duration-seconds: 123
    camera-model: Shiny Thingamabob

Then when you check in a new file your "git diff" will show (using
normal diff view) that:

   - duration-seconds: 123
   + duration-seconds: 321
    camera-model: Shiny Thingamabob

etc.