Re: git, monorepos, and access control

Jeff King <peff@xxxxxxxx> · Thu, 6 Dec 2018 02:20:03 -0500

On Thu, Dec 06, 2018 at 10:08:57AM +0900, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > In my opinion this feature is so contrary to Git's general assumptions
> > that it's likely to create a ton of information leaks of the supposedly
> > protected data.
> > ...
> 
> Yup, with s/implemented/designed/, I agree with all you said here
> (snipped).

Heh, yeah, I actually scratched my head over what word to use. I think
Git _could_ be written in a way that is both compatible with existing
repositories (i.e., is still recognizably Git) and is careful about
object access control. But either way, what we have now is not close to
that.

> > Sorry I don't have a more positive response. What you want to do is
> > perfectly reasonable, but I just think it's a mismatch with how Git
> > works (and because of the security impact, one missed corner case
> > renders the whole thing useless).
> 
> Yup, again.
> 
> Storing source files encrypted and decrypting with smudge filter
> upon checkout (and those without the access won't get keys and will
> likely use sparse checkout to exclude these privileged sources)
> is probably the only workaround that does not involve submodules.
> Viewing "diff" and "log -p" would still be a challenge, which
> probably could use the same filter as smudge for textconv.

I suspect there are going to be some funny corner cases there. I use:

  [diff "gpg"]
  textconv = gpg -qd --no-tty

which works pretty well, but it's for files which are _never_ decrypted
by Git. So they're encrypted in the working tree too, and I don't use
clean/smudge filters.
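
That config only kicks in for paths that carry the matching diff
attribute. A minimal sketch, assuming the encrypted files end in ".gpg"
(the pattern is only an example, adjust to taste):

  # .gitattributes
  *.gpg diff=gpg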

If the files are already decrypted in the working tree, then running
them through gpg again would be the wrong thing. I guess for a diff
against the working tree, we would always do a "clean" operation to
produce the encrypted text, and then decrypt the result using textconv.
Which would work, but is rather slow.
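
For concreteness, the "encrypted in repo, decrypted in checkout" setup
would look roughly like the sketch below. This is untested; the "crypt"
name, the "*.secret" pattern, and RECIPIENT are placeholders made up for
illustration. Note that gpg's encrypted output differs on every run, so
a clean filter like this may well make files look modified all the
time, and I have not verified how the working-tree side of a diff
interacts with textconv here.

  # .git/config (RECIPIENT is a placeholder for a real key ID)
  [filter "crypt"]
  clean = gpg -qe --no-tty -r RECIPIENT
  smudge = gpg -qd --no-tty
  [diff "crypt"]
  textconv = gpg -qd --no-tty

  # .gitattributes
  *.secret filter=crypt diff=crypt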

> I wonder (and this is the primary reason why I am responding to you)
> if it is a common enough wish to use the same filter for smudge and
> textconv?  So far, our stance (which can be judged from the way the
> clean/smudge filters are named) has been that the in-repo
> representation is the canonical, and the representation used in the
> checkout is ephemeral, and that is why we run "diff", "grep",
> etc. over the in-repo representation, but the "encrypted in repo,
> decrypted in checkout" abuse would be helped by an option to do the
> reverse---find changes and look for substrings in the representation
> used in the checkout.  I am not sure if there are other use cases
> that are helped by such an option.

Hmm. Yeah, I agree with your line of reasoning here. I'm not sure how
common it is. This is the first time I can recall it coming up. And
personally, I have never really used clean/smudge filters myself, beyond
some toy experiments.

The other major user of that feature I can think of is LFS. There Git
ends up diffing the LFS pointers, not the big files. Which arguably is
the wrong thing (you'd prefer to see the actual file contents diffed),
but I think nobody cares in practice because large files generally don't
have readable diffs anyway.
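
For reference, an LFS pointer is just a few lines of text, something
like the following (the oid and size are placeholders, not real
values), so a diff between two versions only shows the hash and size
changing:

  version https://git-lfs.github.com/spec/v1
  oid sha256:<64 hex digits>
  size <size in bytes>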

-Peff