Re: [PATCH] [RFC] list-objects-filter: introduce new filter sparse:buffer=<spec>

ZheNing Hu <adlternative@xxxxxxxxx> · Fri, 26 Aug 2022 13:10:59 +0800

Derrick Stolee <derrickstolee@xxxxxxxxxx> wrote on Tue, Aug 9, 2022 at 21:37:
>
> On 8/8/2022 12:15 PM, Junio C Hamano wrote:
> > "ZheNing Hu via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> >
> >> From: ZheNing Hu <adlternative@xxxxxxxxx>
> >>
> >> We already have `--filter=sparse:oid=<oid>`, which can be used
> >> to clone a repository with only the objects that match the
> >> filter rules in the file corresponding to <oid> on the git
> >> server. But it can only read filter rules that have already
> >> been recorded on the git server.
> >
> > Was the reason why we have "we limit to an object we already have"
> > restriction because we didn't want to blindly use a piece of
> > uncontrolled arbitrary end-user data here?  Just wondering.
>
> One of the ideas here was to limit the opportunity of sending an
> arbitrary set of data over the Git protocol and avoid exactly the
> scenario you mention.
>

I see that sparse-checkout has a "cone" mode that restricts which
rules are allowed, and that this improves performance. Could we use
that mode here? From a brief look, cone mode seems to ensure that each
rule we add is a directory and does not contain any of the special
characters '!', '?', '*', '[', ']'. But if we send the filter rules to
the git server, the server cannot check whether a rule names a
directory, because the answer depends on the paths in many commits.
E.g. in 9e6f67 "test" can be a directory, while in e5e154e "test" can
be a file. I don't know how to solve this problem.
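
To make the problem concrete, here is a rough sketch in Python. It is
not git's code: the helper names and the per-commit table are made up
for illustration, and only the two abbreviated commit ids come from
the example above. The first check is purely syntactic, so a server
could run it on a client-supplied rule; the second one needs to know
what the path is in every commit involved.

```python
# Hypothetical helpers, not git's implementation.
SPECIAL_CHARS = set("!?*[]")


def is_plain_path_rule(rule: str) -> bool:
    """Return True if the rule is non-empty and has no special characters."""
    return bool(rule) and not (SPECIAL_CHARS & set(rule))


# Made-up data: what kind of entry "test" is in two different commits.
ENTRY_KIND = {
    "9e6f67": {"test": "tree"},   # "test" is a directory here
    "e5e154e": {"test": "blob"},  # "test" is a file here
}


def is_directory_everywhere(rule: str) -> bool:
    """Return True only if the rule names a directory in every commit."""
    return all(kinds.get(rule) == "tree" for kinds in ENTRY_KIND.values())
```

With this data, is_plain_path_rule("test") passes, but
is_directory_everywhere("test") does not, and the second check is the
one the server has no cheap way to answer.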

> Another was that it is incredibly expensive to compute the set of
> reachable objects within an arbitrary sparse-checkout definition,
> since it requires walking trees (bitmaps do not help here). This
> is why (to my knowledge) no Git hosting service currently supports
> this mechanism at scale. At minimum, using the stored OID would
> allow the host to keep track of these pre-defined sets and do some
> precomputing of reachable data using bitmaps to keep clones and
> fetches reasonable at all.
>
> The other side of the issue is that we do not have a good solution
> for resolving how to change this filter in the future, in case the
> user wants to expand their sparse-checkout definition and update
> their partial clone filter.
>
> There used to be a significant issue where a 'git checkout'
> would fault in a lot of missing trees because the index needed to
> reference the files outside of the sparse-checkout definition. Now
> that the sparse index exists, this is less of an impediment, but
> it can still cause some pain.
>
> At this moment, I think path-scoped filters have a lot of problems
> that need solving before they can be used effectively in the wild.
> I would prefer that we solve those problems before making the
> feature more complicated. That's a tall ask, since these problems
> do not have simple solutions.
>

I think that if we can make such path-scoped filters work, we could
combine them with sparse-checkout. Maybe one day users could run:

git clone --sparse --filter="sparse:buffer=dir" xxx.git

and get a repository whose contents already match the sparse-checkout
definition. Needless to say, this is very tempting.

> Thanks,
> -Stolee

Thanks,
ZheNing Hu