Re: [PATCH 4/4] fetch-pack: print and use dangling .gitmodules

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 17 Feb 2021 03:10:13 +0100

On Thu, Jan 28 2021, Jonathan Tan wrote:

>> On Sun, Jan 24 2021, Jonathan Tan wrote:
>> >  --fsck-objects::
>> > -	Die if the pack contains broken objects. For internal use only.
>> > +	For internal use only.
>> > ++
>> > +Die if the pack contains broken objects. If the pack contains a tree
>> > +pointing to a .gitmodules blob that does not exist, prints the hash of
>> > +that blob (for the caller to check) after the hash that goes into the
>> > +name of the pack/idx file (see "Notes").
>> 
>> [I should have waited a bit and sent one E-Mail]
>> 
>> Is this really generally usable as an IPC mechanism, what if we need
>> another set of OIDs we care about? Shouldn't it at least be hidden
>> behind some option so you don't get a deluge of output from index-pack
>> if you're not in this packfile-uri mode?
>
> --fsck-objects is only for internal use, and it's only used by
> fetch-pack.c. So its only consumer does want the output.
>
> Junio also mentioned the possibility of another set of OIDs, and I
> replied [1].
>
> [1] https://lore.kernel.org/git/20210128003536.3874866-1-jonathantanmy@xxxxxxxxxx/
>
>> But, along with my other E-Mail...
>> 
>> > [...]
>> > +static void parse_gitmodules_oids(int fd, struct oidset *gitmodules_oids)
>> > +{
>> > +	int len = the_hash_algo->hexsz + 1; /* hash + NL */
>> > +
>> > +	do {
>> > +		char hex_hash[GIT_MAX_HEXSZ + 1];
>> > +		int read_len = read_in_full(fd, hex_hash, len);
>> > +		struct object_id oid;
>> > +		const char *end;
>> > +
>> > +		if (!read_len)
>> > +			return;
>> > +		if (read_len != len)
>> > +			die("invalid length read %d", read_len);
>> > +		if (parse_oid_hex(hex_hash, &oid, &end) || *end != '\n')
>> > +			die("invalid hash");
>> > +		oidset_insert(gitmodules_oids, &oid);
>> > +	} while (1);
>> > +}
>> > +
>> 
>> Doesn't this IPC mechanism already exist in the form of fsck.skipList?
>> See my 1f3299fda9 (fsck: make fsck_config() re-usable, 2021-01-05) on
>> "next". I.e. as noted in my just-sent-E-Mail you could probably just
>> re-use skiplist as-is.
>
> I'm not sure how fsck.skipList could be used here. Before running
> fsck_finish() for the first time, we don't know which .gitmodules are
> missing and which are not. And when running fsck_finish() for the second
> time, we definitely do not want to skip any blobs.
>
>> Or if not it seems to me that this whole IPC mechanism would be better
>> done with a tempfile and passing it along like we already pass the
>> fsck.skipList between these processes.
>> 
>> I doubt it's going to be large enough to matter, we could just put it in
>> .git/ somewhere, like we put gc.log etc (but created with a mktemp()
>> name...).
>> 
>> Or if we want to keep the "print <list> | process" model we can refactor
>> the existing fsck IPC noted in 1f3299fda9 a bit, so e.g. you pass some
>> version of "lines prefixed with "fsck-skiplist: " go into list xyz via a
>> command-line option. And then existing option(s) and your potential new
>> list (which as noted, I think is probably redundant to the skiplist) can
>> use it.
>
> I think using stdout is superior to using a tempfile - we don't have to
> worry about interrupted invocations, for example.
>
> What do you mean by "the existing fsck IPC noted in 1f3299fda9"? If you
> mean the ability to pass a list of OIDs, for example using "-c
> fsck.skipList=filename.txt", I'm not sure that it solves anything.
> Firstly, I don't think that the skipList is useful here (as I said
> earlier). And secondly, I don't think that OID input is the issue -
> right now, the design is a process (index-pack, calling fsck_finish())
> writing to its output which is then picked up by the calling process
> (fetch-pack). We are not sending the dangling .gitmodules through stdin
> anywhere.

Sorry for being unclear here. I don't think (honestly I don't remember,
it's been almost a month) that I meant to you should use the skipList.

Looking at that code again we use object_on_skiplist() to do an early
punt in report(), but also fsck_blob(), presumably you never want the
latter, and that early punting wouldn't be needed if your report()
function intercepted the modules blob id for stashing it away / later
reporting / whatever.

So yeah, I'm 99% sure now that's not what I meant :)

What I meant with:

    Or if we want to keep the "print <list> | process"[...]

Is that we have an existing ad-hoc IPC model for these commands in
passing along the skipList, which is made more complex because sometimes
the initial process reads the file, sometimes it passes it along as-is
to the child.

And then there's this patch that passes OIDs too, but through a
different mechanism.

I was suggesting that perhaps it made more sense to refactor both so
they could use the same mechanism, because we're potentially passing two
lists of OIDs between the two. Just one goes via line-at-a-time in the
output, the other via a config option on the command-line.