Re: [PATCH v3] ls-files.c: add --dedup option

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 14 Jan 2021 16:59:40 -0800

"阿德烈 via GitGitGadget"  <gitgitgadget@xxxxxxxxx> writes:

> From: ZheNing Hu <adlternative@xxxxxxxxx>
>
> In order to provide users a better experience
> when viewing information about files in the index
> and the working tree, the `--dedup` option will suppress
> some duplicate options under some conditions.
>
> In a merge conflict, one item of "git ls-files" output may
> appear multiple times. For example,now the file `a.c` has
> a conflict,`a.c` will appear three times in the output of
> "git ls-files".We can use "git ls-files --dedup" to output
> `a.c` only one time.(unless `--stage` or `--unmerged` is
> used to view all the detailed information in the index)

Unlike these option names we see in the description, "dedup" is not
a full word.  Perhaps spell it fully "--deduplicate" while letting
parse-options machinery to accept unique prefix (including
"--dedup"?

> In addition, if you use both `--delete` and `--modify` in
> the same time, The `--dedup` option can also suppress modified

"at the same time", I think.

> entries output.

[let's call this point "point A"]

> `--dedup` option relevant descriptions in
> `Documentation/git-ls-files.txt`,

I am not sure what this means.

> the test script in `t/t3012-ls-files-dedup.sh`
> prove the correctness of the `--dedup` option.

No amount of tests "proves" any correctness, but that is OK.  I
think you meant to say "a few tests have been added to t3012 to
protect the new feature from future breakage" or something like
that.

In any case, I think everything after "point A" and before your sign
off does not belong to the log message.  The diffstat shows that
documentation and tests have been added already.

> +--dedup::
> +	Suppress duplicate entries when conflict happen

"conflict happen" -> "there are unmerged paths", as the term
"unmerged" is already shown to readers of "ls-files --help".

> +	or `--deleted` and `--modified` are combined.

I somehow thought that you refrained from deduping when you are
showing the stages with "ls-files -u" and "ls-files -s", or you are
showing status with "ls-files -t", because you will otherwise lose
information.  In other words, showing only one cache entry out of
many that share the same name makes sense only when we are showing
name and nothing else.

Has that been changed from the previous rounds?

> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index c8eae899b82..bc4eded19ab 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -316,6 +318,20 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
>  		for (i = 0; i < repo->index->cache_nr; i++) {
>  			const struct cache_entry *ce = repo->index->cache[i];
>  
> +			if (show_cached && delete_dup) {
> +				switch (ce_stage(ce)) {
> +				case 0:
> +				default:
> +					break;

This part looks somewhat strange for two reasons:

 - The code enumerates ALL the possible stage numbers from 0 to 3;
   if we were to have "default", I'd expect it would be a separate
   switch arm from the possible values that calls out an programming
   error, e.g. BUG("at stage #%d???", ce_stage(ce)).  Simply removing
   the "default" arm would be another way out of this strangeness.

 - When we see a stage #0 entry, we know we will not have higher
   stage entries with the same name.  Not clearing last_stage here
   feels wrong, as the primary reason why last_stage variable is
   used is to remember the last ce that was shown, so that other
   entries with the same name can be skipped.

By the way, "last_shown_ce" may be a much better name for the
variable, as you do not really care what stage number the ce you
showed last was at (you care about its name).

Also, I do not see a good reason why the last_shown_ce variable
should have lifetime longer than the block that contains this for()
loop (and the other for loop for deleted/modified codepath we see
later).  Especially since you initialize the variable that you made
visible to the entire function to NULL before entering the first for
loop, but you do not set it back to NULL before entering the second
for loop, it is inviting a subtle bug.  You may have been given
show_cached and show_modified at the same time, so you enter the
first loop and have shown the first stage of the last conflicted
path, whose cache entry is left in the last_stage variable.  Since
the variable has longer lifespan than necessary, when the second
loop is entered, it still points at the cache entry for the highest
stage of the last conflicted path.  That is because the code forgets
to clear it to NULL before entering the second for loop.

Having said all that, I suspect that we may be much better off if we
can somehow merge the two loops into one.  You may be dedup adjacent
entries in each loop separately with the approach taken by this
patch, but I do not think the patch would work to deduplicate across
two loops.  For example, what happens if you do this?

    $ git reset --hard
    $ echo >>builtin/ls-files.c
    $ git ls-files -c -m builtin/ls-files.c
    $ git ls-files -t -c -m builtin/ls-files.c

I think you see the path twice in the output, with or without your
--dedup option (remember what I said about proving, by the way? ;-)).

Once we successfully merged two loops into one, the part that shows
tracked paths in the function would have only one loop, and it would
become a lot cleaner to add the logic to "skip showing the ce if it
has the same name as the previously shown one, only when doing so
won't lose information", by doing something like this:

	static void show_files(....)
	{
		/* show_others || show_killed done here */
		...

		/* leave early if not showing anything */
		if (! (show_cached || show_stage || show_deleted || show_modified))
			return;

		last_shown_ce = NULL;
		for (i = 0; i < repo->index->cache_nr; i++) {
			const struct cache_entry *ce = repo->index->cache[i];

			if (skipping_duplicates && last_shown_ce)
				if (!strcmp(ce->name, last_shown_ce->name))
					continue;

			construct_fullname();

                        /* various reasons to skip the entry tested */
			if (showing ignored directory and ce is excluded)
				continue;
			if (show_unmerged && !ce_stage(ce))
				continue;
			if (ce->ce_flags & CE_UPDATE)
				continue;
			... other reasons may appear here ...

			/* now we are committed to show it */
			last_shown_ce = ce;

			... various different ways to show ce come here ... 
			show_ce(...);
		}
	}

where "skipping_duplicates" would be set when "--deduplicate" is
asked and we are not showing information other than the pathname
via various options e.g. the tags (-t) or stages (-s/-u).

> +			if (delete_dup && show_deleted && show_modified && err)
>  				show_ce(repo, dir, ce, fullname.buf, tag_removed);

I actually think the original code that is still shown here ...

> +			else {
> +				if (show_deleted && err)
> +					show_ce(repo, dir, ce, fullname.buf, tag_removed);
> +				if (show_modified && ie_modified(repo->index, ce, &st, 0))

... about modified file is buggy.  If lstat() failed, then &st has
no useful information, so it is wrong to feed it to ie_modified().

Perhaps a three-patch series that is structured like this may be in
order?

 #1: bugfix for --deleted and --modified.

	err = lstat(fullname.buf, &st);
	if (err) {
		/* deleted? */
		if (errno != E_NOENT)
			error_errno("cannot lstat '%s'", fullname.buf);
		else {
			if (show_deleted)
                        	show_ce(..., tag_removed);
			if (show_modified)
                        	show_ce(..., tag_modified);
		}
	} else if (show_modified && ie_modified(...))
		show_ce(..., tag_modified);

     This hopefully should not change the semantics.  If you ask
     --deleted and --modified, a deleted path would be listed twice.

 #2: consolidate two for loops into one.

     The two loops have slightly different condition to skip a ce,
     and different logic on what tag each path is shown with.  When
     --cached and --modified or --deleted are asked for at the same
     time, we'd show them multiple times (this is done inside the
     loop for each ce)

	if (show_cached || show_stage)
		show_ce(... ce_stage(ce) ? tag_unmerged : ...);
	err = lstat(fullname.buf, &st);
	if (err) {
        	/* deleted? */
                ... code that corresponds to the
		... illustration in #1 above come here.
	} else if (...)
		show_ce(..., tag_modified);

     This changes the semantics.  The original iterates the index
     twice, so you may see the same entry from --cached once and
     then again from --modified.  The updated one still will show
     the same entry twice but next to each other.

 #3: optionally deduplicate.

     Once we have a single loop, deduplicationg based on names is
     trivial, as we seen before.

Hmm?