Re: [PATCH v3] ls-files.c: add --dedup option

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Junio, thank you for your patience to review
my patch and guide me how to modify it.

Junio C Hamano <gitster@xxxxxxxxx> 于2021年1月15日周五 上午8:59写道:
>
> "阿德烈 via GitGitGadget"  <gitgitgadget@xxxxxxxxx> writes:
>
> > From: ZheNing Hu <adlternative@xxxxxxxxx>
> >
> > In order to provide users a better experience
> > when viewing information about files in the index
> > and the working tree, the `--dedup` option will suppress
> > some duplicate options under some conditions.
> >
> > In a merge conflict, one item of "git ls-files" output may
> > appear multiple times. For example,now the file `a.c` has
> > a conflict,`a.c` will appear three times in the output of
> > "git ls-files".We can use "git ls-files --dedup" to output
> > `a.c` only one time.(unless `--stage` or `--unmerged` is
> > used to view all the detailed information in the index)
>
> Unlike these option names we see in the description, "dedup" is not
> a full word.  Perhaps spell it fully "--deduplicate" while letting
> parse-options machinery to accept unique prefix (including
> "--dedup"?
>
Ok i have modified "--dedup" to "--deduplicate".
> > In addition, if you use both `--delete` and `--modify` in
> > the same time, The `--dedup` option can also suppress modified
>
> "at the same time", I think.
>
My poor English grammar :-)
> > entries output.
>
> [let's call this point "point A"]
>
> > `--dedup` option relevant descriptions in
> > `Documentation/git-ls-files.txt`,
>
> I am not sure what this means.
>
> > the test script in `t/t3012-ls-files-dedup.sh`
> > prove the correctness of the `--dedup` option.
>
> No amount of tests "proves" any correctness, but that is OK.  I
> think you meant to say "a few tests have been added to t3012 to
> protect the new feature from future breakage" or something like
> that.
>
Alright, I understand!
> In any case, I think everything after "point A" and before your sign
> off does not belong to the log message.  The diffstat shows that
> documentation and tests have been added already.
>
> > +--dedup::
> > +     Suppress duplicate entries when conflict happen
>
> "conflict happen" -> "there are unmerged paths", as the term
> "unmerged" is already shown to readers of "ls-files --help".
>
Well, maybe I'm not good enough with these proper nouns.
> > +     or `--deleted` and `--modified` are combined.
>
> I somehow thought that you refrained from deduping when you are
> showing the stages with "ls-files -u" and "ls-files -s", or you are
> showing status with "ls-files -t", because you will otherwise lose
> information.  In other words, showing only one cache entry out of
> many that share the same name makes sense only when we are showing
> name and nothing else.
>
You are right, "--deduplicate" should only work on duplicate file names,
so "ls-files -t" also needs to be corrected.
Well,This is true a bug I haven't notice.
> Having said all that, I suspect that we may be much better off if we
> can somehow merge the two loops into one.  You may be dedup adjacent
> entries in each loop separately with the approach taken by this
> patch, but I do not think the patch would work to deduplicate across
> two loops.  For example, what happens if you do this?
>
>     $ git reset --hard
>     $ echo >>builtin/ls-files.c
>     $ git ls-files -c -m builtin/ls-files.c
>     $ git ls-files -t -c -m builtin/ls-files.c
>
> I think you see the path twice in the output, with or without your
> --dedup option (remember what I said about proving, by the way? ;-)).
>
Yeah,This is because I may have missed the -c option with other options
at the same time.
Here I may disagree with your point of view:
                 if (errno != E_NOENT)
                         error_errno("cannot lstat '%s'", fullname.buf);
With this sentence included, the patch will fail the test:
t/t3010-ls-files-killed-modified.sh.
the errno maybe ENOTDIR when you try to lstat a file`r` with `lstat("r/f",&st);`
So I temporarily removed the judgment of errno.
>  #2: consolidate two for loops into one.
>
>      The two loops have slightly different condition to skip a ce,
>      and different logic on what tag each path is shown with.  When
>      --cached and --modified or --deleted are asked for at the same
>      time, we'd show them multiple times (this is done inside the
>      loop for each ce)
>
>         if (show_cached || show_stage)
>                 show_ce(... ce_stage(ce) ? tag_unmerged : ...);
>         err = lstat(fullname.buf, &st);
>         if (err) {
>                 /* deleted? */
>                 ... code that corresponds to the
>                 ... illustration in #1 above come here.
>         } else if (...)
>                 show_ce(..., tag_modified);
>
>      This changes the semantics.  The original iterates the index
>      twice, so you may see the same entry from --cached once and
>      then again from --modified.  The updated one still will show
>      the same entry twice but next to each other.
>
Well,This does change the semantics. I think people who used two
for loops before may want to separate different outputs.
Now, if you don’t use "--deduplicate", You may see six consecutive
items under a combination of multiple options.
>  #3: optionally deduplicate.
>
>      Once we have a single loop, deduplicationg based on names is
>      trivial, as we seen before.
>
>
Indeed so.
> Hmm?
THANKS.

Junio C Hamano <gitster@xxxxxxxxx> 于2021年1月15日周五 上午8:59写道:
>
> "阿德烈 via GitGitGadget"  <gitgitgadget@xxxxxxxxx> writes:
>
> > From: ZheNing Hu <adlternative@xxxxxxxxx>
> >
> > In order to provide users a better experience
> > when viewing information about files in the index
> > and the working tree, the `--dedup` option will suppress
> > some duplicate options under some conditions.
> >
> > In a merge conflict, one item of "git ls-files" output may
> > appear multiple times. For example,now the file `a.c` has
> > a conflict,`a.c` will appear three times in the output of
> > "git ls-files".We can use "git ls-files --dedup" to output
> > `a.c` only one time.(unless `--stage` or `--unmerged` is
> > used to view all the detailed information in the index)
>
> Unlike these option names we see in the description, "dedup" is not
> a full word.  Perhaps spell it fully "--deduplicate" while letting
> parse-options machinery to accept unique prefix (including
> "--dedup"?
>
> > In addition, if you use both `--delete` and `--modify` in
> > the same time, The `--dedup` option can also suppress modified
>
> "at the same time", I think.
>
> > entries output.
>
> [let's call this point "point A"]
>
> > `--dedup` option relevant descriptions in
> > `Documentation/git-ls-files.txt`,
>
> I am not sure what this means.
>
> > the test script in `t/t3012-ls-files-dedup.sh`
> > prove the correctness of the `--dedup` option.
>
> No amount of tests "proves" any correctness, but that is OK.  I
> think you meant to say "a few tests have been added to t3012 to
> protect the new feature from future breakage" or something like
> that.
>
> In any case, I think everything after "point A" and before your sign
> off does not belong to the log message.  The diffstat shows that
> documentation and tests have been added already.
>
> > +--dedup::
> > +     Suppress duplicate entries when conflict happen
>
> "conflict happen" -> "there are unmerged paths", as the term
> "unmerged" is already shown to readers of "ls-files --help".
>
> > +     or `--deleted` and `--modified` are combined.
>
> I somehow thought that you refrained from deduping when you are
> showing the stages with "ls-files -u" and "ls-files -s", or you are
> showing status with "ls-files -t", because you will otherwise lose
> information.  In other words, showing only one cache entry out of
> many that share the same name makes sense only when we are showing
> name and nothing else.
>
> Has that been changed from the previous rounds?
>
> > diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> > index c8eae899b82..bc4eded19ab 100644
> > --- a/builtin/ls-files.c
> > +++ b/builtin/ls-files.c
> > @@ -316,6 +318,20 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
> >               for (i = 0; i < repo->index->cache_nr; i++) {
> >                       const struct cache_entry *ce = repo->index->cache[i];
> >
> > +                     if (show_cached && delete_dup) {
> > +                             switch (ce_stage(ce)) {
> > +                             case 0:
> > +                             default:
> > +                                     break;
>
> This part looks somewhat strange for two reasons:
>
>  - The code enumerates ALL the possible stage numbers from 0 to 3;
>    if we were to have "default", I'd expect it would be a separate
>    switch arm from the possible values that calls out an programming
>    error, e.g. BUG("at stage #%d???", ce_stage(ce)).  Simply removing
>    the "default" arm would be another way out of this strangeness.
>
>  - When we see a stage #0 entry, we know we will not have higher
>    stage entries with the same name.  Not clearing last_stage here
>    feels wrong, as the primary reason why last_stage variable is
>    used is to remember the last ce that was shown, so that other
>    entries with the same name can be skipped.
>
> By the way, "last_shown_ce" may be a much better name for the
> variable, as you do not really care what stage number the ce you
> showed last was at (you care about its name).
>
> Also, I do not see a good reason why the last_shown_ce variable
> should have lifetime longer than the block that contains this for()
> loop (and the other for loop for deleted/modified codepath we see
> later).  Especially since you initialize the variable that you made
> visible to the entire function to NULL before entering the first for
> loop, but you do not set it back to NULL before entering the second
> for loop, it is inviting a subtle bug.  You may have been given
> show_cached and show_modified at the same time, so you enter the
> first loop and have shown the first stage of the last conflicted
> path, whose cache entry is left in the last_stage variable.  Since
> the variable has longer lifespan than necessary, when the second
> loop is entered, it still points at the cache entry for the highest
> stage of the last conflicted path.  That is because the code forgets
> to clear it to NULL before entering the second for loop.
>
> Having said all that, I suspect that we may be much better off if we
> can somehow merge the two loops into one.  You may be dedup adjacent
> entries in each loop separately with the approach taken by this
> patch, but I do not think the patch would work to deduplicate across
> two loops.  For example, what happens if you do this?
>
>     $ git reset --hard
>     $ echo >>builtin/ls-files.c
>     $ git ls-files -c -m builtin/ls-files.c
>     $ git ls-files -t -c -m builtin/ls-files.c
>
> I think you see the path twice in the output, with or without your
> --dedup option (remember what I said about proving, by the way? ;-)).
>
> Once we successfully merged two loops into one, the part that shows
> tracked paths in the function would have only one loop, and it would
> become a lot cleaner to add the logic to "skip showing the ce if it
> has the same name as the previously shown one, only when doing so
> won't lose information", by doing something like this:
>
>         static void show_files(....)
>         {
>                 /* show_others || show_killed done here */
>                 ...
>
>                 /* leave early if not showing anything */
>                 if (! (show_cached || show_stage || show_deleted || show_modified))
>                         return;
>
>                 last_shown_ce = NULL;
>                 for (i = 0; i < repo->index->cache_nr; i++) {
>                         const struct cache_entry *ce = repo->index->cache[i];
>
>                         if (skipping_duplicates && last_shown_ce)
>                                 if (!strcmp(ce->name, last_shown_ce->name))
>                                         continue;
>
>                         construct_fullname();
>
>                         /* various reasons to skip the entry tested */
>                         if (showing ignored directory and ce is excluded)
>                                 continue;
>                         if (show_unmerged && !ce_stage(ce))
>                                 continue;
>                         if (ce->ce_flags & CE_UPDATE)
>                                 continue;
>                         ... other reasons may appear here ...
>
>                         /* now we are committed to show it */
>                         last_shown_ce = ce;
>
>                         ... various different ways to show ce come here ...
>                         show_ce(...);
>                 }
>         }
>
> where "skipping_duplicates" would be set when "--deduplicate" is
> asked and we are not showing information other than the pathname
> via various options e.g. the tags (-t) or stages (-s/-u).
>
> > +                     if (delete_dup && show_deleted && show_modified && err)
> >                               show_ce(repo, dir, ce, fullname.buf, tag_removed);
>
> I actually think the original code that is still shown here ...
>
> > +                     else {
> > +                             if (show_deleted && err)
> > +                                     show_ce(repo, dir, ce, fullname.buf, tag_removed);
> > +                             if (show_modified && ie_modified(repo->index, ce, &st, 0))
>
> ... about modified file is buggy.  If lstat() failed, then &st has
> no useful information, so it is wrong to feed it to ie_modified().
>
> Perhaps a three-patch series that is structured like this may be in
> order?
>
>  #1: bugfix for --deleted and --modified.
>
>         err = lstat(fullname.buf, &st);
>         if (err) {
>                 /* deleted? */
>                 if (errno != E_NOENT)
>                         error_errno("cannot lstat '%s'", fullname.buf);
>                 else {
>                         if (show_deleted)
>                                 show_ce(..., tag_removed);
>                         if (show_modified)
>                                 show_ce(..., tag_modified);
>                 }
>         } else if (show_modified && ie_modified(...))
>                 show_ce(..., tag_modified);
>
>      This hopefully should not change the semantics.  If you ask
>      --deleted and --modified, a deleted path would be listed twice.
>
>  #2: consolidate two for loops into one.
>
>      The two loops have slightly different condition to skip a ce,
>      and different logic on what tag each path is shown with.  When
>      --cached and --modified or --deleted are asked for at the same
>      time, we'd show them multiple times (this is done inside the
>      loop for each ce)
>
>         if (show_cached || show_stage)
>                 show_ce(... ce_stage(ce) ? tag_unmerged : ...);
>         err = lstat(fullname.buf, &st);
>         if (err) {
>                 /* deleted? */
>                 ... code that corresponds to the
>                 ... illustration in #1 above come here.
>         } else if (...)
>                 show_ce(..., tag_modified);
>
>      This changes the semantics.  The original iterates the index
>      twice, so you may see the same entry from --cached once and
>      then again from --modified.  The updated one still will show
>      the same entry twice but next to each other.
>
>  #3: optionally deduplicate.
>
>      Once we have a single loop, deduplicationg based on names is
>      trivial, as we seen before.
>
>
> Hmm?




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux