Re: [PATCH v2 1/3] diff-files --raw: handle intent-to-add files correctly

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 24 Jun 2020 08:26:33 -0700

Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes:

> Sure, but my intention was to synchronize the `--raw` vs the `--patch`
> output: the latter _already_ shows the correct hash. This here patch makes
> the hash in the former's output match the latter's.

That aims for the wrong kind of uniformity and breaks consistency
among the `--raw` modes.

 $ git reset --hard
 $ echo "/* end */" >cache.h ;# taint
 $ git diff-files --raw
 ... this shows (M)odified with 0{40} on the postimage
 ... 0{40} from low-level diff, on a side that is known to have contents,
 ... means "object name unknown; figure it out yourself if you need it"
 $ git update-index cache.h
 $ git diff-files --raw
 ... of course we see nothing here.  Wait for a bit.
 $ touch cache.h ;# smudge
 $ git diff-files --raw
 ... this shows (M)odified with 0{40} on the postimage
 ... again, it says "it is stat dirty so I do not bother to compute"
 $ git update-index --refresh
 $ git diff-files --raw
 ... again we see nothing.

Any tools that work on "--raw" output must be already prepared to
see 0{40} on the side that is known to have contents and must know
to grab the contents from the working tree file if they need them,
so showing the 0{40} for i-t-a entry (whose definition is "the user
said in the past that the final contents of the file will be added
later, but Git does not know what object it will be yet") cannot
break them.  And the behaviour of giving 0{40} in such a case aligns
well with what is already done for paths already added to the index
when Git does not have an already-computed object name handy.
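
To illustrate the "figure it out yourself" part, here is a minimal
sketch of how a consumer could compute the blob object name from the
working tree file.  The helper names are hypothetical.  It assumes a
SHA-1 repository and ignores clean filters and end-of-line conversion,
so a real tool would more likely run "git hash-object" on the path.

```python
import hashlib


def blob_object_name(data: bytes) -> str:
    """Return the SHA-1 object name Git gives a blob with these bytes.

    Assumption: SHA-1 repository, and the bytes are already in the form
    Git would store (no clean filter or CRLF conversion is applied here).
    """
    # A blob is hashed as "blob <size>\0" followed by the raw contents.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def blob_object_name_of_file(path: str) -> str:
    """Hash the working tree file at `path` as a blob (hypothetical helper)."""
    with open(path, "rb") as f:
        return blob_object_name(f.read())


if __name__ == "__main__":
    # The empty blob has a well-known object name.
    print(blob_object_name(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

A consumer would call something like this only for paths whose
content-bearing side shows 0{40}, and only when it needs the name.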

> Besides, we're talking about the post-image of `diff-files`, i.e. the
> worktree version, here. I seem to remember that the pre-image already uses
> the all-zero hash to indicate precisely what you mentioned above.

The 0{40} you see on the pre-image side for (A)dded paths means a
completely different thing from the 0{40} I have been explaining above,
so that is not relevant here.

By definition, there are *no* contents for the pre-image side of
(A)dded paths (that is why I stressed the "side that must have
contents" in the above description---it is determined by the type of
the change), but because the format requires us to place some
hexadecimal there, we fill the space with 0{40}.

When we do not know the object name for the side that is known to
have contents without performing extra computation (including "stat
dirty so we cannot tell without rehashing"), we also use 0{40} as a
signal to say "we do not know the actual contents", but the consumer
of "--raw" format is expected to know the difference between "this
side is known to have no data and 0{40} is here as filler" and "this
side must have contents but we see 0{40} because Git does not have
it handy in precomputed form".
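
The distinction can be sketched from the consumer's side roughly as
follows.  This is only an illustration: the function name and the
returned labels are made up, it assumes unabbreviated object names, it
treats only (A)dded and (D)eleted as having an empty side, and it does
not split the two paths of a rename/copy record.

```python
def classify_raw_line(line: str) -> dict:
    """Classify both sides of one "--raw" record (hypothetical helper).

    Expected shape: ":<mode> <mode> <oid> <oid> <status>\t<path>"
    """
    meta, _, path = line.rstrip("\n").partition("\t")
    _src_mode, _dst_mode, src_oid, dst_oid, status = meta.lstrip(":").split(" ")
    kind = status[0]  # drop any similarity score, e.g. "R100" -> "R"

    def describe(oid: str, has_contents: bool) -> str:
        if not has_contents:
            # The type of change says this side is empty; 0{40} is filler.
            return "no contents (filler)"
        if oid.strip("0") == "":
            # Side must have contents, but no precomputed name was at hand.
            return "unknown, read from working tree"
        return "known"

    return {
        "path": path,
        "status": kind,
        "pre": describe(src_oid, has_contents=(kind != "A")),
        "post": describe(dst_oid, has_contents=(kind != "D")),
    }


if __name__ == "__main__":
    zeros = "0" * 40
    print(classify_raw_line(f":100644 100644 {'a' * 40} {zeros} M\tcache.h"))
```

For a stat-dirty (M)odified path the post-image comes out as "unknown",
while for an (A)dded path the pre-image comes out as "filler", even
though both show the same 0{40}.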

The above is the same for "diff-index --raw" without "--cached";
when we have to hash before we can give the object name (e.g. the
path is stat-dirty), we give 0{40} and let the consumer figure it
out if it needs to.

 $ git reset --hard
 $ touch COPYING
 $ git diff-index --raw HEAD
 ... we see (M)odified with 0{40} on the right hand side.

When the caller asks for "--patch" or any other output format that
actually needs contents for output, however, these low-level tools
do read the contents, and as a side effect, they may hash to obtain
the object name and show it [*1*].

By the way, as I do not want to see you waste your time going in a
wrong direction just to be "different", let me make it clear that as
far as the design of low level diff plumbing is concerned, what I
said here is final.  Please don't waste your time on arguing for
changing the design now after 15 years.  I want to see your time
used in a more productive way for the project.

Thanks.

[Footnote]

*1* This division of labor to free "--raw" mode of anything remotely
    unnecessary stems from the original diff plumbing design in May
    2005 where the "--raw" mode was the only output mode, and there
    was a separate "git-diff-helper" (look for it in the mailing
    list archive if you want to learn more) that reads a "--raw"
    output and transforms it into the patch form.  That idea, "once
    we have the raw diff, we can pipe it to post-processing and do
    more interesting things", eventually led to the design of the
    diffcore pipeline where we match up (A)dded and (D)eleted entries
    to detect renames, etc.