Re: [PATCH] travis-ci: run previously failed tests first, then slowest to fastest

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 27 Jan 2016 12:49:31 -0800

Junio C Hamano <gitster@xxxxxxxxx> writes:

> One way to solve (1) I can think of is to change the definition of
> ce_compare_data(), which is called by the code that does not trust
> the cached stat data (including but not limited to the Racy Git
> codepath).  The current semantics of that function asks this
> question:
>
>     We do not know if the working tree file and the indexed data
>     match.  Let's see if "git add" of that path would record the
>     data that is identical to what is in the index.
>
> This definition was cast in stone by 29e4d363 (Racy GIT, 2005-12-20)
> and has been with us since Git v1.0.0.  But that does not have to be
> the only sensible definition of this check.  I wonder what would
> break if we ask this question instead:
>
>     We do not know if the working tree file and the indexed data
>     match.  Let's see if "git checkout" of that path would leave the
>     same data as what currently is in the working tree file.
>
> If we did this, "reset --hard HEAD" followed by "diff HEAD" will by
> definition always report "is clean" as long as nobody changes files
> in the working tree, even with the inconsistent data in the index.
>
> This still requires that convert_to_working_tree(), i.e. your smudge
> filter, is deterministic, though, but I think that is a sensible
> assumption for sane people, even for those with inconsistent data in
> the index.

Just a few additional comments.

The primary reason why I originally chose "does 'git add' of what is
in the working tree give us the same blob in the index?" as opposed
to "would 'git checkout' from the index again give the same result
in the working tree?" is that it is a lot less resource intensive
and also simpler.  Back then I do not think we had a streaming
interface to hash huge contents from a file in the working tree, but
either way that check only requires us to read the entire file from
the filesystem just once, apply the convert_to_git() processing and
then hash the result, whether we keep the whole thing in core at
once or process the data in streaming fashion.  Doing the other
check will have to
inflate the blob data and apply the convert_to_working_tree()
processing, and also read the whole thing from the filesystem and
compare, which is more work at runtime.  And for a sane set-up where
the data in the index does not contradict the clean/smudge filter
and EOL settings, both would yield the same result.
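
To make the difference between the two questions concrete, here is a
minimal toy sketch.  It is not Git code: toy_clean() and toy_smudge()
are hypothetical stand-ins for convert_to_git() and
convert_to_working_tree() that only do a simplistic CRLF conversion,
and matches_by_add() and matches_by_checkout() are names made up for
this illustration.  Hashing and inflating are left out; the blob and
the file are plain C strings.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX 256

/* Toy "clean" filter (stand-in for convert_to_git): CRLF -> LF. */
static size_t toy_clean(const char *in, size_t len, char *out)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		if (in[i] == '\r' && i + 1 < len && in[i + 1] == '\n')
			continue; /* drop the CR of a CRLF pair */
		out[n++] = in[i];
	}
	return n;
}

/* Toy "smudge" filter (stand-in for convert_to_working_tree): bare LF -> CRLF. */
static size_t toy_smudge(const char *in, size_t len, char *out)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r'))
			out[n++] = '\r'; /* add a CR only before a bare LF */
		out[n++] = in[i];
	}
	return n;
}

/* Current question: would "git add" of the file record the indexed data? */
static int matches_by_add(const char *blob, const char *worktree)
{
	char buf[TOY_MAX];
	size_t wlen = strlen(worktree), n;

	if (wlen > TOY_MAX)
		return 0; /* toy limit: oversized input is reported as "differs" */
	n = toy_clean(worktree, wlen, buf);
	return n == strlen(blob) && !memcmp(buf, blob, n);
}

/* Proposed question: would "git checkout" leave the file as it is now? */
static int matches_by_checkout(const char *blob, const char *worktree)
{
	char buf[TOY_MAX];
	size_t blen = strlen(blob), n;

	if (2 * blen > TOY_MAX)
		return 0; /* toy limit: smudging can at most double the size */
	n = toy_smudge(blob, blen, buf);
	return n == strlen(worktree) && !memcmp(buf, worktree, n);
}
```

With a consistent blob "a\nb\n" and a working tree file "a\r\nb\r\n",
both functions report a match.  With an inconsistent blob "a\r\nb\n"
(a CRLF that the clean filter would never have recorded) and the same
file, which is what the toy smudge would write out, matches_by_add()
reports a difference while matches_by_checkout() reports a match.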

If we were to switch the semantics of ce_compare_data(), we would
want a new sibling interface next to stream_blob_to_fd() that takes
a file descriptor opened on the file in the working tree for reading
(fd), the object name (sha1), and the output filter, and works very
similarly to stream_blob_to_fd().  The difference would be that we
would be reading from the fd (i.e. the file in the working tree) as
we read from the istream (i.e. the contents of the blob in the
index, after passing the convert_to_working_tree() filter) and
comparing them in the main loop.  The filter parameter to the
function would be obtained by calling get_stream_filter() just like
how write_entry() uses it to prepare the filter parameter to call
streaming_write_entry() with.  That way, we can rely on future
improvement of the streaming interface to make sure we keep the data
we have to keep in core to the minimum.
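
The main loop of such a comparison could look roughly like the sketch
below.  This is only an illustration under simplifying assumptions,
not the proposed interface: both sides are plain POSIX file
descriptors, with blob_fd standing in for the istream that yields the
blob contents after the convert_to_working_tree() filter, and the
names read_fully() and toy_stream_compare() are made up here.  The
point is that only two fixed-size buffers are ever held in core.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Read until buf is full or EOF; return bytes read, or -1 on error. */
static ssize_t read_fully(int fd, char *buf, size_t count)
{
	size_t total = 0;

	while (total < count) {
		ssize_t n = read(fd, buf + total, count - total);

		if (n < 0) {
			if (errno == EINTR)
				continue; /* interrupted, retry */
			return -1;
		}
		if (!n)
			break; /* EOF */
		total += n;
	}
	return (ssize_t)total;
}

/*
 * Compare two byte streams chunk by chunk.
 * Returns 0 if identical, 1 if they differ, -1 on read error.
 */
static int toy_stream_compare(int blob_fd, int worktree_fd)
{
	char a[8192], b[8192];

	for (;;) {
		ssize_t na = read_fully(blob_fd, a, sizeof(a));
		ssize_t nb = read_fully(worktree_fd, b, sizeof(b));

		if (na < 0 || nb < 0)
			return -1;
		/* A short chunk only happens at EOF, so sizes must agree. */
		if (na != nb || memcmp(a, b, na))
			return 1;
		if (!na)
			return 0; /* both sides hit EOF together */
	}
}
```

Filling each buffer completely before comparing keeps the two sides
aligned even when read() returns short counts, at the cost of not
bailing out until a full chunk has been read from both.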

IOW, I am saying that the "add --fix-index" lunchbreak patch I sent
earlier in the thread that has to hold the data in-core while
processing is not a production quality patch ;-)
