Junio C Hamano <gitster@xxxxxxxxx> writes:

> One way to solve (1) I can think of is to change the definition of
> ce_compare_data(), which is called by the code that does not trust
> the cached stat data (including but not limited to the Racy Git
> codepath). The current semantics of that function asks this
> question:
>
> We do not know if the working tree file and the indexed data
> match. Let's see if "git add" of that path would record the
> data that is identical to what is in the index.
>
> This definition was cast in stone by 29e4d363 (Racy GIT, 2005-12-20)
> and has been with us since Git v1.0.0. But that does not have to be
> the only sensible definition of this check. I wonder what would
> break if we ask this question instead:
>
> We do not know if the working tree file and the indexed data
> match. Let's see if "git checkout" of that path would leave the
> same data as what currently is in the working tree file.
>
> If we did this, "reset --hard HEAD" followed by "diff HEAD" will by
> definition always report "is clean" as long as nobody changes files
> in the working tree, even with the inconsistent data in the index.
>
> This still requires that convert_to_working_tree(), i.e. your smudge
> filter, is deterministic, though, but I think that is a sensible
> assumption for sane people, even for those with inconsistent data in
> the index.

Just a few additional comments.

The primary reason why I originally chose "does 'git add' of what is
in the working tree give us the same blob as the one in the index?"
as opposed to "does 'git checkout' from the index again give the same
result as what is in the working tree?" is that the former is a lot
less resource intensive and is also simpler.

Back then I do not think we had a streaming interface to hash huge
contents from a file in the working tree, but either way that check
only requires us to read the entire file from the filesystem once,
apply the convert_to_git() processing and then hash the result,
whether we keep the whole thing in core at once or process the data
in a streaming fashion.

Doing the other check would have to inflate the blob data and apply
the convert_to_working_tree() processing, and also read the whole
thing from the filesystem and compare, which is more work at runtime.
And for a sane set-up, where the data in the index does not
contradict the clean/smudge filter and EOL settings, both would yield
the same result.

If we were to switch the semantics of ce_compare_data(), we would
want a new sibling interface next to stream_blob_to_fd() that takes a
file descriptor opened on the file in the working tree for reading
(fd), the object name (sha1), and the output filter, and works very
similarly to stream_blob_to_fd(). The difference would be that we
would be reading from the fd (i.e. the file in the working tree) as
we read from the istream (i.e. the contents of the blob in the index,
after passing it through the convert_to_working_tree() filter) and
comparing them in the main loop. The filter parameter to the function
would be obtained by calling get_stream_filter(), just like
write_entry() does to prepare the filter parameter it calls
streaming_write_entry() with. That way, we can rely on future
improvements to the streaming interface to keep the amount of data we
have to hold in core to a minimum.
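To make that concrete, here is a rough, untested sketch of the kind
of helper I have in mind, assuming the streaming API we already have
(open_istream(), read_istream() and close_istream() from streaming.h,
plus read_in_full()); the function name stream_cmp_blob_to_fd() is
made up for illustration. It answers "would checking out this blob
through this filter reproduce what is in the working tree file?" by
comparing the two streams chunk by chunk:

#include "cache.h"
#include "streaming.h"

/*
 * Illustration only: return 0 if streaming the blob named by "sha1"
 * through "filter" yields byte-for-byte what can be read from "fd"
 * (the working tree file), 1 if they differ, -1 on error.
 */
static int stream_cmp_blob_to_fd(int fd, const unsigned char *sha1,
				 struct stream_filter *filter)
{
	struct git_istream *st;
	enum object_type type;
	unsigned long size;
	char ibuf[16384], fbuf[16384];
	int result = 0;

	st = open_istream(sha1, &type, &size, filter);
	if (!st)
		return -1;

	for (;;) {
		ssize_t blob_len, file_len;

		/* next chunk of the blob, smudged through "filter" */
		blob_len = read_istream(st, ibuf, sizeof(ibuf));
		if (blob_len < 0) {
			result = -1;
			break;
		}
		if (!blob_len) {
			/* the blob ran out; the file must be at EOF, too */
			if (read_in_full(fd, fbuf, 1) != 0)
				result = 1;
			break;
		}
		/* the corresponding bytes from the working tree file */
		file_len = read_in_full(fd, fbuf, blob_len);
		if (file_len != blob_len || memcmp(ibuf, fbuf, blob_len)) {
			result = 1;
			break;
		}
	}
	close_istream(st);
	return result;
}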
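A caller, say a revised ce_compare_data(), might then be wired up
along these lines (again only a sketch with a made-up name). When
get_stream_filter() returns NULL, i.e. the conversion for the path
cannot be done in a streaming fashion, we would have to fall back to
the current "hash what 'git add' would create" check, which is where
holding the data in core comes back into the picture:

/*
 * Illustration only: compare via the "checkout" semantics when the
 * path's conversion can be streamed, otherwise fall back to today's
 * ce_compare_data() in read-cache.c.
 */
static int ce_compare_data_via_checkout(struct cache_entry *ce)
{
	struct stream_filter *filter =
		get_stream_filter(ce->name, ce->sha1);
	int match = -1;
	int fd;

	if (!filter)
		return ce_compare_data(ce); /* today's check */

	fd = open(ce->name, O_RDONLY);
	if (fd >= 0) {
		match = stream_cmp_blob_to_fd(fd, ce->sha1, filter);
		close(fd);
	}
	return match;
}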
IOW, I am saying that the "add --fix-index" lunchbreak patch I sent
earlier in the thread, which has to hold the data in-core while
processing, is not a production-quality patch ;-)