On Mon, Feb 08, 2016 at 11:55:37PM +0000, Anatoly Borodin wrote: > unfortunately, `--tree-filter true` doesn't solve the problem with the > repo at my work: not all old blobs are replaced with the new ones. I've > made a test repository to demonstrate it; it's a huge one (115M), but I > couldn't make it much smaller, because the bug fails to reproduce if the > repo is not big enough: > > https://github.com/anatolyborodin/git-filter-branch-bug > > There are some description and instructions in `README.md`. I hope you > will be able to reproduce it on your machine, if not - just add more > files :) > > I've debugged the test case and found the place where `git diff-index` > behaves differently regarding `aa/bb.dat`: > > read-cache.c +351 ie_match_stat(): > ... > if (!changed && is_racy_timestamp(istate, ce)) { > if (assume_racy_is_modified) > changed |= DATA_CHANGED; > else > changed |= ce_modified_check_fs(ce, st); > } > ... > > After `git-checkout-index` the blob hash for `aa/bb.dat` in the index is > 88a0f09b9b2e4ccf2faec89ab37d416fba4ee79d (the huge binary), but the file > on disk is a text file "This file was to big, and it has been removed." > with the hash 16e0939430610600301680d5bf8a24a22ff8b6c4. Right, that makes sense. The index doesn't know anything about the replace mechanism. So it assumes that what it wrote matches what is in the stat cache (i.e., some sha1 and the matching stat info). Later, when git wants to know "should I bother reading this file back in and computing its sha1", the stat cache says no. And then as you noticed, it sometimes "works" because if the file and index timestamps are the same, we err on the side of re-reading. So more files means more likelihood of crossing the one-second boundary. > I don't know if it should be considered to be a bug in in the logic of > `git checkout-index`, or `git diff-index` / `git update-index`. I'd say if any, it is the fault of checkout-index for writing out content that does not match the sha1 in the index, but writing out the stat information as if it is clean. For your case, you'd obviously prefer that the file be marked dirty and re-read later. But I'm not sure whether other users of replace refs would want the same behavior. They may want to silently pretend as if the data is unmodified. I'm not sure if anyone is doing that in practice, though; as you noted, the results are not deterministic, so it probably ends up just being a huge pain. So perhaps it would make sense to make the checkout operation aware of replaced blobs. As a workaround for your filter-branch, I suspect you could do `--tree-filter 'git ls-files | xargs touch'` or something, but that is going to be rather inefficient. By now you've probably realized that replaced blobs are not widely used, and there are a lot of corner cases around checking them out. So let's take a step back for a moment. What are you trying to accomplish with your filter-branch? If you just want to replace blob A with blob B, I think it might be easier to skip refs/replace entirely and just do so explicitly in an index-filter, like: --index-filter ' git ls-files -s | sed "s/$old_sha1/$new_sha1/" | git update-index --index-info ' -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html