On 12/7/2022 6:27 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
>> Writing the index is a critical action that takes place in multiple Git
>> commands. The recent performance improvements available with the sparse
>> index show how often the I/O costs around the index can affect different Git
>> commands, although reading the index takes place more often than a write.
>
> The sparse-index work is great in that it offers correctness while
> taking advantage of the knowledge of which part of the tree is
> quiescent and unused to boost performance. I am not sure a change
> to reduce file safety can be compared with it, in that one is pure
> improvement, while the other is trade-off.

I agree that this is a trade-off, and we should be careful about
whether we even make this a possibility for certain file formats.
The index is an interesting case for a couple of reasons:

1. Writes block users. Writing the index takes place in many
   user-blocking foreground operations, so the speed improvement
   directly impacts their use. Other file formats are typically
   written in the background (commit-graph, multi-pack-index) or
   are super-critical to correctness (pack-files).

2. Index files are short lived. It is rare that a user leaves an
   index for a long time with many staged changes. That's the
   condition that's required for losing an index file to cause a
   loss of work (or maybe I'm missing something). Outside of staged
   changes, the index can be completely destroyed and rewritten
   with minimal impact to the user.

> As long as we will keep the "create into a new file, write it fully
> and fsync + rename to the final" pattern, we do not need the trailing
> checksum to protect us from a truncated output due to index-writing
> process dying in the middle, so I do not mind that trade-off, though.
>
> Protecting files from bit flipping filesystem corruption is a
> different matter. Folks at hosting sites like GitHub would know how
> often they detect object corruption (I presume they do not have to
> deal with the index file on the server end that often, but loose and
> pack object files have the trailing checksums the same way) thanks
> to the trailing checksum, and what the consequences are if we lost
> that safety (I am guessing it would be minimum, though).

I agree that we need to be careful about which files get this
treatment. But I also want to point out that I'm not using hosting
servers as evidence that this has worked in practice. The evidence
comes instead from many developer machines in large monorepos that
have had this enabled (via the microsoft/git fork) for years. We've
not come across an instance where the loss of a trailing hash has
been an issue.

Thanks,
-Stolee
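
For illustration, the "create into a new file, write it fully and
fsync + rename to the final" pattern quoted above, with an optional
trailing checksum, could be sketched roughly as follows. This is a
minimal Python sketch under stated assumptions, not Git's actual C
implementation: the names write_atomically and verify_trailing_hash,
the trailing_hash flag, and the choice of a SHA-1 over the whole
payload are all hypothetical and exist only for this example.

```python
import hashlib
import os
import tempfile

HASH_LEN = 20  # length of a SHA-1 digest in bytes


def write_atomically(path, payload, trailing_hash=True):
    """Write payload to a temp file, fsync it, then rename it over path.

    Hypothetical helper: a reader sees either the old file or the
    complete new one, never a truncated write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            if trailing_hash:
                # Optional: append a checksum of the payload.
                f.write(hashlib.sha1(payload).digest())
            f.flush()
            os.fsync(f.fileno())  # push the data to stable storage
        os.replace(tmp_path, path)  # atomic rename onto the final name
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def verify_trailing_hash(path):
    """Return True if the last 20 bytes are the SHA-1 of what precedes them."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < HASH_LEN:
        return False
    body, stored = data[:-HASH_LEN], data[-HASH_LEN:]
    return hashlib.sha1(body).digest() == stored
```

In this sketch the rename step is what guards against a writer dying
mid-write. The trailing hash only adds detection of later corruption
(for example, a flipped bit on disk), at the cost of hashing the whole
payload on every write, which is the trade-off discussed above.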