On Fri, Sep 28, 2018 at 06:24:58PM +0200, SZEDER Gábor wrote: > When unpack_trees() constructs a new index, it copies cache entries > from the original index [1]. prepare_to_write_split_index() has to > deal with this, and it has a dedicated code path for copied entries > that are present in the shared index, where it compares the cached > data in the corresponding copied and original entries. If the cached > data matches, then they are considered the same; if it differs, then > the copied entry will be marked for inclusion as a replacement entry > in the just about to be written split index by setting the > CE_UPDATE_IN_BASE flag. > > However, a cache entry already has its CE_UPDATE_IN_BASE flag set upon > reading the split index, if the entry already has a replacement entry > there, or upon refreshing the cached stat data, if the corresponding > file was modified. The state of this flag is then preserved when > unpack_trees() copies a cache entry from the shared index. > > So modify prepare_to_write_split_index() to check the copied cache > entries' CE_UPDATE_IN_BASE flag first, and skip the thorough > comparison of cached data if the flag is already set. OK so this is an optimization, not a bug fix. Right? > Note that comparing the cached data in copied and original entries in s/cached data/cached stat data/ ? I was confused for a bit. > the shared index might actually be entirely unnecessary. In theory > all code paths refreshing the cached stat data of an entry in the > shared index should set the CE_UPDATE_IN_BASE flag in that entry, and > unpack_trees() should preserve this flag when copying cache entries. > This means that the cached data is only ever changed if the > CE_UPDATE_IN_BASE flag is set as well. Our test suite seems to > confirm this: instrumenting the conditions in question and running the > test suite repeatedly with 'GIT_TEST_SPLIT_INDEX=yes' showed that the > cached data in a copied entry differs from the data in the shared > entry only if its CE_UPDATE_IN_BASE flag is indeed set. Yes I was probably just being paranoid (or sticking to simpler checks). I was told that split index is computation expensive and not doing unnecesary/expensive checks may help. But let's leave it for later. > + } else { > + /* > + * Thoroughly compare the cached data to see > + * whether it should be marked for inclusion > + * in the split index. > + * > + * This comparison might be unnecessary, as > + * code paths modifying the cached data do > + * set CE_UPDATE_IN_BASE as well. > + */ > + const unsigned int ondisk_flags = > + CE_STAGEMASK | CE_VALID | > + CE_EXTENDED_FLAGS; > + unsigned int ce_flags, base_flags, ret; > + ce_flags = ce->ce_flags; > + base_flags = base->ce_flags; > + /* only on-disk flags matter */ > + ce->ce_flags &= ondisk_flags; > + base->ce_flags &= ondisk_flags; > + ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data, > + offsetof(struct cache_entry, name) - > + offsetof(struct cache_entry, ce_stat_data)); > + ce->ce_flags = ce_flags; > + base->ce_flags = base_flags; Maybe make this block a separate function (compare_ce_content or something). The amount of indentation is getting too high. > + if (ret) > + ce->ce_flags |= CE_UPDATE_IN_BASE; > + } > discard_cache_entry(base); > si->base->cache[ce->index - 1] = ce; > } > -- > 2.19.0.361.gafc87ffe72 >