Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache

Duy Nguyen <pclouds@xxxxxxxxx> · Tue, 6 Feb 2018 19:27:07 +0700

On Tue, Feb 6, 2018 at 8:48 AM, Ben Peart <peartben@xxxxxxxxx> wrote:
> With the new behavior, making a change in dir1/, then calling status would
> update the dir1/ untracked cache entry but not write it out. On the next
> status, git would detect the change in dir1/ again and update the untracked

Thing only missing piece here I would add is, this dir1/ detection is
done by watchman. We have to contact watchman and ask the set of
changed paths since $TIME where $TIME is the last time we updated
untracked cache and invalidate those paths in core. Then it should
work correctly. I checked the watchman query in the fsmonitor hook and
I think it's correct so far.

Except one point. What if you activate fsmonitor, write down the
fsmonitor_last_update field there but _not_ create UNTR extension for
the same timestamp? UNTR extension is created only when
read_directory() is executed and we don't do that in every command. I
haven't checked fsmonitor.c carefully, maybe I'm missing something.

One thing I'd like to point out in this patch is untracked cache will
start degrading if you just make the initial UNTR extension once and
don't ever update it again. Dirty paths hit slow path and will start
poking the disk. If somebody accidentally change a minor thing in
every single directory (worst case scenario), the untracked cache
becomes useless. We need a threshold or something to start repairing
UNTR extension if it gets damaged too much.

If you rely on random index updates (e.g. the main entries got updated
and .git/index must be updated) to write UNTR extension down together.
Please don't do that, at least not this way. cache_changed mask should
reflect all dirty parts in .git/index. If UNTR extension is not marked
updated, it's legit to just skip generating/writing it down (e.g. if I
kept the old UNTR extension from the last time I read .git/index
around in memory)

> cache.  All of the other cached entries are still valid and the cache would
> be used for them.  The updated cache entry for dir1/ would not get persisted
> to disk until something that required the index to be written out.
>
> The behavior is correct in both cases.  You just don't get the benefit of
> the updated cache for the dir1/ entry until the index is persisted again.
> What you gain in exchange is that you don't have to write out the index
> which is (typically) a lot more expensive than checking dir1/ for changes.

This is another thing that bugs me. I know you're talking about huge
index files, but at what size should we start this sort of
optimization? Writing down a few MBs on linux is cheap enough that I
won't bother optimizing (and I get my UNTR extension repaired all the
time, so reduced lstat calls and stuff). This "typically" only comes
at certain size, what size?
-- 
Duy