On Mon, May 22, 2017 at 6:18 PM, Ben Peart <peartben@xxxxxxxxx> wrote: > On 5/20/2017 8:10 AM, Ævar Arnfjörð Bjarmason wrote: >>> >>> +== File System Monitor cache >>> + >>> + The file system monitor cache tracks files for which the >>> query-fsmonitor >>> + hook has told us about changes. The signature for this extension is >>> + { 'F', 'S', 'M', 'N' }. >>> + >>> + The extension starts with >>> + >>> + - 32-bit version number: the current supported version is 1. >>> + >>> + - 64-bit time: the extension data reflects all changes through the >>> given >>> + time which is stored as the seconds elapsed since midnight, >>> January 1, 1970. >>> + >>> + - 32-bit bitmap size: the size of the CE_FSMONITOR_DIRTY bitmap. >>> + >>> + - An ewah bitmap, the n-th bit indicates whether the n-th index entry >>> + is CE_FSMONITOR_DIRTY. >> >> >> We already have a uint64_t in one place in the codebase (getnanotime) >> which uses a 64 bit time for nanosecond accuracy, and numerous >> filesystems already support nanosecond timestamps (ext4, that new >> Apple thingy...). >> >> I don't know if any of the inotify/fsmonitor APIs support that yet, >> but it seems inevitable that that'll be added if not, in some >> pathological cases we can have a lot of files modified in 1 second, so >> using nanosecond accuracy means there'll be a lot less data to >> consider in some cases. >> >> It does mean this'll only work until the year ~2500, but that seems >> like an acceptable trade-off. >> > > I really don't think nano-second resolution is needed in this case for a few > reasons. > > The number of files that can change within a given second is limited by the > IO throughput of the underlying device. Even assuming a very fast device and > very small files and changes, this won't be that many files. > > Without this patch, git would have scanned all those files every time. With > this patch, git will only scan those files a 2nd time that are modified in > the same second that it did the first scan *that came before the first scan > started* (the "lots of files modified" section in the 1 second timeline > below). > > |------------------------- one second ---------------------| > |-lots of files modified - git status - more file modified-| > > Yes, some duplicate status checks can be made but its still a significant > win in any reasonable scenario. Especially when you consider that it is > pretty unusual to do git status/add/commit calls in the middle of making > lots of changes to files. I understand that we get most of the wins either way. I'm just wondering why not make it nanosecond-resolution now from the beginning since that's where major filesystems are heading already, and changing stuff like this can be a hassle once it's initially out there, whereas just dividing by 10^9 for APIs that need seconds from the beginning is easy & future-proof. There are some uses of git where this would probably matter in practice. E.g. consider a git-annex backed fileserver using ext4, currently git-annex comes with its own FS watcher as a daemon invoked via `git annex watch` to constantly add new files without killing performance via a status/add in a loop, with this a daemon like that could just run status/add in a loop, but on a busy repo the 1s interval size might start to matter as you're constantly inspecting larger intervals. More importantly though, I can't think of any case where having it in nanoseconds to begin with would do any harm. > In addition, the backing file system monitor (Watchman) supports number of > seconds since the unix epoch (unix time_t style). This means any support of > nano seconds by git is academic until someone provides a file system watcher > that does support nano second granularity. I haven't used watchman for anything non-trivial, but the documentation for the query API you seem to be using says you can request a {ctime,mtime}_ns field: https://github.com/facebook/watchman/blob/master/website/_docs/cmd.query.markdown#user-content-available-fields And isn't this the documentation for the "since" query you're using, saying you can specify any arbitrary fs timing field, such as a _ns time field: https://github.com/facebook/watchman/blob/master/website/_docs/expr.since.md ? > Finally, the fsmonitor index extension is versioned so that we can > seamlessly upgrade to nano second resolution later if we desire. Aside from my nanosecond bikeshedding, let's say we change the semantics of any of this in the future: The index has the version, but there's one argument passed to the hook without a version. Is the hook expected to potentially be reading the version from the index to make sense of whether (in this case) the argument is a mtime or mtime_ns? Wouldn't this make more sense than that on top, i.e. pass the version to the hook, untested (and probably whines about the sprintf format...): $ git diff -U1 diff --git a/cache.h b/cache.h index 6eafd34fff..3c63f179f8 100644 --- a/cache.h +++ b/cache.h @@ -346,2 +346,3 @@ struct index_state { struct untracked_cache *untracked; + uint32_t fsmonitor_version; time_t fsmonitor_last_update; diff --git a/fsmonitor.c b/fsmonitor.c index f5a9f818e8..7236b538ac 100644 --- a/fsmonitor.c +++ b/fsmonitor.c @@ -60,2 +60,4 @@ int read_fsmonitor_extension(struct index_state *istate, const void *data, return error("bad fsmonitor version %d", hdr_version); + else + istate->fsmonitor_version = hdr_version; @@ -137,2 +139,3 @@ static int query_fsmonitor(time_t last_update, struct strbuf *query_result) struct child_process cp = CHILD_PROCESS_INIT; + char version[1]; char date[64]; @@ -143,2 +146,3 @@ static int query_fsmonitor(time_t last_update, struct strbuf *query_result) + snprintf(version, sizeof(version), "%d", istate->fsmonitor_version); snprintf(date, sizeof(date), "%" PRIuMAX, (uintmax_t)last_update)