Re: [PATCH] Teach "git add" and friends to be paranoid

On Thu, Feb 18, 2010 at 07:04:56PM -0600, Jonathan Nieder wrote:
> Nicolas Pitre wrote:
> > On Thu, 18 Feb 2010, Junio C Hamano wrote:
> >> I suspect that opening to mmap(2), hashing once to compute the object
> >> name, and deflating it to write it out, will all happen within the same
> >> second, unless you are talking about a really huge file, or you started at
> >> very near a second boundary.
> >
> > How is the index dealing with this?  Surely if a file is added to the 
> > index and modified within the same second then 'git status' will fail to 
> > notice the changes.  I'm not familiar enough with that part of Git.
> 
> See Documentation/technical/racy-git.txt and t/t0010-racy-git.sh.
> 
> Short version: in the awful case, the timestamp of the index is the
> same as (or before) the timestamp of the file.  Git will notice this
> and re-hash the tracked file.

As far as I can tell, the index doesn't handle this case at all.

Suppose the file is modified during git add near the beginning of the
file, after git add has read that part of the file, but the modifications
finish before git add does.  Now the mtime of the file is earlier
than the index timestamp, but the file contents don't match the index.
This holds even if the objects git adds to the index aren't corrupted.
Actually right now you can have all four combinations:  index up to date
or not, and object matching its sha1 hash or not, depending on where and
when you modify data during an index update.
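
To make the scenario concrete, here is a small simulation. It is only a sketch: FakeFile, simulated_add and overwrite_start are hypothetical names, the "clock" is a simple counter, and none of this is how git itself reads files or writes the index. It shows a writer changing the first chunk after the reader has consumed it, with the write finishing before the index timestamp is taken.

```python
import hashlib


class FakeFile:
    """In-memory stand-in for a file: a list of byte chunks plus an mtime."""

    def __init__(self, chunks, mtime):
        self.chunks = list(chunks)
        self.mtime = mtime

    def content_hash(self):
        return hashlib.sha1(b"".join(self.chunks)).hexdigest()


def simulated_add(f, clock, writer=None):
    """Hash f chunk by chunk; writer(i, f, now) may modify f after chunk i is read.

    Returns (recorded_hash, index_timestamp). clock is a one-element list
    used as a logical clock that ticks once per chunk read.
    """
    h = hashlib.sha1()
    for i in range(len(f.chunks)):
        h.update(f.chunks[i])
        clock[0] += 1
        if writer:
            writer(i, f, clock[0])
    clock[0] += 1  # the index is written after the last read
    return h.hexdigest(), clock[0]


def overwrite_start(i, f, now):
    # Concurrent writer: rewrites chunk 0 right after the reader consumed it.
    if i == 0:
        f.chunks[0] = b"CHANGED!"
        f.mtime = now


f = FakeFile([b"aaaaaaaa", b"bbbbbbbb", b"cccccccc"], mtime=0)
clock = [0]
recorded, index_time = simulated_add(f, clock, overwrite_start)

looks_clean = f.mtime < index_time             # timestamp comparison: "unchanged"
actually_clean = recorded == f.content_hash()  # content comparison: changed
print(looks_clean, actually_clean)
```

In this run the recorded hash is a valid hash of the old contents, so the stored object is not corrupted, yet the file on disk differs from it and its mtime is older than the index timestamp.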

racy-git.txt doesn't discuss concurrent modification of files with the
index.  It only discusses low-resolution file timestamps and modifications
at times that are close to, but not concurrent with, index modifications.

Git probably doesn't handle things like NTP time corrections
(especially those where time moves backward by sub-second intervals) or
mismatched server/client clocks on remote filesystems either (mind you,
I know of no SCM that currently handles that case, and CVS in particular
is unusually bad at it).

Personally, I find the combination of nanosecond-precision timestamps
and network file systems amusing.  At nanosecond precision, relativistic
effects start to matter across a volume of space the size of my laptop.
I'm not sure how timestamps at any resolution could be a reliable metric
for detecting changes to file contents in the general case.  A valuable
hint in many cases, but not authoritative (unless they all come from a
single monotonic high-resolution clock guaranteed to increment faster than
git--but they don't).

rsync solves this sort of problem with a 'modification window' parameter,
which is a time interval that is "close enough" to consider two timestamps
to be equal.  Some of rsync's use cases set that window to six months.
Git would use a modification window for the opposite reason rsync
does--rsync uses the window to avoid unnecessarily examining files that
have different timestamps, while git would use it to re-examine files
even when it appears to be unnecessary.
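
The two opposite uses of a window can be sketched side by side. Both function names are made up for illustration; the first mirrors the idea behind rsync's modification window, and the second is only my guess at what a git-side check might look like, not existing git behaviour. Timestamps and the window are in seconds.

```python
def rsync_times_match(src_mtime, dst_mtime, window):
    """rsync-style: timestamps within `window` count as equal, so the file is skipped."""
    return abs(src_mtime - dst_mtime) <= window


def git_must_recheck(file_mtime, index_mtime, window):
    """Hypothetical git-style: an mtime within `window` of the index timestamp,
    or later than it, is not trusted, so the file contents are re-examined."""
    return file_mtime >= index_mtime - window


# Same pair of timestamps, same window, opposite conclusions:
print(rsync_times_match(1000, 1001, 2))  # close enough: skip the file
print(git_must_recheck(1000, 1001, 2))   # too close: look at the file again
print(git_must_recheck(990, 1001, 2))    # comfortably older: trust the index
```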

Git probably wants the modification window to be the maximum clock
offset between a network filesystem client and server plus the minimum
representable interval in the filesystem's timestamp data type--which
is a value git couldn't possibly know for some cases, so it needs input
from the user.
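
As a worked example of that sum, with made-up numbers: the function name is hypothetical, and both inputs are assumed to be supplied by the user, since git has no way to measure them. Integer nanoseconds are used to avoid floating-point rounding.

```python
def modification_window_ns(max_clock_offset_ns, fs_timestamp_resolution_ns):
    """Window = worst-case client/server clock offset + smallest timestamp step."""
    return max_clock_offset_ns + fs_timestamp_resolution_ns


# Assumed values, purely illustrative: client and server clocks may differ
# by up to 500 ms, and the filesystem stores timestamps in whole seconds.
window_ns = modification_window_ns(500_000_000, 1_000_000_000)
print(window_ns)
```

With those inputs any file whose mtime is less than 1.5 seconds older than the index timestamp would be re-examined.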
