Re: Windows performance / threading file access

Stefan Zager <szager@xxxxxxxxxx> · Thu, 10 Oct 2013 22:35:47 -0700

On Thu, Oct 10, 2013 at 5:51 PM, Karsten Blees <karsten.blees@xxxxxxxxx>wrote:

> >> I've noticed that when working with a very large repository using msys
> >> git, the initial checkout of a cloned repository is excruciatingly
> >> slow (80%+ of total clone time).  The root cause, I think, is that git
> >> does all the file access serially, and that's really slow on Windows.
> >>
>
> What exactly do you mean by "excruciatingly slow"?
>
> I just ran a few tests with a big repo (WebKit, ~2GB, ~200k files). A full
> checkout with git 1.8.4 on my SSD took 52s on Linux and 81s on Windows.
> Xcopy /s took ~4 minutes (so xcopy is much slower than git). On a 'real' HD
> (WD Caviar Green) the Windows checkout took ~9 minutes.

I'm using blink for my test, which should be more or less indistinguishable
from WebKit.  I'm using a standard spinning disk, no SSD.  For my purposes,
I need to optimize this for "standard"-ish hardware, not best-in-class.

For my test, I first run 'git clone -n <repo>', and then measure the
running time of 'git checkout --force HEAD'.  On linux, the checkout
command runs in 0:12; on Windows, it's about 3:30.

> If your numbers are much slower, check for overeager virus scanners and
> probably the infamous "User Account Control" (On Vista/7 (8?), the
> luafv.sys driver slows down things on the system drive even with UAC turned
> off in control panel. The driver can be disabled with "sc config luafv
> start= disabled" + reboot. Reenable with "sc config luafv start= auto").

I confess that I am pretty ignorant about Windows, so I'll have to research
these.

>> Has anyone considered threading file access to speed this up?  In
> >> particular, I've got my eye on this loop in unpack-trees.c:
> >>
>
> Its probably worth a try, however, in my experience, doing disk IO in
> parallel tends to slow things down due to more disk seeks.

> I'd rather try to minimize seeks, ...
>

In my experience, modern disk controllers are very very good at this; it
rarely, if ever, makes sense to try and outsmart them.

But, from talking to Windows-savvy people, I believe the issue is not disk
seek time, but rather the fact that Windows doesn't cache file stat
information.  Instead, it goes all the way to the source of truth (i.e.,
the physical disk) every time it stats a file or directory.  That's what
causes the checkout to be so slow: all those file stats run serially.

Does that sound right?  I'm prepared to be wrong about this; but if no one
has tried it, then it's probably at least worth an experiment.

Thanks,

Stefan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html