Re: performance problem: "git commit filename"

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 14 Jan 2008 13:00:38 -0800 (PST)

On Mon, 14 Jan 2008, Junio C Hamano wrote:
> 
> If we are using different types anyway, we might want to start
> using time_t (a worse alternative is ulong which we use for
> timestamps everywhere else, which we probably want to convert to
> time_t as well).

Careful.

There are two issues, one trivial one and one important one:
 (a) trivially, right now, the code depends on the fact that the in-memory 
     structure is actually smaller than the on-disk one, to avoid having 
     to estimate the size of the allocation for the in-memory array. That 
     was a matter of gettign a quickly working and efficient patch (we do 
     *not* want to allocate those initial "struct cache_entry" entries one 
     by one, we want to allocate one big block!)

     This should be pretty easy to fix up, by just taking the sizes and 
     number of entries (which we do know) into account of the initial 
     allocation. However, it's made a bit more interesting by the 
     differing alignment of the "name" part (and the fact that we align 
     each individual on-disk and in-memory structure).

 (b) More importantly, the on-disk structures DO NOT CONTAIN the whole 
     stat information! The classic example of this is "ce_size": it's 
     32-bit, but it works even if you have a file that is larger than 32 
     bits in size! It just means that from a stat comparison standpoint, 
     we only compare the low 32 bits!

     This means that if you make "ce_size" be a "loff_t", for example, you 
     still need to then *compare* it in just an "unsigned int",  because 
     the upper bits aren't zero - they are "nonexistent".

that (b) is important, and is why some of the code changed from

	-       if (ce->ce_ino != htonl(st->st_ino))
	+       if (ce->ce_ino != (unsigned int) st->st_ino)

ie note how this didn't just remove the "htonl()", it replaced it by a 
"truncate to 'unsigned int'"!

So the fact that the types aren't necessarily the "native" types is 
actually *important*.

> Is there still a reason to insist that ce_flags should be a
> single field that is multi-purposed for storing stage, namelen
> and other flags?  Wouldn't the code become even simpler and
> safer if we separated them into individual fields?  For example,
> a piece like this:

No reason for that part, except I wanted to make this particular initial 
patch be as minimal as possible.

> I somehow had this impression that it was a huge deal to you
> that we do not have to read and populate each cache entry when
> reading from the existing index file, and thought that was the
> reason why we mmap and access the fields in network byte order.
> If that was my misconception, then I agree this is a good change
> to make everything else easier to write and much less error
> prone.

I was a bit worried about it, but I did make sure that the allocation is 
done as one single allocation, and I did time it. Doing a 

	git update-index --refresh

seems to be identical before and after, so the costs of conversion are 
either very small or are possibly counteracted by the fact that we then 
can avoid the byte-order conversion of individual words less at run-time.

		Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html