Shawn O. Pearce wrote: > We might be able to fix this by altering the sort_pack function > in sha1_file.c to not only order by mtime, but also by the ratio > of the size of the .pack to the number of objects stored in it. > Any packfile with a high size/object ratio is likely to be what > Dana has been calling a "metadata" pack, holding things like tags, > commits, trees and small blobs. Its these packfiles that we want > to search first, as they are the most likely to be accessed. > > By pushing the megablob packs to the end of our packed_git search > list we won't tend to scan their indexes, as most of our objects > will be found earlier in the search list. Hence we will generally > avoid any costs associated with their indexes. So change the sort keys in rearrange_packed_git()/sort_pack() to local then alternate, increasing "deviation", <== NEW decreasing mtime (new then old) Each packfile has a "rank", which is the log10 of its average stored object size. "Deviation" is the number of standard deviations this number exceeds its mean, rounded down and clipped at zero. Deviation should be 0 in normal use, and positive for packfiles with significant populations of huge blobs. This definition of deviation is intended to override mtime only when sufficiently significant. Putting mtime before deviation in the sort key list would cause deviation to be irrelevant. Signed-off-by: Dana L. How <danahow@xxxxxxxxx> --- Makefile | 2 +- cache.h | 1 + sha1_file.c | 26 +++++++++++++++++++++++++- 3 files changed, 27 insertions(+), 2 deletions(-) diff --git a/Makefile b/Makefile index 29243c6..45f0a52 100644 --- a/Makefile +++ b/Makefile @@ -383,7 +383,7 @@ BUILTIN_OBJS = \ builtin-pack-refs.o GITLIBS = $(LIB_FILE) $(XDIFF_LIB) -EXTLIBS = -lz +EXTLIBS = -lz -lm # # Platform specific tweaks diff --git a/cache.h b/cache.h index cd875bc..630cd89 100644 --- a/cache.h +++ b/cache.h @@ -437,6 +437,7 @@ extern struct packed_git { uint32_t num_objects; int index_version; time_t mtime; + int deviation; int pack_fd; int pack_local; unsigned char sha1[20]; diff --git a/sha1_file.c b/sha1_file.c index 12d2ef2..1565d91 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -6,6 +6,7 @@ * This handles basic git sha1 object files - packing, unpacking, * creation etc. */ +#include <math.h> #include "cache.h" #include "delta.h" #include "pack.h" @@ -869,6 +870,14 @@ static int sort_pack(const void *a_, const void *b_) return -st; /* + * Packs with large "deviation" should be searched last. + * Such packs have significantly larger blobs. + */ + st = a->deviation - b->deviation; + if (st) + return st; + + /* * Younger packs tend to contain more recent objects, * and more recent objects tend to get accessed more * often. @@ -884,6 +893,7 @@ static void rearrange_packed_git(void) { struct packed_git **ary, *p; int i, n; + float sum1 = 0, sum2 = 0, mean, sigma; for (n = 0, p = packed_git; p; p = p->next) n++; @@ -892,8 +902,22 @@ static void rearrange_packed_git(void) /* prepare an array of packed_git for easier sorting */ ary = xcalloc(n, sizeof(struct packed_git *)); - for (n = 0, p = packed_git; p; p = p->next) + for (n = 0, p = packed_git; p; p = p->next) { + /* this is the log10 of the average object size (almost) */ + float rank = log10((p->pack_size + 1.0) / (p->num_objects + 1.0)); + sum1 += rank; + sum2 += rank * rank; ary[n++] = p; + } + mean = sum1 / n; + sigma = sqrt(sum2 / n - mean * mean); + for (p = packed_git; p; p = p->next) { + float rank = log10((p->pack_size + 1.0) / (p->num_objects + 1.0)); + int deviation = sigma > 0 ? floor((rank - mean) / sigma) : 0; + if ( deviation < 0 ) + deviation = 0; + p->deviation = deviation; + } qsort(ary, n, sizeof(struct packed_git *), sort_pack); -- 1.5.2.762.gd8c6-dirty - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html