Re: performance on repack

mkoegler@xxxxxxxxxxxxxxxxx (Martin Koegler) · Wed, 15 Aug 2007 19:11:52 +0200

On Wed, Aug 15, 2007 at 11:08:38AM -0400, Jon Smirl wrote:
> On 8/15/07, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote:
> > > Also... read_sha1_file() is currently not thread safe, and threaded
> > > delta searching would requires that its usage be serialized with a
> > > global mutex (not done in this patch which is a bug), or ideally be made
> > > thread aware.
> 
> You can avoid making all the low level calls thread safe by using the
> main thread to get everything into RAM before starting to search for
> the deltas. The worker threads would only deal with things completely
> in memory. You may need to ref count these in-memory objects if they
> are shared between worker threads. For simplicity the in-memory input
> objects should be read only by the threads. The worker threads create
> new structures to hand their results back to the main thread for
> writing to disk.

git-pack-objects knows the order, in which it will use the objects.  A
seperate thread could pre-read the next object and wait until the main
thread starts processing it. After the read is complete, another
thread could start computing the delta index.

git-pack-objects currently reads an object (and computes the delta
index), if it is really necessary. With the pre-read, unnecessary
operations would happen.

> Initially I would just ignore very large objects while getting the
> basic code to work. After the basic code is working if a very large
> object is encountered when the main thread is faulting objects in, the
> main thread should just process this object on the spot using the
> existing low memory code.

I expect that the biggest gain will be for big objects, as they
require more time to read+unpack the source objects and compute the
delta index as well as the delta.

> > @@ -1862,10 +1863,12 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
> >  void *read_sha1_file(const unsigned char *sha1, enum object_type *type,
> >                      unsigned long *size)
> >  {
> > +       static pthread_mutex_t locky = PTHREAD_MUTEX_INITIALIZER;
> >         unsigned long mapsize;
> >         void *map, *buf;
> >         struct cached_object *co;
> >
> > +       pthread_mutex_lock(&locky);
> >         co = find_cached_object(sha1);
> >         if (co) {
> >                 buf = xmalloc(co->size + 1);
> > @@ -1873,20 +1876,26 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type,
> >                 ((char*)buf)[co->size] = 0;
> >                 *type = co->type;
> >                 *size = co->size;
> > +               pthread_mutex_unlock(&locky);
> >                 return buf;
> >         }
> >
> >         buf = read_packed_sha1(sha1, type, size);

Couldn't we release the mutex at this point?

Why do we need to protect from concurrent access, when we are reading
a loose object?

> > -       if (buf)
> > +       if (buf) {
> > +               pthread_mutex_unlock(&locky);
> >                 return buf;
> > +       }
> >         map = map_sha1_file(sha1, &mapsize);
> >         if (map) {
> >                 buf = unpack_sha1_file(map, mapsize, type, size, sha1);
> >                 munmap(map, mapsize);
> > +               pthread_mutex_unlock(&locky);
> >                 return buf;
> >         }
> >         reprepare_packed_git();
> > -       return read_packed_sha1(sha1, type, size);
> > +       buf = read_packed_sha1(sha1, type, size);
> > +       pthread_mutex_unlock(&locky);
> > +       return buf;
> >  }

mfg Martin Kögler
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html