Re: [PATCH v3] Threaded grep

[I am adding git@vger to the Cc list as this may be interesting to
others as well.]

On Mon, Jan 18, 2010 at 23:10, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Fredrik Kuivinen <frekui@xxxxxxxxx> writes:
>
>> diff --git a/builtin-grep.c b/builtin-grep.c
>> index 24ae1ce..dc07e9e 100644
>> --- a/builtin-grep.c
>> +++ b/builtin-grep.c
>> @@ -15,11 +15,257 @@
>>  #include "grep.h"
>>  #include "quote.h"
>>
>> +#ifndef NO_PTHREADS
>> +#include "thread-utils.h"
>> +#include <pthread.h>
>> +#endif
>> +
>>  static char const * const grep_usage[] = {
>>       "git grep [options] [-e] <pattern> [<rev>...] [[--] path...]",
>>       NULL
>>  };
>>
>> +static int use_threads = 1;
>> +
>> +#ifndef NO_PTHREADS
>> +#define THREADS 8
>> +static pthread_t threads[THREADS];
>
> I had to wonder why builtin-pack-objects.c doesn't have an array like
> this.  It uses an array of structures, each of which bundles
> per-thread work item data and a pthread_t.  Would an arrangement like
> that make this patch easier to read as well?

From a cursory look at builtin-pack-objects.c I think it depends on
how you structure the work you are about to do.

In builtin-pack-objects.c the work seems to be divided evenly among
the threads at the start. When a thread doesn't have anything left to
do, it steals work from some other thread.

In the current patch we instead add work to a FIFO from one thread,
and the consumer threads then pick work items off that FIFO. I think
the FIFO style suits the current structure quite well.
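
To make that concrete, the FIFO is essentially a bounded ring buffer
protected by a single mutex. Roughly like this (a simplified sketch,
not the patch verbatim; the condition variable names, the todo_end
index, and the exact work_item fields are made up for the
illustration, and shutdown handling is left out):

#include <stddef.h>
#include <pthread.h>

#define TODO_SIZE 128

struct work_item {
	char *name;	/* path or blob name to report in the output */
	char *buf;	/* contents read by the producer, or NULL if the
			   consumer should read the file itself */
	size_t size;
	int done;	/* set by a consumer when the item is finished */
};

static struct work_item todo[TODO_SIZE];
static int todo_start;	/* next item to be claimed by a consumer */
static int todo_done;	/* next item whose output has not been written */
static int todo_end;	/* where the producer adds the next item */

static pthread_mutex_t grep_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond_add = PTHREAD_COND_INITIALIZER;  /* "todo has free slots" */
static pthread_cond_t cond_work = PTHREAD_COND_INITIALIZER; /* "todo has unclaimed items" */

/* Producer: append one item, blocking while the ring buffer is full. */
static void add_work(char *name, char *buf, size_t size)
{
	pthread_mutex_lock(&grep_lock);
	while ((todo_end + 1) % TODO_SIZE == todo_done)
		pthread_cond_wait(&cond_add, &grep_lock);
	todo[todo_end].name = name;
	todo[todo_end].buf = buf;
	todo[todo_end].size = size;
	todo[todo_end].done = 0;
	todo_end = (todo_end + 1) % TODO_SIZE;
	pthread_cond_signal(&cond_work);
	pthread_mutex_unlock(&grep_lock);
}

/*
 * Consumer: claim the next unclaimed item, blocking while none exist.
 * The slot is not reused until work_done() (sketched further down in
 * this mail) has written its output and advanced todo_done, so the
 * returned pointer stays valid while the consumer greps it.
 */
static struct work_item *get_work(void)
{
	struct work_item *w;

	pthread_mutex_lock(&grep_lock);
	while (todo_start == todo_end)
		pthread_cond_wait(&cond_work, &grep_lock);
	w = &todo[todo_start];
	todo_start = (todo_start + 1) % TODO_SIZE;
	pthread_mutex_unlock(&grep_lock);
	return w;
}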

>> +static void* load_file(const char *filename, size_t *sz);
>
> Style: asterisks stick to identifiers, not types.  There are other
> places with the same issue in the patch, e.g.
>> +static struct work_item* get_work()
> that I'll omit from the rest of my comments.

Will fix. I looked carefully at the arguments, but forgot the return types.

>> +enum work_type {WORK_BUF, WORK_FILE};
>> +
>> +/* We use one producer thread and number_of_threads consumer
>> +   threads. The producer adds struct work_items to 'todo' and the
>> +   consumers pick work items from the same array. */
>
> Style:
>
>        /*
>         * We write multi line comments
>         * like this.
>         */

Will fix.

>> +#define TODO_SIZE 128
>
> How was this size tuned?  Can everybody (from teeny machines to beefy
> ones) use the same setting?

I simply increased it until I didn't see any further significant
performance improvement on the two boxes I have access to. As I only
have access to those two boxes, the value can most certainly be tuned
better for machines with significantly different characteristics.
However, I think that 128 is reasonable in the sense that it gives a
nice improvement for everyone with more than one core.

>> +static struct work_item todo[TODO_SIZE];
>> +static int todo_start = 0;
>
> We try not to explicitly initialize statics to 0 and let BSS segment
> take care of them.

Will fix.

>> +/* This lock protects all the variables above. */
>> +static pthread_mutex_t grep_lock = PTHREAD_MUTEX_INITIALIZER;
>
> J6t had comments on these initializers; do they also apply to
> builtin-pack-objects.c?

I believe so, but I do not know. Johannes?

>> +/* This function takes ownership of 'name' and 'buf'. They will be
>> +   deallocated with free. */
>> +static int grep_buffer_async(struct grep_opt *opt, char *name, char *buf,
>> +                          size_t size)
>> +{
>> +     add_work(WORK_BUF, name, buf, size);
>> +     return 0;
>> +}
>> +
>> +static int grep_file_async(struct grep_opt *opt, char *name,
>> +                        const char *filename)
>> +{
>> +     add_work(WORK_FILE, name, (char*) filename, 0);
>> +     return 0;
>> +}
>
> If I am reading the code correctly, it seems that you read quite many
> blobs in core and queue them in todo[] (up to the size limit of the
> array), and the worker bees in run() consume them.  Shouldn't the
> rate of feeding blobs be limited in some way (other than just having a
> fixed TODO_SIZE, but somehow linked to the rate the worker bees finish
> their work) to avoid wasting memory like this?  One approach might be
> to queue only the blob SHA-1, not the data, in the work item queue,
> just like you queue only the filename, not the contents, and have the
> worker bee read the blob into memory.

Hm, yes, that would be an improvement. One would just have to
serialize the calls to read_sha1_file (as I don't think read_sha1_file
can safely be called from multiple threads simultaneously).
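
Something along these lines should do (just a sketch; read_lock is a
name made up for this mail, and I'm assuming the usual read_sha1_file
signature from cache.h):

#include <pthread.h>
#include "cache.h"

/* Serializes all object reads done by the consumer threads. */
static pthread_mutex_t read_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Consumers would call this instead of read_sha1_file directly, so
 * that only the SHA-1 needs to be queued and the blob is read (one
 * reader at a time) right before it is grepped.
 */
static void *locked_read_sha1_file(const unsigned char *sha1,
				   enum object_type *type,
				   unsigned long *size)
{
	void *data;

	pthread_mutex_lock(&read_lock);
	data = read_sha1_file(sha1, type, size);
	pthread_mutex_unlock(&read_lock);
	return data;
}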

>> ...
>> +static int grep_buffer_internal(struct grep_opt *opt, char *name, char *buf,
>> +                             size_t size)
>> +{
>> +#ifndef NO_PTHREADS
>> +     if (use_threads)
>> +             return grep_buffer_async(opt, name, buf, size);
>> +#endif
>> +     {
>> +             int hit = grep_buffer(opt, name, buf, size);
>> +             free(name);
>> +             free(buf);
>> +             return hit;
>> +     }
>> +}
>> +
>>  static int grep_config(const char *var, const char *value, void *cb)
>>  {
>>       struct grep_opt *opt = cb;
>> @@ -159,21 +405,21 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1, const char
>>       if (opt->relative && opt->prefix_length) {
>>               quote_path_relative(name + tree_name_len, -1, &pathbuf, opt->prefix);
>>               strbuf_insert(&pathbuf, 0, name, tree_name_len);
>> -             name = pathbuf.buf;
>> +     } else {
>> +             strbuf_addstr(&pathbuf, name);
>>       }
>> -     hit = grep_buffer(opt, name, data, size);
>> -     strbuf_release(&pathbuf);
>> -     free(data);
>> +
>> +     hit = grep_buffer_internal(opt, strbuf_detach(&pathbuf, NULL),
>> +                                data, size);
>> +
>>       return hit;
>>  }
>
> This is quite puzzling code.  Your grep_buffer_internal() returns
> the return value from grep_buffer_async(), which always returns 0,
> but then that 0 is returned to the caller as "we didn't see any hit".
> What happens if we start optimizing for a case where the end user is
> interested only in one bit of information "does it find anything, or
> is there absolutely no hits?" and wanted to add an early return
> codepath to grep_tree() and grep_cache() so that we say "Ah, we have
> seen a hit, so we don't have to look in others"?
>
> IOW, this patch makes the variable "hit" meaningless and confuses the
> readers.  This actually is the point that most worries me from longer
> term maintenance point of view.

The idea was to preserve the current calling conventions as much as
possible, so that grep_cache and grep_tree could stay nearly
unchanged. So hit = 1 now means "we found something" and hit = 0 means
"we didn't find anything right now, but we may find something later
for this entry". The early return you sketched above will still
produce correct results, but it will not be an optimization for the
threaded code.

With the patch, the return value from grep_file and grep_sha1 has to
be ORed with the return value from wait_all to determine whether we
found something.
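
So a caller ends up looking roughly like this (an illustration of the
convention only; grep_paths is a made-up helper, not a function from
the patch):

/* Hypothetical caller, to illustrate the convention described above. */
static int grep_paths(struct grep_opt *opt, const char **paths, int nr)
{
	int hit = 0;
	int i;

	/*
	 * Each call may return 0 simply because the work was only
	 * queued; a consumer thread may still find a match later.
	 */
	for (i = 0; i < nr; i++)
		hit |= grep_file(opt, paths[i]);

	/*
	 * wait_all() waits for the consumer threads to finish the
	 * queued items and returns whether any of them found a match.
	 */
	hit |= wait_all();

	return hit;
}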

> How is the order of output controlled by the way?  I see you are
> keeping more than one threads from doing write_or_die() of the
> accumulated output at the same time with a mutex in work_done(), but
> if we have large files whose names come near the beginning and then a
> small file whose name comes later in the input, we should defer the
> output of hits from the later, smaller one until after the earlier,
> larger ones, even if the worker bee assigned to the smaller one
> finishes while the other worker bees are still searching in the
> larger ones; it is
> not obvious to me where you are guaranteeing that ordering.

The output order is controlled by todo_done, todo_start, and work_item::done.

The rule is that the range [todo_done, todo_start) (modulo
ARRAY_SIZE(todo)) in "todo" contains the work_items that have been or
are being processed by a consumer thread. If a particular work_item
has been processed then .done = 1, otherwise .done = 0. When a
work_item is marked as done in "work_done", we traverse a prefix of
the range [todo_done, todo_start), writing out the output of each
finished item, and we stop as soon as we see .done = 0. We then set
todo_done to the index of the first work_item with .done = 0.

In your example, todo_done = 0 and todo_start = 2. todo[0] represents
a large file and todo[1] represents a small file. If the small file
gets done first, we set todo[1].done to 1 in work_done. As todo_done = 0
and todo[0].done = 0, we will not write any output yet. When work_done
is called for todo[0], we set todo[0].done to 1. We will then write the
output for todo[0] (as todo_done = 0 and todo[0].done = 1), and after
that we write the output for todo[1].
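
In code, work_done looks roughly like this (a sketch that reuses the
ring-buffer globals and cond_add from the earlier sketch in this mail;
the strbuf "out" member holding an item's accumulated output is also
an assumption made for the illustration):

/*
 * Called by a consumer, with grep_lock NOT held, after it has
 * finished grepping the item and collected the output in w->out.
 */
static void work_done(struct work_item *w)
{
	int old_done;

	pthread_mutex_lock(&grep_lock);
	w->done = 1;

	/*
	 * Flush the finished prefix of [todo_done, todo_start) in
	 * queue order; a small file that finished early stays buffered
	 * until the larger files queued before it have been written.
	 */
	old_done = todo_done;
	while (todo_done != todo_start && todo[todo_done].done) {
		write_or_die(1, todo[todo_done].out.buf,
			     todo[todo_done].out.len);
		strbuf_release(&todo[todo_done].out);
		todo_done = (todo_done + 1) % ARRAY_SIZE(todo);
	}

	/* Advancing todo_done frees slots for the producer. */
	if (old_done != todo_done)
		pthread_cond_signal(&cond_add);

	pthread_mutex_unlock(&grep_lock);
}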

In the current patch the mutex grep_lock is held while the output is
written. I'm guessing that this unnecessary serialization is the cause
of the wasted CPU time Linus is seeing. (Strangely enough, I haven't
observed this phenomenon on the boxes I have tested on.)

> I am wondering if this can somehow be made into a change with a lot
> smaller impact, in spirit similar to "preloading".  The higher level
> loop may walk the input (be it cache, tree, or directory), issuing one
> grep_file() or grep_sha1() at a time just like the existing code, but
> the thread-aware code adds "priming" phase that (1) populates a "work
> queue" with a very small memory footprint, and that (2) starts more
> than one underlying grep on different files and blobs (up to the
> number of threads you are allowed, like your "#define THREADS 8"), and
> that (3) upon completion of one "work queue" item, the work queue item
> is marked with its result.
>
> Then grep_file() and grep_sha1() _may_ notice that the work queue item
> hasn't completed, and would wait in such a case, or just emits the
> output produced earlier (could be much earlier) by the worker bee.
>
> The low-memory-footprint work queue could be implemented as a lazy
> one, and may be very different depending on how the input is created.
> If we are grepping in the index, it could be a small array of <array
> index in active_cache[], done-bit, result-bit, result-strbuf> with a
> single integer that is a pointer into the index to indicate up to
> which index entry the work queue has been populated.
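
If I understand the proposal correctly, the work queue entry would be
something like this (just a sketch to check my understanding; all the
names are made up):

/* One entry per primed index entry, as sketched above. */
struct prime_item {
	int cache_index;	/* index into active_cache[] */
	unsigned done : 1;	/* a worker has finished this entry */
	unsigned hit : 1;	/* the worker found a match */
	struct strbuf out;	/* output produced by the worker */
};

/*
 * A small fixed window of primed entries, plus a single integer
 * saying how far into the index the priming has progressed.
 */
static struct prime_item prime_queue[2 * THREADS];
static int primed_up_to;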

I will have to think about this a bit more. It's getting late here.

- Fredrik
