Re: [PATCH v3] Threaded grep

On Mon, Jan 18, 2010 at 10:05:19AM -0800, Linus Torvalds wrote:
> On my machine (4 cores with HT, so 8 threads total), I get the 
> following:
> 
>  [ Note: the --no-ext-grep is because I'm comparing with a git that has 
>   the original grep optimization, but hasn't removed the external grep 
>   functionality yet. I just used the same command line when then comparing 
>   to next+your patch, so don't mind it.
> 
>   Also, the three examples are chosen to be: "trivial single match", 
>   "regex single match", and "trivial lots of matches" ]


You may be testing slightly different things: next has 885d211 (grep:
rip out pessimization to use fixmatch()), and bbc09c2 (grep: rip out
support for external grep) was added before it. So next+patch uses
regexec, while the version that still supports external greps uses
strstr for fixed strings. IIRC strstr was a bit faster than regexec on
your box. However, this only explains a small part of the results you
are seeing.
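
To make the strstr vs. regexec gap concrete in isolation, here is a
hypothetical stand-alone micro-benchmark, not taken from git: the
buffer contents, size and iteration count are made up, and it only
illustrates that a fixed-string search can be cheaper than running the
regex engine over the same data:

#include <regex.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static double elapsed(clock_t start)
{
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
	static char haystack[1 << 20];	/* 1 MB of 'a', NUL-terminated */
	regex_t re;
	clock_t start;
	int i;

	memset(haystack, 'a', sizeof(haystack) - 1);

	/* fixed-string search, as the old fixmatch() path did */
	start = clock();
	for (i = 0; i < 1000; i++)
		(void)strstr(haystack, "qwerty");
	printf("strstr:  %.3fs\n", elapsed(start));

	/* same pattern through the regex engine */
	if (regcomp(&re, "qwerty", REG_EXTENDED | REG_NEWLINE))
		return 1;
	start = clock();
	for (i = 0; i < 1000; i++)
		(void)regexec(&re, haystack, 0, NULL, 0);
	printf("regexec: %.3fs\n", elapsed(start));
	regfree(&re);
	return 0;
}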

> 
> Before (all best-of-five), with the non-threaded internal grep:
> 
>  - git grep --no-ext-grep qwerty
> 
> 	real	0m0.311s
> 	user	0m0.164s
> 	sys	0m0.144s
> 
>  - git grep --no-ext-grep qwerty.*as
> 
> 	real	0m0.555s
> 	user	0m0.416s
> 	sys	0m0.132s
> 
>  - git grep --no-ext-grep function
> 
> 	real	0m0.461s
> 	user	0m0.276s
> 	sys	0m0.180s
> 
> After, with threading:
> 
>  - git grep --no-ext-grep qwerty
> 
> 	real	0m0.152s
> 	user	0m0.788s
> 	sys	0m0.212s
> 
>  - git grep --no-ext-grep qwerty.*as
> 
> 	real	0m0.148s
> 	user	0m0.724s
> 	sys	0m0.284s
> 
>  - git grep --no-ext-grep function
> 
> 	real	0m0.241s
> 	user	0m1.436s
> 	sys	0m0.216s
> 
> so now it's a clear win in all cases. And as you'd expect, the "complex 
> case with single output" is the one that wins most, since it's the one 
> that should have the most CPU usage, with the least synchronization.
> 
> That said, it still does waste quite a bit of time doing this, and while 
> it almost doubles performance in the last case too, it does so at the cost 
> of quite a bit of overhead.
> 
> (Some overhead is natural, especially since I have HT. Running two threads 
> on the same core does _not_ give double the performance, so the CPU time 
> almost doubling - because the threads themselves run slower - is not 
> unexpected. It's when the CPU time more than quadruples that I suspect 
> that it could be improved a bit).


I haven't seen this overhead in my tests, but I'm _guessing_ that the
problem is that the mutex grep_lock is held while the result is
written to stdout. So while one thread is writing, no significant
amount of work can be done by any other thread. Here is a patch to fix
this (applies on top of v3).
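
The resulting flow, stripped of the git-specific details, looks
roughly like the sketch below; the struct, globals and buffer size are
stubbed stand-ins rather than the real builtin-grep.c declarations,
and the actual change follows as a diff:

#include <pthread.h>
#include <stdio.h>

#define TODO_SIZE 128

struct work_item {
	int done;
	const char *out;	/* stand-in for the real strbuf result */
};

static struct work_item todo[TODO_SIZE];
static int todo_done, todo_start;
static int writing;
static pthread_mutex_t grep_lock = PTHREAD_MUTEX_INITIALIZER;

static void work_done(struct work_item *w)
{
	pthread_mutex_lock(&grep_lock);
	w->done = 1;

	/* Only one thread at a time may become the writer. */
	while (!writing && todo[todo_done].done && todo_done != todo_start) {
		int idx = todo_done;
		int end = todo_start;

		writing = 1;

		/* Flush the completed slots without holding the lock,
		 * so the other threads can keep grepping meanwhile. */
		pthread_mutex_unlock(&grep_lock);
		for (; todo[idx].done && idx != end; idx = (idx + 1) % TODO_SIZE)
			fputs(todo[idx].out, stdout);

		/* Re-take the lock to publish how far we got; loop in
		 * case more items completed while we were writing. */
		pthread_mutex_lock(&grep_lock);
		writing = 0;
		todo_done = idx;
	}
	pthread_mutex_unlock(&grep_lock);
}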



diff --git a/builtin-grep.c b/builtin-grep.c
index 8fca7a6..422b0ec 100644
--- a/builtin-grep.c
+++ b/builtin-grep.c
@@ -139,21 +139,48 @@ static int grep_file_async(struct grep_opt *opt, char *name,
 	return 0;
 }
 
+/* Mark the work_item as done. The function guarantees that w->done is
+ * set to 1 and that if w is included in a prefix of the range
+ * [todo_done, todo_start) where each work_item has .done == 1, then
+ * w->out will eventually be written to stdout. The writing is done
+ * either by this thread or by the thread which has set 'writing' to
+ * 1.
+ */
 static void work_done(struct work_item *w)
 {
-	int old_done;
+	/* 1 if there is a thread which is writing data to stdout and
+	   0 otherwise. */
+	static int writing = 0;
+	int old_done, write_idx, write_to;
 
 	pthread_mutex_lock(&grep_lock);
 	w->done = 1;
 	old_done = todo_done;
-	for(; todo[todo_done].done && todo_done != todo_start;
-	    todo_done = (todo_done+1) % ARRAY_SIZE(todo)) {
-		w = &todo[todo_done];
-		write_or_die(1, w->out.buf, w->out.len);
-		if (w->type == WORK_BUF)
-			free(w->buf);
-
-		free(w->name);
+
+	/* We need a loop here if todo_start is changed while we write
+	 * some data. */
+	while (!writing && todo[todo_done].done && todo_done != todo_start) {
+		write_idx = todo_done;
+		write_to = todo_start;
+		writing = 1;
+
+		/* Unlock the mutex while we write the data. This is
+		 * safe as writing == 1 and hence there is at most one
+		 * thread which writes data at any time. */
+		pthread_mutex_unlock(&grep_lock);
+		for(; todo[write_idx].done && write_idx != write_to;
+		    write_idx = (write_idx+1) % ARRAY_SIZE(todo)) {
+			w = &todo[write_idx];
+			write_or_die(1, w->out.buf, w->out.len);
+			if (w->type == WORK_BUF)
+				free(w->buf);
+
+			free(w->name);
+		}
+
+		pthread_mutex_lock(&grep_lock);
+		writing = 0;
+		todo_done = write_idx;
 	}
 
 	if (old_done != todo_done)
--