On Sun, May 29, 2011 at 11:33:38PM -0700, Junio C Hamano wrote: > Jeff King <peff@xxxxxxxx> writes: > > > @@ -852,6 +862,29 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent, > > } > > } > > > > + if (userdiff->binary != -1) > > + is_binary = userdiff->binary; > > + else { > > + is_binary = buffer_is_binary(result, result_size); > > + for (i = 0; !is_binary && i < num_parent; i++) { > > + char *buf; > > + unsigned long size; > > + buf = grab_blob(elem->parent[i].sha1, > > + elem->parent[i].mode, > > + &size); > > + if (buffer_is_binary(buf, size)) > > + is_binary = 1; > > + free(buf); > > + } > > + } > > Two comments. > > - This loop will grab the blob from all parents just to peek and discard > for most of commits. It feels somewhat wasteful, especially because > "binary" that is not marked as binary is a rare minor case (most of the > paths are text). Stopping immediately at the first binary blob does not > optimize the loop even though it is better than not having one. Yeah, I am not too happy with that bit. Our choices are: 1. Grab each blob, check binary-ness, and free. This double-loads in the common, non-binary case. 2. Grab each blob, check binary-ness, and keep it in memory. This means using N times as much memory, where N is the number of parents. In practice, N generally equals 2, and the file sizes aren't gigantic, so it's not that bad. But there are some corner cases. 3. Choose (1) or (2) based on file size, or based on the number of parents. The problem cases for (2) are going to be big files (bigger than bigFileThreshold?), and gigantic numbers of parents. I'd also eventually like to do a "peek" binary-ness check on top of your streaming work, as I showed for the non-combined diff case (speaking of which, I need to polish that now that the streaming series seems to have settled). And we would probably want to peek only in the big file case. I'll try to take a look at it this week and get some measurements on (1) versus (2) for both speed and peak memory usage. And then see if I can do better with (3), and implement the "peek" solution both here and in regular diff. > - It may make sense to compare [i].sha1 with earlier [j].sha1 (j < i) and > avoid grab_blob() altogether? Cf. 3c39e9b (combine-diff: reuse diff > from the same blob., 2006-02-01). Yeah, that is probably worth doing, as it collapses the "N" above into "how many parents all modified the same file". I'll add it to my list to measure. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html