Re: [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()

Derrick Stolee <stolee@xxxxxxxxx> · Fri, 1 Dec 2017 14:50:05 -0500

On 12/1/2017 1:22 PM, Jeff King wrote:
On Fri, Dec 01, 2017 at 12:49:56PM -0500, Derrick Stolee wrote:
[snip]

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2160323c4a 100644
This overall looks good, but I noticed one bug and a few cosmetic
improvements.

Thanks for finding quality improvements to this patch. I'll let it sit 
over the weekend for more feedback before sending a v2.


--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
  	}
  
  	oid.hash[0] = subdir_nr;
+	strbuf_add(path, "/", 1);
Maybe:

   strbuf_addch(path, '/');

would be a bit more readable (it's also a tiny bit faster, but this
isn't part of the tight loop).

+	baselen = path->len;
We set this here so that the '/' is included as part of the base. Makes
sense, but can we now drop the earlier setting of baselen before the
opendir() call?

Yeah, probably. I had briefly considered just adding the '/' before the 
first assignment of "baselen", but didn't want to change the error 
output. I also don't know if there are side effects for other platforms 
by calling opendir() with a '/'-terminated path.

  	while ((de = readdir(dir))) {
  		if (is_dot_or_dotdot(de->d_name))
  			continue;
  
-		strbuf_setlen(path, baselen);
-		strbuf_addf(path, "/%s", de->d_name);
-
  		if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
  		    !hex_to_bytes(oid.hash + 1, de->d_name,
  				  GIT_SHA1_RAWSZ - 1)) {
+			strbuf_setlen(path, baselen);
+			strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);
Moving this code into the conditional makes sense here, since that's
where we know the actual size. But what about the case when we _don't_
hit this conditional. We still do:

   if (cruft_cb)
         cruft_cb(path->buf);

which is now broken (it will just get the directory name without the
entry appended).

Good catch! A big reason to pull it inside and use strbuf_add over 
strbuf_addstr is to avoid a duplicate strlen() calculation. However, I 
can store the length before the conditional.

I think the optimized versions probably needs to be something more like:

   while (de = readdir(...)) {
       strbuf_setlen(path, baselen);
       if (size is HEXSZ - 2) {
           strbuf_add(path, de->d_name, HEXSZ - 2);
	  obj_cb(path->buf);
       } else if (cruft_cb) {
           strbuf_addstr(path, de->d_name);
	  cruft_cb(path->buf);
       }
   }

Small change by storing the length in advance of the conditional:

while (de = readdir(...)) {
    int namelen = strlen(de->d_name);
    strbuf_setlen(path, baselen);
    strbuf_add(path, de->d_name, namelen);

    if (namelen == HEXSZ - 2)
        obj_cb(path->buf)
    else
        cruft_cb(path->buf);
}

Two other thoughts:

   - from past discussions, I suspect most of your performance
     improvement actually comes from avoiding addf(), and the difference
     between addstr() and add() may be negligible here. It might be worth
     timing strbuf_addstr(). If that's similarly fast, that means we
     could keep the logic out of the conditional entirely.

addstr() duplicates the strlen(), which isn't much but we might as well 
save cycles where we can if it isn't too hard.

   - there's an extra micro-optimization there, which is that if there's
     no obj_cb, we have no need to assemble the full path at all. I doubt
     it makes much of a difference, as most callers would pass an object
     callback (I'd be surprised if we even have one that doesn't).

After doing a few 'git grep' commands, I found several that include a 
NULL cruft_cb but none that have a NULL obj_cb.

Thanks,
-Stolee