On 10/03/2014 10:29 PM, Jeff King wrote:
> Prune has to walk $GIT_DIR/objects/?? in order to find the
> set of loose objects to prune. Other parts of the code
> (e.g., count-objects) want to do the same. Let's factor it
> out into a reusable for_each-style function.
>
> Note that this is not quite a straight code movement. There
> are two differences:
>
>   1. The original code iterated from 0 to 256, trying to
>      opendir("$GIT_DIR/%02x"). The new code just does a
>      readdir() on the object directory, and descends into
>      any matching directories. This is faster on
>      already-pruned repositories, and should not ever be
>      slower (nobody ever creates other files in the object
>      directory).

This would change the order in which the objects are processed. I doubt
that matters to anybody, but it's probably worth mentioning in the
commit message.

>   2. The original code had strange behavior when it found a
>      file of the form "[0-9a-f]{2}/.{38}" that did _not_
>      contain all hex digits. It executed a "break" from the
>      loop, meaning that we stopped pruning in that directory
>      (but still pruned other directories!). This was
>      probably a bug; we do not want to process the file as
>      an object, but we should keep going otherwise.
>
> Signed-off-by: Jeff King <peff@xxxxxxxx>
> ---
> I admit the speedup in (1) almost certainly doesn't matter. It is real,
> and I found out about it while writing a different program that was
> basically "count-objects" across a large number of repositories. However
> for a single repo it's probably not big enough to matter (calling
> count-objects in a loop will get dominated by the startup costs). The
> end result is a little more obvious IMHO, but that's subjective.
>
>  builtin/prune.c | 87 ++++++++++++++++------------------------------------
>  cache.h         | 31 +++++++++++++++++++
>  sha1_file.c     | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 152 insertions(+), 61 deletions(-)
>
> [...]
> diff --git a/cache.h b/cache.h
> index cd16e25..7abe7f6 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1239,6 +1239,37 @@ extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsig
>  extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
>  extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
>
> +/*
> + * Iterate over the files in the loose-object parts of the object
> + * directory "path", triggering the following callbacks:
> + *
> + *  - loose_object is called for each loose object we find.
> + *
> + *  - loose_cruft is called for any files that do not appear to be
> + *    loose objects.
> + *
> + *  - loose_subdir is called for each top-level hashed subdirectory
> + *    of the object directory (e.g., "$OBJDIR/f0"). It is called
> + *    after the objects in the directory are processed.
> + *
> + * Any callback that is NULL will be ignored. Callbacks returning non-zero
> + * will end the iteration.
> + */
> +typedef int each_loose_object_fn(const unsigned char *sha1,
> +				 const char *path,
> +				 void *data);
> +typedef int each_loose_cruft_fn(const char *basename,
> +				const char *path,
> +				void *data);
> +typedef int each_loose_subdir_fn(const char *basename,
> +				 const char *path,
> +				 void *data);
> +int for_each_loose_file_in_objdir(const char *path,
> +				  each_loose_object_fn obj_cb,
> +				  each_loose_cruft_fn cruft_cb,
> +				  each_loose_subdir_fn subdir_cb,
> +				  void *data);
> +
>  struct object_info {
>  	/* Request */
>  	enum object_type *typep;
> diff --git a/sha1_file.c b/sha1_file.c
> index bae1c15..9fdad47 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -3218,3 +3218,98 @@ void assert_sha1_type(const unsigned char *sha1, enum object_type expect)
>  		die("%s is not a valid '%s' object", sha1_to_hex(sha1),
>  		    typename(expect));
>  }
> +
> +static int opendir_error(const char *path)
> +{
> +	if (errno == ENOENT)
> +		return 0;
> +	return error("unable to open %s: %s", path, strerror(errno));
> +}
> +
> +static int for_each_file_in_obj_subdir(struct strbuf *path,
> +				       const char *prefix,
> +				       each_loose_object_fn obj_cb,
> +				       each_loose_cruft_fn cruft_cb,
> +				       each_loose_subdir_fn subdir_cb,
> +				       void *data)
> +{
> +	size_t baselen = path->len;
> +	DIR *dir = opendir(path->buf);
> +	struct dirent *de;
> +	int r = 0;
> +
> +	if (!dir)
> +		return opendir_error(path->buf);

OK, so if there is a non-directory named $GIT_DIR/objects/33, then we
emit an "unable to open" error rather than treating it as cruft. I think
this is reasonable.
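
To make the reasoning behind that concrete: POSIX specifies that
opendir() fails with ENOTDIR when the path names something that is not a
directory, so the ENOENT check in opendir_error() does not swallow that
case. The following is only a minimal standalone sketch of that policy.
The names opendir_errno_is_fatal() and probe_object_subdir() are
hypothetical and are not part of the patch, and plain fprintf() stands
in for git's error().

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the policy of the patch's
 * opendir_error(): only a missing path (ENOENT) is tolerated silently.
 */
static int opendir_errno_is_fatal(int err)
{
	return err != ENOENT;
}

/*
 * Try to open "path" as a directory. Returns 0 if it opened or simply
 * does not exist, and -1 (after printing a message) for any other
 * failure, such as ENOTDIR for a regular file.
 */
static int probe_object_subdir(const char *path)
{
	DIR *dir = opendir(path);
	int saved_errno;

	if (dir) {
		closedir(dir);
		return 0;
	}

	/* Save errno before calling anything that might overwrite it. */
	saved_errno = errno;
	if (!opendir_errno_is_fatal(saved_errno))
		return 0;

	fprintf(stderr, "unable to open %s: %s\n", path, strerror(saved_errno));
	return -1;
}
```

With this shape, a regular file sitting where a hashed subdirectory is
expected takes the error path instead of being handed to the cruft
callback.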
> +
> +	while ((de = readdir(dir))) {
> +		if (is_dot_or_dotdot(de->d_name))
> +			continue;
> +
> +		strbuf_setlen(path, baselen);
> +		strbuf_addf(path, "/%s", de->d_name);
> +
> +		if (strlen(de->d_name) == 38) {
> +			char hex[41];
> +			unsigned char sha1[20];
> +
> +			memcpy(hex, prefix, 2);
> +			memcpy(hex + 2, de->d_name, 38);
> +			hex[40] = 0;
> +			if (!get_sha1_hex(hex, sha1)) {
> +				if (obj_cb) {
> +					r = obj_cb(sha1, path->buf, data);
> +					if (r)
> +						break;
> +				}
> +				continue;
> +			}
> +		}
> +
> +		if (cruft_cb) {
> +			r = cruft_cb(de->d_name, path->buf, data);

So, files *and* directories at the $GIT_DIR/objects/XX/ level are
reported as cruft (as opposed to, say, descending into the directories
and reporting any files found deeper in the hierarchy). This seems fine,
too.

> +			if (r)
> +				break;
> +		}
> +	}
> +	if (!r && subdir_cb)
> +		r = subdir_cb(de->d_name, path->buf, data);

By my reading, path->buf still contains the name of the last file in the
directory at this point. I assume you want to pass it the original
"baselen"-length path here.

> +	closedir(dir);
> +	return r;

...and anyway, it would be more polite to restore the path strbuf to its
original length before returning.

> +}
> +
> +int for_each_loose_file_in_objdir(const char *path,
> +				  each_loose_object_fn obj_cb,
> +				  each_loose_cruft_fn cruft_cb,
> +				  each_loose_subdir_fn subdir_cb,
> +				  void *data)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +	size_t baselen;
> +	DIR *dir = opendir(path);
> +	struct dirent *de;
> +	int r = 0;
> +
> +	if (!dir)
> +		return opendir_error(path);
> +
> +	strbuf_addstr(&buf, path);
> +	baselen = buf.len;
> +
> +	while ((de = readdir(dir))) {
> +		if (!isxdigit(de->d_name[0]) ||
> +		    !isxdigit(de->d_name[1]) ||
> +		    de->d_name[2])
> +			continue;

So other files or directories at the $GIT_DIR/objects/ level are just
ignored; they are not considered cruft. This is worth clarifying in the
docstring.
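
Going back to the two remarks about path->buf in
for_each_file_in_obj_subdir(): the idea is to truncate the buffer to
"baselen" once the loop is done, so that the subdir callback sees the
directory itself and the caller gets its buffer back unchanged. Below is
a rough standalone sketch of that pattern, not a proposed replacement
hunk. struct pathbuf, its helpers, walk_subdir() and record_path() are
all hypothetical stand-ins (git's real strbuf is not used), and the
directory entries come from an array rather than readdir(). The sketch
also passes "prefix" to the callback instead of de->d_name, because "de"
is NULL once a readdir() loop has run to completion; that choice is an
assumption of this sketch, not something the patch states.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for git's struct strbuf: fixed storage plus a length. */
struct pathbuf {
	char buf[256];
	size_t len;
};

/* Truncate to "len" bytes and keep the buffer NUL-terminated. */
static void pathbuf_setlen(struct pathbuf *p, size_t len)
{
	assert(len < sizeof(p->buf));
	p->len = len;
	p->buf[len] = '\0';
}

static void pathbuf_init(struct pathbuf *p, const char *path)
{
	size_t len = strlen(path);

	assert(len < sizeof(p->buf));
	memcpy(p->buf, path, len);
	pathbuf_setlen(p, len);
}

/* Append "/name"; on overflow, leave the buffer as it was. */
static void pathbuf_add_component(struct pathbuf *p, const char *name)
{
	size_t room = sizeof(p->buf) - p->len;
	int n = snprintf(p->buf + p->len, room, "/%s", name);

	if (n < 0 || (size_t)n >= room)
		pathbuf_setlen(p, p->len);
	else
		p->len += (size_t)n;
}

typedef int subdir_fn(const char *prefix, const char *path, void *data);

/* Example callback: remember the path the subdir callback was handed. */
static int record_path(const char *prefix, const char *path, void *data)
{
	(void)prefix;
	snprintf((char *)data, 256, "%s", path);
	return 0;
}

static int walk_subdir(struct pathbuf *path, const char *prefix,
		       const char *const *names, size_t nr,
		       subdir_fn *subdir_cb, void *data)
{
	size_t baselen = path->len;
	size_t i;
	int r = 0;

	for (i = 0; i < nr; i++) {
		pathbuf_setlen(path, baselen);
		pathbuf_add_component(path, names[i]);
		/* the object and cruft callbacks would see path->buf here */
	}

	/* Drop the last entry name so the callback sees the directory itself. */
	pathbuf_setlen(path, baselen);
	if (subdir_cb)
		r = subdir_cb(prefix, path->buf, data);
	return r;
}
```

Truncating before the callback covers both points at once: the callback
gets the subdirectory path, and the buffer is back at its original
length on return whether or not a callback was supplied.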
> +
> +		strbuf_addf(&buf, "/%s", de->d_name);
> +		r = for_each_file_in_obj_subdir(&buf, de->d_name, obj_cb,
> +						cruft_cb, subdir_cb, data);
> +		strbuf_setlen(&buf, baselen);
> +		if (r)
> +			break;
> +	}
> +
> +	closedir(dir);
> +	strbuf_release(&buf);
> +	return r;
> +}
>

Other than my comments above, it looks good to me.

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx