On Mon, Feb 19, 2024 at 03:39:08PM -0800, Junio C Hamano wrote: > Patrick Steinhardt <ps@xxxxxx> writes: > > > The `struct dir_iterator` is a helper that allows us to iterate through > > directory entries. This iterator returns entries in the exact same order > > as readdir(3P) does -- or in other words, it guarantees no specific > > order at all. > > > > This is about to become problematic as we are introducing a new reflog > > subcommand to list reflogs. As the "files" backend uses the directory > > iterator to enumerate reflogs, returning reflog names and exposing them > > to the user would inherit the indeterministic ordering. Naturally, it > > would make for a terrible user interface to show a list with no > > discernible order. While this could be handled at a higher level by the > > new subcommand itself by collecting and ordering the reflogs, this would > > be inefficient and introduce latency when there are many reflogs. > > I do not quite understand this argument. Why is sorting at higher > level less (or more, for that matter) efficient than doing so at > lower level? We'd need to sort somewhere no matter what, and I of > course have no problem in listing in a deterministic order. By sorting at a lower level we only need to sort the respective directory entries and can then return them without having to recurse into all subdirectories yet. Sorting at a higher level would require us to first collect _all_ reflogs and then sort them. Will rephrase a bit. > > Instead, introduce a new option into the directory iterator that asks > > for its entries to be yielded in lexicographical order. If set, the > > iterator will read all directory entries greedily end sort them before > > we start to iterate over them. > > "end" -> "and". And of course without such sorting option, this > codepath is allowed to yield entries in any order that is the > easiest to produce? That makes sense. > > > While this will of course also incur overhead as we cannot yield the > > directory entries immediately, it should at least be more efficient than > > having to sort the complete list of reflogs as we only need to sort one > > directory at a time. > > True. The initial latency before we see the first byte of the > output often matters more in perceived performance the throughput. > As we need to sort to give a reasonable output, that cannot be > avoided. > > > This functionality will be used in a follow-up commit. > > > > Signed-off-by: Patrick Steinhardt <ps@xxxxxx> > > --- > > dir-iterator.c | 87 ++++++++++++++++++++++++++++++++++++++++---------- > > dir-iterator.h | 3 ++ > > 2 files changed, 73 insertions(+), 17 deletions(-) > > > > diff --git a/dir-iterator.c b/dir-iterator.c > > index f58a97e089..396c28178f 100644 > > --- a/dir-iterator.c > > +++ b/dir-iterator.c > > @@ -2,9 +2,12 @@ > > #include "dir.h" > > #include "iterator.h" > > #include "dir-iterator.h" > > +#include "string-list.h" > > > > struct dir_iterator_level { > > DIR *dir; > > + struct string_list entries; > > + size_t entries_idx; > > Does it deserve a comment that "dir == NULL" is used as a signal > that we have read the level and sorted its contents into the > "entries" list (and also we have already called closedir(), of > course)? Yeah, probably. > > @@ -72,6 +75,40 @@ static int push_level(struct dir_iterator_int *iter) > > return -1; > > } > > > > + string_list_init_dup(&level->entries); > > + level->entries_idx = 0; > > + > > + /* > > + * When the iterator is sorted we read and sort all directory entries > > + * directly. > > + */ > > + if (iter->flags & DIR_ITERATOR_SORTED) { > > + while (1) { > > + struct dirent *de; > > + > > + errno = 0; > > + de = readdir(level->dir); > > + if (!de) { > > + if (errno && errno != ENOENT) { > > + warning_errno("error reading directory '%s'", > > + iter->base.path.buf); > > + return -1; > > + } > > + > > + break; > > + } > > + > > + if (is_dot_or_dotdot(de->d_name)) > > + continue; > > The condition to skip an entry currently is simple enough that "." > and ".." are the only ones that are skipped, but it must be kept in > sync with the condition in dir_iterator_advance(). > > If it becomes more complex than it is now (e.g., we may start to > skip any name that begins with a dot, like ".git" or ".dummy"), it > probably is a good idea *not* to add the same filtering logic here > and in dir_iterator_advance(). Instead, keep the filtering here to > an absolute minumum, and filter the name, whether it came from > readdir() or from the .entries string list, in a single copy of > filtering logic in dir_iterator_advance() function. > > We could drop the dot-or-dotdot filter here, too, if we want to > ensure that unified filtering will be correctly done over there. Fair point. As you mention further down below, there are two ways to approach it: - Just filter at the later stage and accept that we'll allocate memory for entries that are about to be discarded. - Create a function `should_include_entry()` that gets called at both code sites. I don't think the allocation overhead should matter much, but neither does it hurt to create a common `should_include_entry()` function. And as both are trivial to implement I rather lean towards the more efficient variant, even though the efficiency gain should be negligible. > > + string_list_append(&level->entries, de->d_name); > > + } > > + string_list_sort(&level->entries); > > + > > + closedir(level->dir); > > + level->dir = NULL; > > + } > > + > > return 0; > > } > > > > @@ -88,6 +125,7 @@ static int pop_level(struct dir_iterator_int *iter) > > warning_errno("error closing directory '%s'", > > iter->base.path.buf); > > level->dir = NULL; > > + string_list_clear(&level->entries, 0); > > > > return --iter->levels_nr; > > } > > It is somewhat interesting that the original code already has > conditional call to closedir() and prepares .dir to be NULL, > so that we do not have to make it conditional here. > > > @@ -136,30 +174,43 @@ int dir_iterator_advance(struct dir_iterator *dir_iterator) > > > > /* Loop until we find an entry that we can give back to the caller. */ > > while (1) { > > - struct dirent *de; > > struct dir_iterator_level *level = > > &iter->levels[iter->levels_nr - 1]; > > + struct dirent *de; > > + const char *name; > > Not a huge deal but this is an unnecessary reordering, right? Right. > > strbuf_setlen(&iter->base.path, level->prefix_len); > > + > > + if (level->dir) { > > + errno = 0; > > + de = readdir(level->dir); > > + if (!de) { > > + if (errno) { > > + warning_errno("error reading directory '%s'", > > + iter->base.path.buf); > > + if (iter->flags & DIR_ITERATOR_PEDANTIC) > > + goto error_out; > > + } else if (pop_level(iter) == 0) { > > + return dir_iterator_abort(dir_iterator); > > + } > > + continue; > > } > > > > + if (is_dot_or_dotdot(de->d_name)) > > + continue; > > This is the target of the "if we will end up filtering even more in > the future, it would probably be a good idea not to duplicate the > logic to decide what gets filtered in this function and in > push_level()" comment. If we wanted to go that route, we can get > rid of the filtering from push_level(), and move this filter code > outside this if/else before calling prepare_next_entry_data(). > > The fact that .entries.nr represents the number of entries that are > shown is unusable (because there is an unsorted codepath that does > not even populate .entries), so I am not worried about correctness > gotchas caused by including names in .entries to be filtered out. > But an obvious downside is that the size of the list to be sorted > will become larger. > > Or we could introduce a shared helper function that takes a name and > decides if it is to be included, and replace the is_dot_or_dotdot() > call here and in the push_level() with calls to that helper. > > In any case, that is primarily a maintainability issue. The code > posted as-is is correct. Yeah, let's use the proposed helper function. In fact, I think we can share even more code than merely the filtering part: the errno handling is a bit special, and the warning is the same across both code sites, too. Patrick > > + name = de->d_name; > > + } else { > > + if (level->entries_idx >= level->entries.nr) { > > + if (pop_level(iter) == 0) > > + return dir_iterator_abort(dir_iterator); > > + continue; > > + } > > + > > + name = level->entries.items[level->entries_idx++].string; > > + } > > + > > + if (prepare_next_entry_data(iter, name)) { > > if (errno != ENOENT && iter->flags & DIR_ITERATOR_PEDANTIC) > > goto error_out; > > continue; > > @@ -188,6 +239,8 @@ int dir_iterator_abort(struct dir_iterator *dir_iterator) > > warning_errno("error closing directory '%s'", > > iter->base.path.buf); > > } > > + > > + string_list_clear(&level->entries, 0); > > } > > > > free(iter->levels); > > diff --git a/dir-iterator.h b/dir-iterator.h > > index 479e1ec784..6d438809b6 100644 > > --- a/dir-iterator.h > > +++ b/dir-iterator.h > > @@ -54,8 +54,11 @@ > > * and ITER_ERROR is returned immediately. In both cases, a meaningful > > * warning is emitted. Note: ENOENT errors are always ignored so that > > * the API users may remove files during iteration. > > + * > > + * - DIR_ITERATOR_SORTED: sort directory entries alphabetically. > > */ > > #define DIR_ITERATOR_PEDANTIC (1 << 0) > > +#define DIR_ITERATOR_SORTED (1 << 1) > > > > struct dir_iterator { > > /* The current path: */
Attachment:
signature.asc
Description: PGP signature