Re: [PATCH 0/3] lazily load commit->buffer

Junio C Hamano <gitster@xxxxxxxxx> · Sat, 26 Jan 2013 13:26:53 -0800

Jeff King <peff@xxxxxxxx> writes:

> Yeah, agreed. I started to fix this up with a use/unuse pattern and
> realized something: all of the call sites are calling logmsg_reencode
> anyway, because that is the next logical step in doing anything with the
> buffer that is not just parsing out the parent/timestamp/tree info. And
> since that function already might allocate (for the re-encoded copy),
> callers have to handle the maybe-borrowed-maybe-free situation already.
>
> So I came up with this patch series, which I think should fix the
> problem, and actually makes the call-sites easier to read, rather than
> harder.
>
>   [1/3]: commit: drop useless xstrdup of commit message
>   [2/3]: logmsg_reencode: never return NULL
>   [3/3]: logmsg_reencode: lazily load missing commit buffers
>
> Here's the diffstat:
>
>  builtin/blame.c                  | 22 ++-------
>  builtin/commit.c                 | 14 +-----
>  commit.h                         |  1 +
>  pretty.c                         | 93 ++++++++++++++++++++++++++---------
>  t/t4042-diff-textconv-caching.sh |  8 +++
>  5 files changed, 85 insertions(+), 53 deletions(-)
>
> Not too bad, and 27 of the lines added in pretty.c are new comments
> explaining the flow of logmsg_reencode. So even if this doesn't get
> every case, I think it's a nice cleanup.

This looks very good.

I wonder if this lets us get rid of the hack in cmd_log_walk() that
does this:

        while ((commit = get_revision(rev)) != NULL) {
                if (!log_tree_commit(rev, commit) &&
                    rev->max_count >= 0)
                        rev->max_count++;
!               if (!rev->reflog_info) {
!                       /* we allow cycles in reflog ancestry */
                        free(commit->buffer);
                        commit->buffer = NULL;
!               }
                free_commit_list(commit->parents);
                commit->parents = NULL;

After log_tree_commit() handles the commit, using the buffer, we
discard the memory associated to it because we know we no longer
will use it in normal cases.

The "do not do that if rev->reflog_info is true" was added in
a6c7306 (--walk-reflogs: do not crash with cyclic reflog ancestry,
2007-01-20) because the second and subsequent display of "commit"
(which happens to occur only when walking reflogs) needs to look at
commit->buffer again, and this hack forces us to retain the buffer
for _all_ commit objects.

But your patches could be seen as a different (and more correct) way
to fix the same issue.  Once the display side learns how to re-read
the log text of the commit object, the above becomes unnecessary, no?

We may still be helped if majority of commit objects that appear in
the reflog appear more than once, in which case retaining the buffer
for _all_ commits could be an overall win.  Not having to read the
buffer for the same commit each time it is shown for majority of
multiply-appearing commits, in exchange for having to keep the
buffer for commits that appears only once that are minority and
suffering increasted page cache pressure.  That could be seen as an
optimization.

But that is a performance thing, not a correctness issue, so "we
allow cycles" implying "therefore if we discard the buffer, we will
show wrong output" becomes an incorrect justification.

I happen to have HEAD reflog that is 30k entries long; more than 26k
represent a checkout of unique commit.  So I suspect that the above
hack to excessive retain commit->buffer for already used commits will
not help us performance-wise, either.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html