Re: [PATCH 02/15] refs.c: return error instead of dying when locking fails during transaction

Jeff King <peff@xxxxxxxx> · Tue, 11 Nov 2014 05:34:49 -0500

On Tue, Oct 21, 2014 at 01:36:47PM -0700, Ronnie Sahlberg wrote:

> commit e193c10fc4f9274d1e751cfcdcc4507818e8d498 upstream.
> 
> Change lock_ref_sha1_basic to return an error instead of dying when
> we fail to lock a file during a transaction.
> This function is only called from transaction_commit() and it knows how
> to handle these failures.
> [...]
> -		else
> -			unable_to_lock_die(ref_file, errno);
> +		else {
> +			struct strbuf err = STRBUF_INIT;
> +			unable_to_lock_message(ref_file, errno, &err);
> +			error("%s", err.buf);
> +			strbuf_reset(&err);
> +			goto error_return;
> +		}

I coincidentally just wrote almost the identical patch, because this
isn't just a cleanup; it fixes a real bug. During pack_refs, we call
prune_ref to lock and delete the loose ref. If the lock fails, that's
OK; that just means somebody else is updating it at the same time, and
we can skip our pruning step. But due to the unable_to_lock_die call
here in lock_ref_sha1_basic, the pack-refs process may die prematurely.

I wonder if it is worth pulling this one out from the rest of the
series, as it has value (and can be applied) on its own. I did some
digging on the history of this, too. Here's the rationale I wrote:

    lock_ref_sha1_basic: do not die on locking errors

    lock_ref_sha1_basic is inconsistent about when it calls
    die() and when it returns NULL to signal an error. This is
    annoying to any callers that want to recover from a locking
    error.

    This seems to be mostly historical accident. It was added in
    4bd18c4 (Improve abstraction of ref lock/write.,
    2006-05-17), which returned an error in all cases except
    calling safe_create_leading_directories, in which case it
    died.  Later, 40aaae8 (Better error message when we are
    unable to lock the index file, 2006-08-12) asked
    hold_lock_file_for_update to die for us, leaving the
    resolve_ref code-path the only one which returned NULL.

    We tried to correct that in 5cc3cef (lock_ref_sha1(): do not
    sometimes error() and sometimes die()., 2006-09-30),
    by converting all of the die() calls into returns. But we
    missed the "die" flag passed to the lock code, leaving us
    inconsistent. This state persisted until e5c223e
    (lock_ref_sha1_basic(): if locking fails with ENOENT, retry,
    2014-01-18). Because of its retry scheme, it does not ask
    the lock code to die, but instead manually dies with
    unable_to_lock_die().

    We can make this consistent with the other return paths by
    converting this to use unable_to_lock_message(), and
    returning NULL. This is safe to do because all callers
    already needed to check the return value of the function,
    since it could fail (and return NULL) for other reasons.

I also have some other cleanups to lock_ref_sha1_basic's error handling.
I'd be happy to take over this patch and send it along with those
cleanups as a separate series.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html