Re: [PATCH v4 7/8] refs: allow multiple reflog entries for the same refname

karthik nayak <karthik.188@xxxxxxxxx> · Fri, 20 Dec 2024 06:15:07 -0500

Toon Claes <toon@xxxxxxxxx> writes:

> Karthik Nayak <karthik.188@xxxxxxxxx> writes:
>
>> The reference transaction only allows a single update for a given
>> reference to avoid conflicts. This, however, isn't an issue for reflogs.
>> There are no conflicts to be resolved in reflogs and when migrating
>> reflogs between backends we'd have multiple reflog entries for the same
>> refname.
>>
>> So allow multiple reflog updates within a single transaction. Also the
>> reflog creation logic isn't exposed to the end user. While this might
>> change in the future, currently, this reduces the scope of issues to
>> think about.
>>
>> In the reftable backend, the writer sorts all updates based on the
>> update_index before writing to the block. When there are multiple
>> reflogs for a given refname, it is essential that the order of the
>> reflogs is maintained. So add the `index` value to the `update_index`.
>> The `index` field is only set when multiple reflog entries for a given
>> refname are added and as such in most scenarios the old behavior
>> remains.
>>
>> This is required to add reflog migration support to `git refs migrate`.
>>
>> Signed-off-by: Karthik Nayak <karthik.188@xxxxxxxxx>
>> ---
>>  refs/files-backend.c    | 15 +++++++++++----
>>  refs/reftable-backend.c | 22 +++++++++++++++++++---
>>  2 files changed, 30 insertions(+), 7 deletions(-)
>>
>> diff --git a/refs/files-backend.c b/refs/files-backend.c
>> index c11213f52065bcf2fa7612df8f9500692ee2d02c..8953d1c6d37b13b0db701888b3db92fd87a68aaa 100644
>> --- a/refs/files-backend.c
>> +++ b/refs/files-backend.c
>> @@ -2611,6 +2611,9 @@ static int lock_ref_for_update(struct files_ref_store *refs,
>>
>>  	update->backend_data = lock;
>>
>> +	if (update->flags & REF_LOG_ONLY)
>> +		goto out;
>> +
>>  	if (update->type & REF_ISSYMREF) {
>>  		if (update->flags & REF_NO_DEREF) {
>>  			/*
>> @@ -2829,13 +2832,16 @@ static int files_transaction_prepare(struct ref_store *ref_store,
>>  	 */
>>  	for (i = 0; i < transaction->nr; i++) {
>>  		struct ref_update *update = transaction->updates[i];
>> -		struct string_list_item *item =
>> -			string_list_append(&affected_refnames, update->refname);
>> +		struct string_list_item *item;
>>
>>  		if ((update->flags & REF_IS_PRUNING) &&
>>  		    !(update->flags & REF_NO_DEREF))
>>  			BUG("REF_IS_PRUNING set without REF_NO_DEREF");
>>
>> +		if (update->flags & REF_LOG_ONLY)
>> +			continue;
>> +
>> +		item = string_list_append(&affected_refnames, update->refname);
>>  		/*
>>  		 * We store a pointer to update in item->util, but at
>>  		 * the moment we never use the value of this field
>> @@ -3035,8 +3041,9 @@ static int files_transaction_finish_initial(struct files_ref_store *refs,
>>
>>  	/* Fail if a refname appears more than once in the transaction: */
>>  	for (i = 0; i < transaction->nr; i++)
>> -		string_list_append(&affected_refnames,
>> -				   transaction->updates[i]->refname);
>> +		if (!(transaction->updates[i]->flags & REF_LOG_ONLY))
>> +			string_list_append(&affected_refnames,
>> +					   transaction->updates[i]->refname);
>>  	string_list_sort(&affected_refnames);
>>  	if (ref_update_reject_duplicates(&affected_refnames, err)) {
>>  		ret = TRANSACTION_GENERIC_ERROR;
>> diff --git a/refs/reftable-backend.c b/refs/reftable-backend.c
>> index b2e3ba877de9e59fea5a4d066eb13e60ef22a32b..bec5962debea7b62572d08f6fa8fd38ab4cd8af6 100644
>> --- a/refs/reftable-backend.c
>> +++ b/refs/reftable-backend.c
>> @@ -990,8 +990,9 @@ static int reftable_be_transaction_prepare(struct ref_store *ref_store,
>>  		if (ret)
>>  			goto done;
>>
>> -		string_list_append(&affected_refnames,
>> -				   transaction->updates[i]->refname);
>> +		if (!(transaction->updates[i]->flags & REF_LOG_ONLY))
>> +			string_list_append(&affected_refnames,
>> +					   transaction->updates[i]->refname);
>>  	}
>>
>>  	/*
>> @@ -1301,6 +1302,7 @@ static int write_transaction_table(struct reftable_writer *writer, void *cb_data
>>  	struct reftable_log_record *logs = NULL;
>>  	struct ident_split committer_ident = {0};
>>  	size_t logs_nr = 0, logs_alloc = 0, i;
>> +	uint64_t max_update_index = ts;
>>  	const char *committer_info;
>>  	int ret = 0;
>>
>> @@ -1405,7 +1407,19 @@ static int write_transaction_table(struct reftable_writer *writer, void *cb_data
>>  				}
>>
>>  				fill_reftable_log_record(log, &c);
>> -				log->update_index = ts;
>> +
>> +				/*
>> +				 * Updates are sorted by the writer. So updates for the same
>> +				 * refname need to contain different update indices.
>> +				 */
>> +				log->update_index = ts + u->index;
>
> During my review I was having a hard time figuring out when `u->index`
> was not 0 and where it is being set. Can you maybe explain a bit?
>

As of this patch, there are no users of the index. This patch only adds
the infrastructure. The next patch is where we actually set the index.

In short, the index is only needed for the reftable backend. This is
because reflog entries have a specific order and we need to retain that
order. In the reftable backend, all writes are sorted by refname as an
optimization. The index provides a parallel mechanism to retain the order
of the updates. There are no real use cases apart from migrating reflogs
from one backend to another, which is added in the next patch.

>> +
>> +				/*
>> +				 * Note the max update_index so the limit can be set later on.
>> +				 */
>> +				if (log->update_index > max_update_index)
>
> Is there a lot of value in having this if clause? I was a bit confused
> why it is here, because I think we can do the assignment to
> max_update_index unconditionally.
>

It is necessary. For reflogs whose index isn't set, their `update_index`
would simply be the `ts` value. So if there is a mix of reflog updates
with and without an index, an unconditional assignment could leave
`max_update_index` at the value of whichever update was processed last
instead of the actual max.
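
A small standalone sketch of that scenario (the helper name and the flat
array of indices are made up for illustration; the real loop iterates
over the transaction's updates):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: index[i] plays the role of u->index for the
 * i-th reflog update, with 0 meaning "index not set".
 */
static uint64_t toy_max_update_index(uint64_t ts, const uint64_t *index,
				     size_t nr)
{
	uint64_t max_update_index = ts;
	size_t i;

	assert(index || !nr);
	for (i = 0; i < nr; i++) {
		uint64_t update_index = ts + index[i];

		/* Only ever move the max upwards. */
		if (update_index > max_update_index)
			max_update_index = update_index;
	}
	return max_update_index;
}
```

With `ts = 10` and indices `{0, 2, 1, 0}` this yields 12. Assigning
unconditionally inside the loop would end with 10, because the last
update has no index set.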

>> +					max_update_index = log->update_index;
>> +
>>  				log->refname = xstrdup(u->refname);
>>  				memcpy(log->value.update.new_hash,
>>  				       u->new_oid.hash, GIT_MAX_RAWSZ);
>> @@ -1469,6 +1483,8 @@ static int write_transaction_table(struct reftable_writer *writer, void *cb_data
>>  	 * and log blocks.
>>  	 */
>>  	if (logs) {
>> +		reftable_writer_set_limits(writer, ts, max_update_index);
>
> So max_update_index is used to set the limits on the current writer, but
> using reftable_stack_next_update_index() it's also used to give the next
> stack it's starting point for their range.

Using `reftable_stack_next_update_index()` would return `ts + 1` as that
is the next sequential update. This could be less than the
max_update_index, so we can't use that. Once all the reflogs are
written, the next call to `reftable_stack_next_update_index()` would
return `max_update_index + 1`.
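
As a rough model of this (toy types, not the reftable API, and assuming
the next update index is derived from the max of the newest table in the
stack):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the limits of one table in a stack. */
struct toy_table {
	uint64_t min_update_index;
	uint64_t max_update_index;
};

/*
 * Assumed behavior: the next update index is one past the max of the
 * newest (last) table, or 1 for an empty stack.
 */
static uint64_t toy_next_update_index(const struct toy_table *tables,
				      size_t nr)
{
	assert(tables || !nr);
	if (!nr)
		return 1;
	return tables[nr - 1].max_update_index + 1;
}
```

So if the stack ends in a table covering [1, 9], the next index is 10,
which is our `ts`. Once we add a table with limits [10, 12], the next
index becomes 13, i.e. `max_update_index + 1`.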

> Now I'm not familiar enough with the code, but are all stacks handled
> in sequential order?

Not sure I understand your question correctly. Each update is handled
according to its update index, and updates are stored sequentially.
Tables are named after the min and max update index that they store.

> And how does a stack relate to a reftable file?

A stack refers to a collection of reftable tables. So for a given
worktree, the tables under '$GIT_DIR/reftable' constitute a stack, and
'tables.list' states which tables are part of the stack.

>> +
>>  		ret = reftable_writer_add_logs(writer, logs, logs_nr);
>>  		if (ret < 0)
>>  			goto done;
>>
>> --
>> 2.47.1