Re: [PATCH v3 00/14] ref-transactions-reflog

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Tue, 18 Nov 2014 20:46:36 +0100

On 11/18/2014 07:36 PM, Ronnie Sahlberg wrote:
> On Tue, Nov 18, 2014 at 3:26 AM, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
>> On 11/18/2014 02:35 AM, Stefan Beller wrote:
>>> The following patch series updates the reflog handling to use transactions.
>>> This patch series has previously been sent to the list[1].
>>> [...]
>>
>> I was reviewing this patch series (I left some comments in Gerrit about
>> the first few patches) when I realized that I'm having trouble
>> understanding the big picture of where you want to go with this. I have
>> the feeling that the operations that you are implementing are at too low
>> a level of abstraction.
>>
>> What are the elementary write operations that are needed for a reflog?
>> Off the top of my head,
>>
>> 1. Add a reflog entry when a reference is updated in a transaction.
>> 2. Rename a reflog file when the corresponding reference is renamed.
>> 3. Delete the reflog when the corresponding reference is deleted [1].
>> 4. Configure a reference to be reflogged.
>> 5. Configure a reference to not be reflogged anymore and delete any
>>    existing reflog.
>> 6. Selectively expire old reflog entries, e.g., based on their age.
>>
>> Have I forgotten any?
>>
>> The first three should be side-effects of the corresponding reference
>> updates. Aside from the fact that renames are not yet done within a
>> transaction, I think this is already the case.
>>
>> Number 4, I think, currently only happens in conjunction with adding a
>> line to the reflog. So it could be implemented, say, as a
>> FORCE_CREATE_REFLOG flag on a ref_update within a transaction.
>>
>> Number 5 is not very interesting, I think. For example, it could be a
>> separate API function, disconnected from any transactions.
>>
>> Number 6 is more interesting, and from my quick reading, it looks like a
>> lot of the work of this patch series is to allow number 6 to be
>> implemented in builtin/reflog.c:expire_reflog(). But it seems to me that
>> you are building API calls at the wrong level of abstraction. Expiring a
>> reflog should be a single API call to the refs API, and ultimately it
>> should be left up to the refs backend to decide how to implement it. For
>> a filesystem-based backend, it would do what it does now. But (for
>> example) a SQL-based backend might implement this as a single SELECT
>> statement.
> 
> I agree in principle. But things are more difficult since
> expire_reflog() has very complex semantics.
> To keep things simple for the reviews at this stage the logic is the
> same as the original code:
>   loop over all entries:
>      use very complex conditionals to decide which entries to keep/remove
>      optionally modify the sha1 values for the records we keep
>      write records we keep back to the file one record at a time
> 
> So that as far as possible, we keep the same rules and behavior but we
> use a different API for the actual
> "write entry to new reflog".
> 
> 
> We could wrap this inside a new specific transaction_expire_reflog()
> function so that other types of backends, for example an SQL backend,
> could optimize, but I think that should be in a separate later patch
> because expire_reflog is almost impossibly complex.
> It will not be a simple SELECT unfortunately.
> 
> The current expire logic is something like:
>   1, expire all entries older than timestamp
>   2, optionally, also expire all entries that refer to unreachable
>      objects using a different timestamp
>       This involves actually reading the objects that the sha1 points
>       to and parsing them!
>   3, optionally, if the sha1 objects can not be referenced, they are
>      not commit objects or if they don't exist, then expire them too.
>       This also involves reading the objects behind the sha1.
>   4, optionally, delete reflog entry #foo
>   5, optionally, if any log entries were discarded due to 2,3,4 then
>      we might also re-write and modify some of the reflog entries we keep.
>   or any combination thereof
> 
>   (6, if --dry-run is specified, just print what we would have expired)
> 
> 
> 2 and 3 require us to read the objects for the entry
> 4 allows us to delete a specific entry
> 5 means that even for entries we keep we will need to mutate them.

Thanks for the explanation. I now understand that it might be more than
a single SELECT statement.

Regarding the complicated rules for expiring reflogs (1, 2, 3, 4): For
now I think it would be fine for the new expire_reflog() API function to
take a callback function as an argument.

Regarding the stitching together of the survivors (5), it seems like the
API function would be the right place to handle that.

Regarding 6, it sounds like you could run the reflog entries through
your callback and report what it *would* have expired.
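
To make the shape of that concrete, here is a rough sketch of what I
have in mind. Everything in it is hypothetical: the struct, the
function name expire_reflog_sketch(), the flag names, and the in-memory
array standing in for the reflog are made up for illustration and are
not taken from refs.c or from this patch series. The caller supplies
only the policy (the callback); the loop, the dry-run handling, and the
stitching of survivors live inside the API function. The stitching
shown is a simplification: a survivor's "old" value is set to the "new"
value of the entry that now precedes it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified reflog entry; not Git's real data structure. */
struct reflog_entry {
	char old_id[41];
	char new_id[41];
	unsigned long timestamp;
};

/* Caller-supplied policy: return nonzero if the entry should be expired. */
typedef int (*reflog_expire_fn)(const struct reflog_entry *e, void *cb_data);

#define EXPIRE_DRY_RUN (1u << 0) /* only count, do not modify anything */
#define EXPIRE_REWRITE (1u << 1) /* stitch survivors together */

/*
 * Run every entry through the callback. Unless EXPIRE_DRY_RUN is set,
 * survivors are compacted in place and *nr is updated to their count.
 * Returns the number of entries that were (or would have been) expired.
 */
size_t expire_reflog_sketch(struct reflog_entry *entries, size_t *nr,
			    reflog_expire_fn should_expire, void *cb_data,
			    unsigned flags)
{
	size_t i, kept = 0, expired = 0;

	assert(entries && nr && should_expire);

	for (i = 0; i < *nr; i++) {
		if (should_expire(&entries[i], cb_data)) {
			expired++;
			continue;
		}
		if (flags & EXPIRE_DRY_RUN)
			continue;
		if (kept != i)
			entries[kept] = entries[i];
		/* Make the survivor's "old" match its new predecessor's "new". */
		if ((flags & EXPIRE_REWRITE) && kept > 0)
			memcpy(entries[kept].old_id, entries[kept - 1].new_id,
			       sizeof(entries[kept].old_id));
		kept++;
	}
	if (!(flags & EXPIRE_DRY_RUN))
		*nr = kept;
	return expired;
}

/* Example policy: expire entries older than the cutoff in cb_data. */
int older_than(const struct reflog_entry *e, void *cb_data)
{
	return e->timestamp < *(const unsigned long *)cb_data;
}
```

The real rules (reachability checks and so on) would go into the
callback, so a backend that can do better than a linear scan is free to
replace the loop without the caller noticing.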

>> I also don't have the feeling that reflog expiration has to be done
>> within a ref_transaction. For example, is there ever a reason to combine
>> expiration with other reference updates in a single atomic transaction?
> 
> --updateref
> In expire_reflog() we not only prune the reflog. When --updateref is
> used we update the actual ref itself.
> I think we want the ref update and the reflog update to
> both be part of a single atomic transaction.

ISTM that --updateref is another aspect of stitching together the
surviving reflog entries and could properly be done by the API
expire_reflog() function. Maybe the implementation would use an
*internal* transaction. But I still don't see a need for the caller to
be able to combine *arbitrary* reflog changes with *arbitrary* reference
updates in a single transaction, and the unneeded flexibility seems to
require the API to become more complicated than necessary.
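
As a sketch of what I mean by an internal transaction (again entirely
hypothetical: the struct, the function name, the flag, and the
fixed-size in-memory log are invented for illustration, and returning
an error when no entries survive is just a choice made for this sketch,
not a claim about what Git does): the function stages the pruned log
and the new ref value on a private copy and publishes both with one
assignment, so the caller never sees one change without the other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LOG 8

/* Hypothetical in-memory stand-in for a ref plus its reflog. */
struct ref_with_log {
	char value[41];                /* current value of the ref */
	char log_new_ids[MAX_LOG][41]; /* "new" id per entry, oldest first */
	size_t nr;                     /* number of reflog entries */
};

#define EXPIRE_UPDATE_REF (1u << 0)

/*
 * Drop every entry i with keep[i] == 0. With EXPIRE_UPDATE_REF, also set
 * the ref to the "new" id of the last surviving entry. All changes are
 * staged on a copy and published together; on failure *ref is untouched.
 * Returns 0 on success, -1 if the ref should be updated but no entry
 * survived (an assumption of this sketch).
 */
int expire_reflog_updateref(struct ref_with_log *ref, const int *keep,
			    unsigned flags)
{
	struct ref_with_log staged;
	size_t i, kept = 0;

	assert(ref && keep && ref->nr <= MAX_LOG);
	staged = *ref;

	for (i = 0; i < ref->nr; i++) {
		if (!keep[i])
			continue;
		memcpy(staged.log_new_ids[kept++], ref->log_new_ids[i],
		       sizeof(staged.log_new_ids[0]));
	}
	staged.nr = kept;

	if (flags & EXPIRE_UPDATE_REF) {
		if (!kept)
			return -1; /* abort: nothing has been published */
		memcpy(staged.value, staged.log_new_ids[kept - 1],
		       sizeof(staged.value));
	}

	*ref = staged; /* "commit" of the internal transaction */
	return 0;
}
```

The point is that the transaction is an implementation detail of the
one function; nothing about it has to leak into the public API.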

>> I think not.
>>
>> So it seems to me that it would be more practical to have a separate API
>> function that is called to expire selected entries from a reflog [2],
>> unconnected with any transaction.
> 
> I think it makes the API cleaner if we have a
> 'you can only update a ref/reflog/<other things added in the future>/
> from within a transaction.'
> 
> Since we need to do reflog changes within a transaction for the expire
> reflog case as well as the rename ref case
> I think it makes sense to enforce that reflog changes must be done
> within a transaction to just make it consistent.

I'm still not convinced. For me, "reflog_expire()" is an unusual outlier
operation, much like "git gc" or "git pack-refs" or "git fsck". None of
these are part of the beautiful Git data model; they are messy
maintenance operations. Forcing reference transactions to be general
enough to allow reflog expiration to be implemented *outside* the refs
API sacrifices their simplicity for lots of infrastructure that will
probably only be used to implement this single operation. Better to
implement reflog expiration *inside* the refs API.

That's my take on it, anyway.

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
