Re: [PATCH v4 07/10] commit-graph: implement generation data chunk

Jakub Narębski <jnareb@xxxxxxxxx> · Fri, 06 Nov 2020 18:56:28 +0100

In short: I think that because current implementation of writing GDOV
chunk follows an example of writing EDGE chunk, it should be left as it
is now (simple), and posible performance improvements be postponed to
some future commit.

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
> On Fri, Oct 30, 2020 at 01:45:29PM +0100, Jakub Narębski wrote:
[...]
>>> We test the overflow-related code with the following repo history:
>>>
>>>            F - N - U
>>>           /         \
>>> U - N - U            N
>>>          \          /
>>>            N - F - N
>> 
>> Do we need such complex history? I guess we need to test the handling of
>> merge commits too.
>> 
>
> I wanted to test three cases - a root epoch zero commit, a commit that's
> far enough in past to overflow the offset and a commit that's far enough
> in the future to overflow the offset.

All right, if I understand this correctly this would be U as root, U-F
pair of commits and N-F pair of commits, respectively.  Did I get it
right?

Anyway, it might be a good idea to put this explanation in the commit
message.

>>>
>>> Where the commits denoted by U have committer date of zero seconds
>>> since Unix epoch, the commits denoted by N have committer date of
>>> 1112354055 (default committer date for the test suite) seconds since
>>> Unix epoch and the commits denoted by F have committer date of
>>> (2 ^ 31 - 2) seconds since Unix epoch.
>>>
>>> The largest offset observed is 2 ^ 31, just large enough to overflow.

[...]
>>> @@ -1120,6 +1149,44 @@ static int write_graph_chunk_data(struct hashfile *f,
>>>  	return 0;
>>>  }
>>>  
>>> +static int write_graph_chunk_generation_data(struct hashfile *f,
>>> +					      struct write_commit_graph_context *ctx)
>>> +{
>>> +	int i, num_generation_data_overflows = 0;
>> 
>> Minor nitpick: in my opinion there should be empty line here, between
>> the variables declaration and the code... however not all
>> write_graph_chunk_*() functions have it.
>> 
>>> +	for (i = 0; i < ctx->commits.nr; i++) {
>>> +		struct commit *c = ctx->commits.list[i];
>>> +		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
>>> +		display_progress(ctx->progress, ++ctx->progress_cnt);
>> 
>> All right.
>> 
>>> +
>>> +		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
>>> +			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
>>> +			num_generation_data_overflows++;
>>> +		}
>> 
>> Hmmm... shouldn't we store these commits that need overflow handling
>> (with corrected commit date offset greater than GENERATION_NUMBER_V2_OFFSET_MAX)
>> in a list or a queue, to remember them for writing GDOV chunk?
>> 
>
> We could, although write_graph_chunk_extra_edges() (just like this function)
> prefers to iterate over all commits again. Both octopus merges and
> overflowing corrected commit dates are exceedingly rare, might be
> worthwhile to trade some memory to avoid looping again.

I'm sorry, I have not looked what write_graph_chunk_extra_edges() does,
or rather how it does what it does -- it is a good idea to pattern your
solution in similar existing code.

For me this is an even stronger hint that we should strive for
simplicity first, and leave possible performance improvements for the
future commit.  Especially that you perform the most significant
optimization for this overflow handling: ensuring that we do not perform
any work if there are no commits with generation data overflow.

Maybe, maybe we should add that information about similarity between
write_graph_chunk_generation_data_overflow() and write_graph_chunk_extra_edges() 
in the commit message.  I am unsure...

>> We could store oids, or we could store commits themselves, for example:
>> 
>> 		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
>> 			offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
>> 			num_generation_data_overflows++;
>> 
>> 			ALLOC_GROW(ctx->gdov_commits.list, ctx->gdov_commits.nr + 1, ctx->gdov_commits.alloc);
>> 			ctx->commits.list[ctx->gdov_commits.nr] = c;
>>             ctx->gdov_commits.nr++;
>> 		}
>> 
>> Though in the above proposal we could get rid of `num_generation_data_overflows`, 
>> as it should be the same as `ctx->gdov_commits.nr`.
>> 
>> I have called the extra commit list member of write_commit_graph_context
>> `gdov_commits`, but perhaps a better name would be `commits_gen_v2_overflow`, 
>> or similar more descriptive name.

[...]
>>> @@ -741,4 +742,47 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>>>  	)
>>>  '
>>>  
>>> +test_commit_with_date() {
>>> +  file="$1.t" &&
>>> +  echo "$1" >"$file" &&
>>> +  git add "$file" &&
>>> +  GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
>>> +  git tag "$1"
>>> +}
>> 
>> Here we add a helper function.  All right.
>> 
>> I wonder though if it wouldn't be a better idea to add `--date <date>`
>> option to the test_commit() function in test-lib-functions.sh (which
>> option would set GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, and also
>> set notick=yes).
>> 
>
> Yes, that's a better idea - I didn't know how to change test_commit()
> well enough to tinker with what's working.
>
>> For example:
>> 
>> diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
>> index f1ae935fee..a1f9a2b09b 100644
>> --- a/t/test-lib-functions.sh
>> +++ b/t/test-lib-functions.sh
>> @@ -202,6 +202,12 @@ test_commit () {
>>  		--signoff)
>>  			signoff="$1"
>>  			;;
>> +        --date)
>> +            notick=yes
>> +            GIT_COMMITTER_DATE="$2"
>> +            GIT_AUTHOR_DATE="$2"
>> +            shift
>> +            ;;
>>  		-C)
>>  			indir="$2"
>>  			shift

Note however that I have while I have followed example of other options
(namely '-C <directory>'), I have not actually tested this proposed
implementation in tests; I have just tested that it looks like it works
OK.

[...]
>>> +test_expect_success 'overflow corrected commit date offset' '
>>> +	objdir=".git/objects" &&
>>> +	UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
>>> +	FUTURE_DATE="@2147483646 +0000" &&
>> 
>> It is a bit funny to see UNIX_EPOCH_ZERO spelled one way, and
>> FUTURE_DATE other way.
>> 
>> Wouldn't be more readable to use UNIX_EPOCH_ZERO="@0 +0000"?
>
> It would, for some reason - I couldn't figure out the valid format for
> this. Changed.

Well, if "@2147483646 +0000" works (i.e. "@<Unix epoch/timestamp> <offset>"),
why the same for timestamp 0, i.e. "@0 +0000", wouldn't work?

[...]
>>> +graph_git_behavior 'overflow corrected commit date offset' repo left right
>> 
>> All right, here we compare the Git behavior with the commit-graph to the
>> behavior without it... however I think that those two tests really
>> should have distinct (different) test names. Currently they both use
>> 'overflow corrected commit date offset'.
>> 
>
> Following the earlier tests, the first test could be "set up and verify
> repo with generation data overflow chunk" and the git behavior test can
> be "generation data overflow chunk repo"

First is OK, the second could possibly be improved but is all right.

[...]
>> Though to reduce "noise" in this patch, the rename of run_three_modes()
>> to run_all_modes() and test_three_modes() to test_all_modes() could have
>> been done in a separate preparatory patch.  It would be pure refactoring
>> patch, without introducing any new functionality.
>> 
>
> Sure, that makes sense to me - this is patch is over 200 lines long
> already.

Thanks in advance.

Best,
-- 
Jakub Narębski