Re: [PATCH 13/15] chunk-format: create chunk reading API

Derrick Stolee <stolee@xxxxxxxxx> · Fri, 4 Dec 2020 08:45:10 -0500

On 12/3/2020 5:43 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
>> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>> ...
>> -		case GRAPH_CHUNKID_DATA:
>> -			if (graph->chunk_commit_data)
>> -				chunk_repeated = 1;
>> -			else
>> -				graph->chunk_commit_data = data + chunk_offset;
>> -			break;
>> +	/* limit the chunk-format list if we are ignoring Bloom filters */
>> +	if (!r->settings.commit_graph_read_changed_paths)
>> +		chunks_nr = 5;
> 
> I presume that this magic number 5 is directly connected to ...
> 
>> +static struct read_chunk_info read_chunks[] = {
>> +	[0] = {
>> +		GRAPH_CHUNKID_OIDFANOUT,
>> +		graph_read_oid_fanout
>> +	},
>> +	[1] = {
>> +		GRAPH_CHUNKID_OIDLOOKUP,
>> +		graph_read_oid_lookup
>> +	},
>> +	[2] = {
>> +		GRAPH_CHUNKID_DATA,
>> +		graph_read_data
>> +	},
>> +	[3] = {
>> +		GRAPH_CHUNKID_EXTRAEDGES,
>> +		graph_read_extra_edges
>> +	},
>> +	[4] = {
>> +		GRAPH_CHUNKID_BASE,
>> +		graph_read_base_graphs
>> +	},
>> +	[5] = {
>> +		GRAPH_CHUNKID_BLOOMINDEXES,
>> +		graph_read_bloom_indices
>> +	},
>> +	[6] = {
>> +		GRAPH_CHUNKID_BLOOMDATA,
>> +		graph_read_bloom_data
>> +	}
>> +};
> 
> ... the index of these entries in the table.
> 
> I find this arrangement brittle.  For one thing, it means that new
> chunks cannot be added at an arbitrary place, and BLOOMDATA must be
> at the end for the "limit the list when ignoring" logic to work
> correctly (even when the developer who is adding a new chunk type
> realizes that new one must be inserted before [5], and the magic
> number 5 above must be updated), and it won't work at all if a
> similar "we can optionally ignore reading the data" treatment needs
> to be made for new chunk types, since two or more things cannot be
> at the end of the list at the same time X-<.
> 
> Would it be a cleaner way to implement this "when member X in the
> settings structure is not set, ignore BLOOMINDEXES and BLOOMDATA"
> feature to let these graph_read_*() functions have access to the
> settings structure, so that bloom_indices and bloom_data callback
> functions can make decisions for themselves?
> 
> Once we do that, as far as I can tell, there is no longer any reason
> to initialize read_chunks[] array with designated initializer.  The
> implementation of read_table_of_contents() does not tolerate if this
> array is sparsely populated anyway, and except for the "do we ignore
> bloom?" hack, nothing really depends on the order of entries in the
> array, right?

This is a good point, and it's my fault for not noticing this behavior.
In an earlier version, I was initializing read_chunks dynamically
inside parse_commit_graph(), where the change made more sense (the
Bloom chunks were not initialized at all based on this condition). I
think the better way to handle this is to check the config within
the reading functions graph_read_bloom_(indices|data).

The added benefit of checking in the read_chunk_fn is that the
chunk-format API can see that these chunks are "known" chunk types,
in case we were to add logging for "I don't understand this chunk
type".

Thanks,
-Stolee