Hello, Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes: > On Sat, Aug 22, 2020 at 02:05:41AM +0200, Jakub Narębski wrote: >> "Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: >> >>> From: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> [...] >>> To minimize the space required to store corrected commit date, Git >>> stores corrected commit date offsets into the commit-graph file. The >>> corrected commit date offset for a commit is defined as the difference >>> between its corrected commit date and actual commit date. >> >> Perhaps we should add more details about data type sizes in question. > > Will add. Note however that we need to solve the problem of storing values which are not monotonic wrt. parent relation (partial order) in limited disk space, that is GENERATION_NUMBER_V2_OFFSET_MAX vs GENERATION_NUMBER_MAX; see comments in 11/11 and 00/11. >> >> Storing corrected commit date requires sizeof(timestamp_t) bytes, which >> in most cases is 64 bits (uintmax_t). However corrected commit date >> offsets can be safely stored^* using only 32 bits. This halves the size >> of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to >> 2*H + 16 + 4 bytes, which is reduction of around 6%, not including >> header, fanout table (OIDF) and extra edges list (EDGE). >> >> Which might mean that the extra complication is not worth it, and we >> should store corrected commit date directly instead. >> >> *) unless for example one of commits is malformed but valid, >> and has committerdate of 0 Unix time, 1 January 1970. See above. >>> While Git does not write out offsets at this stage, Git stores the >>> corrected commit dates in member generation of struct commit_graph_data. >>> It will begin writing commit date offsets with the introduction of >>> generation data chunk. >> >> OK, so the agenda for introducing geeration number v2 is as follows: >> - compute generation numbers v2, i.e. corrected commit date >> - store corrected commit date [offsets] in new GDAT chunk, >> unless backward-compatibility concerns require us to not to >> - load [and compute] corrected commit date from commit-graph >> storing it as 'generation' field of `struct commit_graph_data`, >> unless backward-compatibility concerns require us to store >> topological levels (generation number v1) in there instead >> > > The last point is not correct. We always store topological levels into > the topo_levels slab introduced and always store corrected commit date > into data->generation, regardless of backward compatibility concerns. I think I was not clear enough (in trying to be brief). I meant here loading available generation numbers for use in graph traversal, done in later patches in this series. In _next_ commit we store topological levels in `generation` field: @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); - graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + if (g->chunk_generation_data) + graph_data->generation = item->date + + (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index); + else + graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2; We use topo_level slab only when writing the commit-graph file. > We could avoid initializing topo_slab if we are not writing generation > data chunk (and thus don't need corrected commit dates) but that > wouldn't have an impact on run time while writing commit-graph because > computing corrected commit dates is cheap as the main cost is in walking > the graph and writing the file. Right. Though you need to add the cost of allocation and managing extra commit slab, I think that amortized cost is negligible. But what would be better is showing benchmark data: does writing the commit graph without GDAT take not insigificant more time than without this patch? [...] >>> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags) >>> for (i = 0; i < g->num_commits; i++) { >>> struct commit *graph_commit, *odb_commit; >>> struct commit_list *graph_parents, *odb_parents; >>> - timestamp_t max_generation = 0; >>> - timestamp_t generation; >>> + timestamp_t max_corrected_commit_date = 0; >>> + timestamp_t corrected_commit_date; >> >> This is simple, and perhaps unnecessary, rename of variables. >> Shouldn't we however verify *both* topological level, and >> (if exists) corrected commit date? > > The problem with verifying both topological level and corrected commit > dates is that we would have to re-fill commit_graph_data slab with commit > data chunk as we cannot modify data->generation otherwise, essentially > repeating the whole verification process. > > While it's okay for now, I might take this up in a future series [1]. > > [1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@xxxxxxxxx/ All right, I believe you that verifying both topological level and corrected commit date would be more difficult. That doesn't change the conclusion that this variable should remain to be named `generation`, as when verifying GDAT-less commit-graph files it would check topological levels (it uses commit_graph_generation(), which in turn uses `generation` field in commit graph info, which as I have show above in later patch could be v1 or v2 generation number). Best, -- Jakub Narębski