On Tue, 27 Aug 2013, Junio C Hamano wrote: > Nicolas Pitre <nico@xxxxxxxxxxx> writes: > > > This goes as follows: > > > > - Tree reference: either variable length encoding of the index > > into the SHA1 table or the literal SHA1 prefixed by 0 (see > > add_sha1_ref()). > > > > - Parent count: variable length encoding of the number of parents. > > This is normally going to occupy a single byte but doesn't have to. > > > > - List of parent references: a list of add_sha1_ref() encoded references, > > or nothing if the parent count was zero. > > > > - Author reference: variable length encoding of an index into the author > > string dictionary table which also covers the time zone. To make the > > overall encoding efficient, the author table is already sorted by usage > > frequency so the most used names are first and require the shortest > > index encoding. > > > > - Author time stamp: variable length encoded. Year 2038 ready! > > > > - Committer reference: same as author reference. > > > > - Committer time stamp: same as author time stamp. > > > > The remainder of the canonical commit object content is then zlib > > compressed and appended to the above. > > > > Rationale: The most important commit object data is densely encoded while > > requiring no zlib inflate processing, and all SHA1 references are most > > likely to be direct indices into the pack index file requiring no SHA1 > > search into the pack index file. > > > > Signed-off-by: Nicolas Pitre <nico@xxxxxxxxxxx> > > --- > > packv4-create.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 119 insertions(+) > > > > diff --git a/packv4-create.c b/packv4-create.c > > index bf33d15..cedbbd9 100644 > > --- a/packv4-create.c > > +++ b/packv4-create.c > > @@ -13,6 +13,9 @@ > > #include "tree-walk.h" > > #include "pack.h" > > > > + > > +static int pack_compression_level = Z_DEFAULT_COMPRESSION; > > + > > struct data_entry { > > unsigned offset; > > unsigned size; > > @@ -289,6 +292,122 @@ static unsigned char *add_sha1_ref(unsigned char *dst, const unsigned char *sha1 > > return dst + 20; > > } > > > > +/* > > + * This converts a canonical commit object buffer into its > > + * tightly packed representation using the already populated > > + * and sorted commit_name_table dictionary. The parsing is > > + * strict so to ensure the canonical version may always be > > + * regenerated and produce the same hash. > > + */ > > +void * conv_to_dict_commit(void *buffer, unsigned long *psize) > > Drop SP between asterisk and "conv_"? > > > +{ > > + unsigned long size = *psize; > > + char *in, *tail, *end; > > + unsigned char *out; > > + unsigned char sha1[20]; > > + int nb_parents, index, tz_val; > > + unsigned long time; > > + z_stream stream; > > + int status; > > + > > + /* > > + * It is guaranteed that the output is always going to be smaller > > + * than the input. We could even do this conversion in place. > > + */ > > + in = buffer; > > + tail = in + size; > > + buffer = xmalloc(size); > > + out = buffer; > > + > > + /* parse the "tree" line */ > > + if (in + 46 >= tail || memcmp(in, "tree ", 5) || in[45] != '\n') > > + goto bad_data; > > + if (get_sha1_hex(in + 5, sha1) < 0) > > + goto bad_data; > > Is this strict enough to guarantee roundtrip hash identity? Because > get_sha1_hex() accepts hexadecimal represented with uppercase A-F, > you need to reject such a "broken" commit object, no? Indeed, yes. Same concern as with split_ident_line() I mentioned before. > Same for parent commit object names below that are parsed with the > same helper. Exact. Nicolas -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html