Re: [PATCH v2 02/17] chunk-format: create chunk format write API

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 04 Feb 2021 13:24:12 -0800

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> +/*
> + * When writing a chunk-based file format, collect the chunks in
> + * an array of chunk_info structs. The size stores the _expected_
> + * amount of data that will be written by write_fn.
> + */
> +struct chunk_info {
> +	uint32_t id;
> +	uint64_t size;
> +	chunk_write_fn write_fn;
> +};
> +...
> +void add_chunk(struct chunkfile *cf,
> +	       uint64_t id,
> +	       chunk_write_fn fn,
> +	       size_t size)
> +{
> +	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
> +
> +	cf->chunks[cf->chunks_nr].id = id;
> +	cf->chunks[cf->chunks_nr].write_fn = fn;
> +	cf->chunks[cf->chunks_nr].size = size;
> +	cf->chunks_nr++;
> +}

Somebody somewhere between the caller in the higher part of the
callchain (that has to work with platform native types) and the
on-disk format at the bottom of the callchain (that has to work
with fixed size data fields) needs to make sure that the size that
the higher level caller has fits on-disk data structure we define,
and the data we read from disk fits the in-core structure our caller
use on the reading side.

If there is a function at the one level closer to the disk than
"struct chunk_info" that takes a "struct chunk_info" and writes the
id and size to disk (and fills "struct chunk_info" from what is read
from the disk, on the reading side), it would be a good place to do
the size_t to uint64_t conversion.

It is OK to do the conversion-while-checking in add_chunk(), too.

But a silent type casting from size_t to uint64_t done silently by
assignment bothers me.  Also, I think you meant to make the incoming
ID uint32_t; am I missing something, or did nobody notice it in the
review of the previous round?

> +int write_chunkfile(struct chunkfile *cf, void *data)
> +{
> +	int i;
> +	size_t cur_offset = cf->f->offset + cf->f->total;

That ought to be off_t, as it is a seek position inside a file
(struct hashfile.total is already off_t).

Use csum-file.h::hashfile_total() instead, perhaps?  .offset member
is an implementation detail of the hashfile API (i.e. some leftover
bits are kept in in-core buffer, until we accumulate enough to make
it worth flushing to the disk), and by using the helper, this code
does not have to know about it.

> +	/* Add the table of contents to the current offset */
> +	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;

Is that 12 == sizeof(chunk_info.id) + sizeof(chunk_info.size)?
If so, this makes sense.

> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		hashwrite_be32(cf->f, cf->chunks[i].id);
> +		hashwrite_be64(cf->f, cur_offset);
> +
> +		cur_offset += cf->chunks[i].size;
> +	}
> +
> +	/* Trailing entry marks the end of the chunks */
> +	hashwrite_be32(cf->f, 0);
> +	hashwrite_be64(cf->f, cur_offset);

OK.  This helper does not tell us anything about what comes in the
on-disk file before this point, but we write a table of contents
that says "chunk with this ID has this size, chunk with that ID has
that size, ...", concluded by something that looks like another
entry with chunk ID 0 that records the current offset as its size.

> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		uint64_t start_offset = cf->f->total + cf->f->offset;

Likewise about the type and use of hashfile_total().

> +		int result = cf->chunks[i].write_fn(cf->f, data);
> +
> +		if (result)
> +			return result;
> +
> +		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
> +			    cf->chunks[i].size, cf->chunks[i].id,
> +			    cf->f->total + cf->f->offset - start_offset);

I won't complain, as this apparently is sufficient to abstract out
the two existing chunked format files, but it somehow feels a bit
limiting that the one that calls add_chunk() is required to know
what the size of generated data would be, way before .write_fn() is
called to produce the actual data here.

> +	}
> +
> +	return 0;
> +}
> diff --git a/chunk-format.h b/chunk-format.h
> new file mode 100644
> index 00000000000..bfaed672813
> --- /dev/null
> +++ b/chunk-format.h
> @@ -0,0 +1,20 @@
> +#ifndef CHUNK_FORMAT_H
> +#define CHUNK_FORMAT_H
> +
> +#include "git-compat-util.h"
> +
> +struct hashfile;
> +struct chunkfile;
> +
> +struct chunkfile *init_chunkfile(struct hashfile *f);
> +void free_chunkfile(struct chunkfile *cf);
> +int get_num_chunks(struct chunkfile *cf);
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      void *data);

Write this on a single line.

> +void add_chunk(struct chunkfile *cf,
> +	       uint64_t id,
> +	       chunk_write_fn fn,
> +	       size_t size);

Shouldn't this match the order of members in chunk_info struct?

> +int write_chunkfile(struct chunkfile *cf, void *data);
> +
> +#endif