Re: [PATCH v2 02/17] chunk-format: create chunk format write API

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2/4/2021 4:24 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
>> +/*
>> + * When writing a chunk-based file format, collect the chunks in
>> + * an array of chunk_info structs. The size stores the _expected_
>> + * amount of data that will be written by write_fn.
>> + */
>> +struct chunk_info {
>> +	uint32_t id;
>> +	uint64_t size;
>> +	chunk_write_fn write_fn;
>> +};
>> +...
>> +void add_chunk(struct chunkfile *cf,
>> +	       uint64_t id,
>> +	       chunk_write_fn fn,
>> +	       size_t size)
>> +{
>> +	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
>> +
>> +	cf->chunks[cf->chunks_nr].id = id;
>> +	cf->chunks[cf->chunks_nr].write_fn = fn;
>> +	cf->chunks[cf->chunks_nr].size = size;
>> +	cf->chunks_nr++;
>> +}
> 
> Somebody somewhere between the caller in the higher part of the
> callchain (that has to work with platform native types) and the
> on-disk format at the bottom of the callchain (that has to work
> with fixed size data fields) needs to make sure that the size that
> the higher level caller has fits on-disk data structure we define,
> and the data we read from disk fits the in-core structure our caller
> use on the reading side.
> 
> If there is a function at the one level closer to the disk than
> "struct chunk_info" that takes a "struct chunk_info" and writes the
> id and size to disk (and fills "struct chunk_info" from what is read
> from the disk, on the reading side), it would be a good place to do
> the size_t to uint64_t conversion.

I'm fine with keeping the external interface focused on size_t
instead of uint64_t.

> It is OK to do the conversion-while-checking in add_chunk(), too.
> 
> But a silent type casting from size_t to uint64_t done silently by
> assignment bothers me. 

Does this bother you only because its part of the external interface?
If I understand correctly, uint64_t will always be at least as big
as size_t, so this doesn't need any protections for overflow or
anything. Is there something I should be doing before casting?

> Also, I think you meant to make the incoming
> ID uint32_t; am I missing something, or did nobody notice it in the
> review of the previous round?

Yes, this should be 32-bits. Will fix.
 
>> +int write_chunkfile(struct chunkfile *cf, void *data)
>> +{
>> +	int i;
>> +	size_t cur_offset = cf->f->offset + cf->f->total;
> 
> That ought to be off_t, as it is a seek position inside a file
> (struct hashfile.total is already off_t).

I can use off_t for the other offsets in this computation, but
cur_offset will be used in hashwrite_be64(), so maybe it is best
to use uint64_t here?

> Use csum-file.h::hashfile_total() instead, perhaps?  .offset member
> is an implementation detail of the hashfile API (i.e. some leftover
> bits are kept in in-core buffer, until we accumulate enough to make
> it worth flushing to the disk), and by using the helper, this code
> does not have to know about it.

Thanks! This is cleaner.

>> +	/* Add the table of contents to the current offset */
>> +	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
> 
> Is that 12 == sizeof(chunk_info.id) + sizeof(chunk_info.size)?
> If so, this makes sense.

Yes.

>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		hashwrite_be32(cf->f, cf->chunks[i].id);
>> +		hashwrite_be64(cf->f, cur_offset);
>> +
>> +		cur_offset += cf->chunks[i].size;
>> +	}
>> +
>> +	/* Trailing entry marks the end of the chunks */
>> +	hashwrite_be32(cf->f, 0);
>> +	hashwrite_be64(cf->f, cur_offset);
> 
> OK.  This helper does not tell us anything about what comes in the
> on-disk file before this point, but we write a table of contents
> that says "chunk with this ID has this size, chunk with that ID has
> that size, ...", concluded by something that looks like another
> entry with chunk ID 0 that records the current offset as its size.

Right. The table of contents gives us enough information to find
the start _and_ end of each chunk (and hence compute their size).

>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		uint64_t start_offset = cf->f->total + cf->f->offset;
> 
> Likewise about the type and use of hashfile_total().

This one can definitely be off_t.

>> +		int result = cf->chunks[i].write_fn(cf->f, data);
>> +
>> +		if (result)
>> +			return result;
>> +
>> +		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
>> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
>> +			    cf->chunks[i].size, cf->chunks[i].id,
>> +			    cf->f->total + cf->f->offset - start_offset);
> 
> I won't complain, as this apparently is sufficient to abstract out
> the two existing chunked format files, but it somehow feels a bit
> limiting that the one that calls add_chunk() is required to know
> what the size of generated data would be, way before .write_fn() is
> called to produce the actual data here.

This was pointed out earlier, but it _is_ part of the existing users
of the format. The table of contents is written at the start of the
file instead of the end (such as in the .zip format).

The current chunk format API makes the same assumption (ToC comes
first) but could be adjusted later to let this part of the method
dynamically compute the chunk sizes and fill a ToC at the end. The
way to modify this API would be to add a 'flags' parameter.

So far, this has not been necessary, but might be in the future.

>> +	}
>> +
>> +	return 0;
>> +}
>> diff --git a/chunk-format.h b/chunk-format.h
>> new file mode 100644
>> index 00000000000..bfaed672813
>> --- /dev/null
>> +++ b/chunk-format.h
>> @@ -0,0 +1,20 @@
>> +#ifndef CHUNK_FORMAT_H
>> +#define CHUNK_FORMAT_H
>> +
>> +#include "git-compat-util.h"
>> +
>> +struct hashfile;
>> +struct chunkfile;
>> +
>> +struct chunkfile *init_chunkfile(struct hashfile *f);
>> +void free_chunkfile(struct chunkfile *cf);
>> +int get_num_chunks(struct chunkfile *cf);
>> +typedef int (*chunk_write_fn)(struct hashfile *f,
>> +			      void *data);
> 
> Write this on a single line.

Will do.

Thanks,
-Stolee



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux