Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Taylor Blau <me@xxxxxxxxxxxx> writes:

> +== pack-*.rev files have the format:
> +
> +  - A 4-byte magic number '0x52494458' ('RIDX').
> +
> +  - A 4-byte version identifier (= 1)
> +
> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256)

These two are presumably 4-byte-wide network byte order integers.
We should spell it out.

> +  - A table of index positions, sorted by their corresponding offsets in the
> +    packfile.

Likewise, how wide is each entry and in what byte order, and how
many entries are there in the table?

	... oh, what about the "one beyond the last"?  We cannot
	go back to the forward index to learn the offset of such
	an non-existent object, can we?

Again, I expect it to be 4-byte-wide network byte order integer.

> -int load_pack_revindex(struct packed_git *p)
> +static int load_pack_revindex_from_memory(struct packed_git *p)

I said it when I saw the beginning of v1 API patches, but it is a
bit unreasonable to call the act of "computing from the forward
index" "to load from memory".  Loading from disk perfectly works as
a phrase, though.

> +#define RIDX_MIN_SIZE (12 + (2 * the_hash_algo->rawsz))
> +
> +static int load_revindex_from_disk(char *revindex_name,
> +				   uint32_t num_objects,
> +				   const void **data, size_t *len)
> +{
> +	int fd, ret = 0;
> +	struct stat st;
> +	size_t revindex_size;
> +
> +	fd = git_open(revindex_name);
> +
> +	if (fd < 0) {
> +		ret = -1;
> +		goto cleanup;
> +	}
> +	if (fstat(fd, &st)) {
> +		ret = error_errno(_("failed to read %s"), revindex_name);
> +		goto cleanup;
> +	}
> +
> +	revindex_size = xsize_t(st.st_size);
> +
> +	if (revindex_size < RIDX_MIN_SIZE) {
> +		ret = error(_("reverse-index file %s is too small"), revindex_name);
> +		goto cleanup;
> +	}
> +
> +	if (revindex_size - RIDX_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> +		ret = error(_("reverse-index file %s is corrupt"), revindex_name);
> +		goto cleanup;
> +	}
> +
> +	*len = revindex_size;
> +	*data = xmmap(NULL, revindex_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +cleanup:
> +	close(fd);
> +	return ret;
> +}
> +
> +static int load_pack_revindex_from_disk(struct packed_git *p)
> +{
> +	char *revindex_name;
> +	int ret;
> +	if (open_pack_index(p))
> +		return -1;
> +
> +	revindex_name = pack_revindex_filename(p);
> +
> +	ret = load_revindex_from_disk(revindex_name,
> +				      p->num_objects,
> +				      &p->revindex_map,
> +				      &p->revindex_size);
> +	if (ret)
> +		goto cleanup;
> +
> +	p->revindex_data = (char *)p->revindex_map + 12;

We've seen hardcoded constant "12" twice so far in this patch.

We need a C proprocessor macro "#define RIDX_FILE_HEADER_SIZE 12" or
something, perhaps?

> +cleanup:
> +	free(revindex_name);
> +	return ret;
> +}
> +
> +int load_pack_revindex(struct packed_git *p)
> +{
> +	if (p->revindex || p->revindex_data)
> +		return 0;
> +
> +	if (!load_pack_revindex_from_disk(p))
> +		return 0;
> +	else if (!load_pack_revindex_from_memory(p))
> +		return 0;
> +	return -1;
> +}
> +
>  int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
>  {
>  	unsigned lo, hi;
> @@ -203,18 +285,28 @@ int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
>  
>  uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
>  {
> -	if (!p->revindex)
> +	if (!(p->revindex || p->revindex_data))
>  		BUG("pack_pos_to_index: reverse index not yet loaded");
>  	if (p->num_objects <= pos)
>  		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
> -	return p->revindex[pos].nr;
> +
> +	if (p->revindex)
> +		return p->revindex[pos].nr;
> +	else
> +		return get_be32((char *)p->revindex_data + (pos * sizeof(uint32_t)));

Good.  We are using 32-bit uint in network byte order.  We should
document it as such.

Let's not strip const away while casting, though.  get_be32()
ensures that it only reads and never writes thru the pointer, and
p->revindex_data is a "const void *".

>  }
>  
>  off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
>  {
> -	if (!p->revindex)
> +	if (!(p->revindex || p->revindex_data))
>  		BUG("pack_pos_to_index: reverse index not yet loaded");
>  	if (p->num_objects < pos)
>  		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
> -	return p->revindex[pos].offset;
> +
> +	if (p->revindex)
> +		return p->revindex[pos].offset;
> +	else if (pos == p->num_objects)
> +		return p->pack_size - the_hash_algo->rawsz;

OK, here is the answer to my previous question.  We should document
that the table has num_objects entries in the on-disk file (we do
not need to say that there is no sentinel entry in the table at the
end).

> +	else
> +		return nth_packed_object_offset(p, pack_pos_to_index(p, pos));
>  }
> diff --git a/pack-revindex.h b/pack-revindex.h
> index 6e0320b08b..01622cf21a 100644
> --- a/pack-revindex.h
> +++ b/pack-revindex.h
> @@ -21,6 +21,9 @@ struct packed_git;
>  /*
>   * load_pack_revindex populates the revindex's internal data-structures for the
>   * given pack, returning zero on success and a negative value otherwise.
> + *
> + * If a '.rev' file is present, it is checked for consistency, mmap'd, and
> + * pointers are assigned into it (instead of using the in-memory variant).

Hmph, I missed where it got checked for consistency, though.  If the
file is corrupt and has say duplicated entries, we'd happily grab
the data via get_be32(), for example.

> @@ -55,7 +58,9 @@ uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
>   * If the reverse index has not yet been loaded, or the position is out of
>   * bounds, this function aborts.
>   *
> - * This function runs in constant time.
> + * This function runs in constant time under both in-memory and on-disk reverse
> + * indexes, but an additional step is taken to consult the corresponding .idx
> + * file when using the on-disk format.

Again, I know this is a kind of detail that is interesting to those
who implemented the function, but I wonder how it would help those
who wonder if they should call it or use some other method to
achieve what they want.




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux