Re: [PATCH v2 05/30] loose: add a mapping between SHA-1 and SHA-256 for loose objects

On Sun, Oct 01, 2023 at 09:40:09PM -0500, Eric W. Biederman wrote:
> From: "brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx>
> 
> As part of the transition plan, we'd like to add a file in the .git
> directory that maps loose objects between SHA-1 and SHA-256.  Let's
> implement the specification in the transition plan and store this data
> on a per-repository basis in struct repository.
> 
> Signed-off-by: brian m. carlson <sandals@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
> ---
>  Makefile              |   1 +
>  loose.c               | 246 ++++++++++++++++++++++++++++++++++++++++++
>  loose.h               |  22 ++++
>  object-file-convert.c |  14 ++-
>  object-store-ll.h     |   3 +
>  object.c              |   2 +
>  repository.c          |   6 ++
>  7 files changed, 293 insertions(+), 1 deletion(-)
>  create mode 100644 loose.c
>  create mode 100644 loose.h
> 
> diff --git a/Makefile b/Makefile
> index f7e824f25cda..3c18664def9a 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1053,6 +1053,7 @@ LIB_OBJS += list-objects-filter.o
>  LIB_OBJS += list-objects.o
>  LIB_OBJS += lockfile.o
>  LIB_OBJS += log-tree.o
> +LIB_OBJS += loose.o
>  LIB_OBJS += ls-refs.o
>  LIB_OBJS += mailinfo.o
>  LIB_OBJS += mailmap.o
> diff --git a/loose.c b/loose.c
> new file mode 100644
> index 000000000000..6ba73cc84dca
> --- /dev/null
> +++ b/loose.c

When I read "loose" I immediately think of loose objects only. I would
not expect the file to be about mapping object IDs, which I assume has
to happen for packed objects as well?

It very much seems like you explicitly only care about loose objects in
the code here, which is weird to me. If that is in fact intentional
because we learn to store the compat object hash in pack files over the
course of this patch series, then it would make sense to explain this in
more depth in the commit message.

> @@ -0,0 +1,246 @@
> +#include "git-compat-util.h"
> +#include "hash.h"
> +#include "path.h"
> +#include "object-store.h"
> +#include "hex.h"
> +#include "wrapper.h"
> +#include "gettext.h"
> +#include "loose.h"
> +#include "lockfile.h"
> +
> +static const char *loose_object_header = "# loose-object-idx\n";
> +
> +static inline int should_use_loose_object_map(struct repository *repo)
> +{
> +	return repo->compat_hash_algo && repo->gitdir;
> +}
> +
> +void loose_object_map_init(struct loose_object_map **map)
> +{
> +	struct loose_object_map *m;
> +	m = xmalloc(sizeof(**map));
> +	m->to_compat = kh_init_oid_map();
> +	m->to_storage = kh_init_oid_map();
> +	*map = m;
> +}
> +
> +static int insert_oid_pair(kh_oid_map_t *map, const struct object_id *key, const struct object_id *value)
> +{
> +	khiter_t pos;
> +	int ret;
> +	struct object_id *stored;
> +
> +	pos = kh_put_oid_map(map, *key, &ret);
> +
> +	/* This item already exists in the map. */
> +	if (ret == 0)
> +		return 0;

Should we safeguard this and compare whether the key's value matches the
passed-in value? One of the more general themes that I'm worried about
is what happens when we hit hash collisions (e.g. two objects mapping to
the same SHA1, but different SHA256 hashes), and safeguarding us against
this possibility feels sensible to me.
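
To illustrate the kind of safeguard I have in mind, here is a minimal
standalone sketch. It deliberately does not use khash or `struct
object_id`; `toy_oid`, `toy_map` and `toy_insert_checked()` are
hypothetical names that only exist for this example, and the linear scan
stands in for the real hash lookup.

```c
#include <string.h>

#define TOY_OID_LEN 32	/* assumption: large enough for a raw SHA-256 */
#define TOY_MAP_CAP 16	/* assumption: tiny fixed capacity for the sketch */

struct toy_oid {
	unsigned char hash[TOY_OID_LEN];
};

struct toy_map {
	struct toy_oid keys[TOY_MAP_CAP];
	struct toy_oid values[TOY_MAP_CAP];
	size_t nr;
};

/*
 * Returns 1 if the pair was inserted, 0 if the identical pair was
 * already present, -1 if the key is already mapped to a different
 * value, and -2 if the toy map is full.
 */
static int toy_insert_checked(struct toy_map *map,
			      const struct toy_oid *key,
			      const struct toy_oid *value)
{
	size_t i;

	for (i = 0; i < map->nr; i++) {
		if (memcmp(&map->keys[i], key, sizeof(*key)))
			continue;
		/* Key exists: only accept it if the value matches, too. */
		return memcmp(&map->values[i], value, sizeof(*value)) ? -1 : 0;
	}

	if (map->nr == TOY_MAP_CAP)
		return -2;
	map->keys[map->nr] = *key;
	map->values[map->nr] = *value;
	map->nr++;
	return 1;
}
```

In the patch itself the equivalent check would presumably compare
`kh_value(map, pos)` against `value` in the `ret == 0` branch and report
a mismatch to the caller instead of silently returning 0.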

> +	stored = xmalloc(sizeof(*stored));
> +	oidcpy(stored, value);
> +	kh_value(map, pos) = stored;
> +	return 1;
> +}
> +
> +static int load_one_loose_object_map(struct repository *repo, struct object_directory *dir)
> +{
> +	struct strbuf buf = STRBUF_INIT, path = STRBUF_INIT;
> +	FILE *fp;
> +
> +	if (!dir->loose_map)
> +		loose_object_map_init(&dir->loose_map);
> +
> +	insert_oid_pair(dir->loose_map->to_compat, repo->hash_algo->empty_tree, repo->compat_hash_algo->empty_tree);
> +	insert_oid_pair(dir->loose_map->to_storage, repo->compat_hash_algo->empty_tree, repo->hash_algo->empty_tree);
> +
> +	insert_oid_pair(dir->loose_map->to_compat, repo->hash_algo->empty_blob, repo->compat_hash_algo->empty_blob);
> +	insert_oid_pair(dir->loose_map->to_storage, repo->compat_hash_algo->empty_blob, repo->hash_algo->empty_blob);
> +
> +	insert_oid_pair(dir->loose_map->to_compat, repo->hash_algo->null_oid, repo->compat_hash_algo->null_oid);
> +	insert_oid_pair(dir->loose_map->to_storage, repo->compat_hash_algo->null_oid, repo->hash_algo->null_oid);
> +
> +	strbuf_git_common_path(&path, repo, "objects/loose-object-idx");
> +	fp = fopen(path.buf, "rb");
> +	if (!fp) {
> +		strbuf_release(&path);
> +		return 0;

I think we should discern ENOENT from other errors. Failing gracefully
when the file doesn't exist may be sensible, but not when we failed due
to something like an I/O error.
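
As a rough standalone sketch of that distinction (plain stdio, not the
Git wrappers; `open_optional_file()` is a hypothetical helper made up
for this mail):

```c
#include <errno.h>
#include <stdio.h>

/*
 * Opens a file that is allowed to be missing.
 *
 * Returns 0 with *out set to the stream if the file was opened,
 * 0 with *out set to NULL if the file does not exist, and -1 for
 * any other error (permissions, I/O errors, ...).
 */
static int open_optional_file(const char *path, FILE **out)
{
	*out = fopen(path, "rb");
	if (*out)
		return 0;
	/* A missing file is fine, everything else is a real failure. */
	return errno == ENOENT ? 0 : -1;
}
```

The caller can then treat a NULL stream as "no map yet" while still
propagating genuine failures.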

> +	}
> +
> +	errno = 0;
> +	if (strbuf_getwholeline(&buf, fp, '\n') || strcmp(buf.buf, loose_object_header))
> +		goto err;
> +	while (!strbuf_getline_lf(&buf, fp)) {
> +		const char *p;
> +		struct object_id oid, compat_oid;
> +		if (parse_oid_hex_algop(buf.buf, &oid, &p, repo->hash_algo) ||
> +		    *p++ != ' ' ||
> +		    parse_oid_hex_algop(p, &compat_oid, &p, repo->compat_hash_algo) ||
> +		    p != buf.buf + buf.len)
> +			goto err;
> +		insert_oid_pair(dir->loose_map->to_compat, &oid, &compat_oid);
> +		insert_oid_pair(dir->loose_map->to_storage, &compat_oid, &oid);
> +	}

Is the actual format specified anywhere? I have to wonder about the
scalability of a simple line-based format with one line per object ID.
Two main concerns:

  1. If the format is unsorted and we simply append to it whenever the
     repo gains new objects then we are forced to always load the
     complete map into memory. This would be quite inefficient in larger
     repositories that have millions of objects. Every line contains two
     hex object hashes, a separating space and a trailing newline, which
     amounts to `(2 + 40 + 64) * $numobjects` bytes.

     For linux.git with more than 10 million objects, the map would thus
     be around 1GB in size. Loading that into memory and converting it
     into maps feels prohibitively expensive to me.

  2. If the format were sorted then we could perform binary searches
     inside the file to look up object IDs because we know that each
     line has a fixed length. On the other hand, adding new objects
     would require us to rewrite the whole file every time.

I think loading the complete object map into memory is simply too
expensive in any larger "real-world" repository. But rewriting a sorted
file format every time we add new objects feels sufficiently expensive,
too. Neither of these properties sounds like it would be feasible to use
for larger Git hosting platforms. So I think we should put some more
thought into this.
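
For reference, the back-of-the-envelope numbers can be written down as
follows. This is only the arithmetic from above, ignoring the header
line; the enum and function names are made up for this mail, and the raw
variant assumes a hypothetical format that stores binary object IDs
without separators.

```c
#include <stdint.h>

enum {
	SHA1_HEX_LEN = 40,
	SHA256_HEX_LEN = 64,
	SHA1_RAW_LEN = 20,
	SHA256_RAW_LEN = 32
};

/* One text line per object: two hex object IDs, a space and a newline. */
static uint64_t hex_map_size(uint64_t nr_objects)
{
	return nr_objects * (SHA1_HEX_LEN + 1 + SHA256_HEX_LEN + 1);
}

/* Hypothetical binary record per object: two raw object IDs, no separators. */
static uint64_t raw_map_size(uint64_t nr_objects)
{
	return nr_objects * (SHA1_RAW_LEN + SHA256_RAW_LEN);
}
```

For 10 million objects this gives 1,060,000,000 bytes for the hex format
and 520,000,000 bytes for raw object IDs.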

Some proposals:

  - We shouldn't store hex characters but raw object IDs, thus reducing
    the size of the file by almost half.

  - We should store the file sorted so that we can avoid loading it into
    memory and do binary searches.

  - We might grow this into a "stack" of object maps so that it becomes
    easier to add new objects to the map without having to rewrite it
    every time. With geometric repacking this should be somewhat
    manageable.

We don't have to do all of this right from the beginning; I just want to
start the discussion around this.
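
To make the "sorted, raw object IDs" idea a bit more concrete, here is a
minimal sketch of a lookup over fixed-width records. The record layout
(20-byte key followed by 32-byte value, sorted by key) and the name
`lookup_sorted()` are assumptions for illustration only, not a proposal
for the final on-disk format.

```c
#include <stddef.h>
#include <string.h>

#define KEY_LEN 20	/* assumption: raw SHA-1 as the lookup key */
#define VALUE_LEN 32	/* assumption: raw SHA-256 as the mapped value */
#define RECORD_LEN (KEY_LEN + VALUE_LEN)

/*
 * `table` holds `nr` records of RECORD_LEN bytes each, sorted by key
 * in memcmp() order. Returns a pointer to the value of the matching
 * record, or NULL if the key is not present.
 */
static const unsigned char *lookup_sorted(const unsigned char *table,
					  size_t nr,
					  const unsigned char *key)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const unsigned char *rec = table + mid * RECORD_LEN;
		int cmp = memcmp(key, rec, KEY_LEN);

		if (!cmp)
			return rec + KEY_LEN;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}
```

Because every record has the same length, `table` could just as well be
a memory-mapped file, so a lookup only touches a logarithmic number of
records instead of loading the whole map.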

> +	strbuf_release(&buf);
> +	strbuf_release(&path);
> +	return errno ? -1 : 0;

It feels quite fragile to me to check for `errno` in this way. Should we
instead check `ferror(fp)`?
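
A small standalone sketch of what I mean, using plain stdio rather than
strbuf (`count_lines()` is a made-up example function): once the read
loop hits EOF, the stream's error indicator tells us whether we stopped
because of a read error or because the file simply ended.

```c
#include <stdio.h>

/*
 * Counts the newline characters in `fp`. Returns the count, or -1 if
 * reading stopped because of a read error rather than end of file.
 */
static long count_lines(FILE *fp)
{
	long nr = 0;
	int c;

	while ((c = getc(fp)) != EOF)
		if (c == '\n')
			nr++;

	/* EOF alone is ambiguous; ferror() disambiguates it. */
	return ferror(fp) ? -1 : nr;
}
```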

> +err:
> +	strbuf_release(&buf);
> +	strbuf_release(&path);
> +	return -1;
> +}

We could deduplicate the error paths by storing the return value into an
`int ret`.
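
Roughly like this, shown on a standalone toy function so that it
compiles outside of Git (`check_header()` is hypothetical and uses plain
stdio instead of strbuf):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 0 if the first line of `path` equals `header`, -1 otherwise. */
static int check_header(const char *path, const char *header)
{
	size_t len = strlen(header);
	char *line = NULL;
	FILE *fp;
	int ret = -1;

	fp = fopen(path, "rb");
	if (!fp)
		goto out;
	/* Room for the header, one extra byte to detect longer lines, and NUL. */
	line = malloc(len + 2);
	if (!line)
		goto out;
	if (!fgets(line, (int)len + 2, fp))
		goto out;
	if (strcmp(line, header))
		goto out;

	ret = 0;
out:
	/* Single cleanup path shared by success and failure. */
	free(line);
	if (fp)
		fclose(fp);
	return ret;
}
```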

> +int repo_read_loose_object_map(struct repository *repo)
> +{
> +	struct object_directory *dir;
> +
> +	if (!should_use_loose_object_map(repo))
> +		return 0;
> +
> +	prepare_alt_odb(repo);
> +
> +	for (dir = repo->objects->odb; dir; dir = dir->next) {
> +		if (load_one_loose_object_map(repo, dir) < 0) {
> +			return -1;
> +		}
> +	}

The braces here are not needed.

> +	return 0;
> +}
> +
> +int repo_write_loose_object_map(struct repository *repo)
> +{
> +	kh_oid_map_t *map = repo->objects->odb->loose_map->to_compat;
> +	struct lock_file lock;
> +	int fd;
> +	khiter_t iter;
> +	struct strbuf buf = STRBUF_INIT, path = STRBUF_INIT;
> +
> +	if (!should_use_loose_object_map(repo))
> +		return 0;
> +
> +	strbuf_git_common_path(&path, repo, "objects/loose-object-idx");
> +	fd = hold_lock_file_for_update_timeout(&lock, path.buf, LOCK_DIE_ON_ERROR, -1);
> +	iter = kh_begin(map);
> +	if (write_in_full(fd, loose_object_header, strlen(loose_object_header)) < 0)
> +		goto errout;
> +
> +	for (; iter != kh_end(map); iter++) {
> +		if (kh_exist(map, iter)) {
> +			if (oideq(&kh_key(map, iter), the_hash_algo->empty_tree) ||
> +			    oideq(&kh_key(map, iter), the_hash_algo->empty_blob))
> +				continue;
> +			strbuf_addf(&buf, "%s %s\n", oid_to_hex(&kh_key(map, iter)), oid_to_hex(kh_value(map, iter)));
> +			if (write_in_full(fd, buf.buf, buf.len) < 0)
> +				goto errout;
> +			strbuf_reset(&buf);
> +		}
> +	}
> +	strbuf_release(&buf);
> +	if (commit_lock_file(&lock) < 0) {
> +		error_errno(_("could not write loose object index %s"), path.buf);
> +		strbuf_release(&path);
> +		return -1;
> +	}
> +	strbuf_release(&path);
> +	return 0;
> +errout:
> +	rollback_lock_file(&lock);
> +	strbuf_release(&buf);
> +	error_errno(_("failed to write loose object index %s\n"), path.buf);
> +	strbuf_release(&path);
> +	return -1;

Same here, we should be able to combine cleanup of both the successful
and error paths. It's safe to call `rollback_lock_file()` even if the
file has already been committed.

> +}
> +
> +static int write_one_object(struct repository *repo, const struct object_id *oid,
> +			    const struct object_id *compat_oid)
> +{
> +	struct lock_file lock;
> +	int fd;
> +	struct stat st;
> +	struct strbuf buf = STRBUF_INIT, path = STRBUF_INIT;
> +
> +	strbuf_git_common_path(&path, repo, "objects/loose-object-idx");
> +	hold_lock_file_for_update_timeout(&lock, path.buf, LOCK_DIE_ON_ERROR, -1);
> +
> +	fd = open(path.buf, O_WRONLY | O_CREAT | O_APPEND, 0666);
> +	if (fd < 0)
> +		goto errout;
> +	if (fstat(fd, &st) < 0)
> +		goto errout;
> +	if (!st.st_size && write_in_full(fd, loose_object_header, strlen(loose_object_header)) < 0)
> +		goto errout;
> +
> +	strbuf_addf(&buf, "%s %s\n", oid_to_hex(oid), oid_to_hex(compat_oid));
> +	if (write_in_full(fd, buf.buf, buf.len) < 0)
> +		goto errout;
> +	if (close(fd))
> +		goto errout;

It's not safe to update the file in-place like this. A concurrent reader
may end up seeing partial lines and error out. Also, if we were to crash
we might easily end up with a corrupted mapping file.
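
The usual way around this is to write the new contents to a temporary
file and rename it into place, which is what
`repo_write_loose_object_map()` already gets from the lockfile API via
`commit_lock_file()`. As a standalone POSIX sketch of the pattern
(`replace_file_atomically()` and the fixed-size path buffer are
assumptions made for this example, not Git code):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Replaces `path` with `len` bytes from `data` by writing "<path>.lock"
 * and renaming it over `path`. Readers see either the old or the new
 * file, never a partial one. Returns 0 on success, -1 on error.
 */
static int replace_file_atomically(const char *path, const char *data, size_t len)
{
	char tmp[4096];
	int fd, n;

	n = snprintf(tmp, sizeof(tmp), "%s.lock", path);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;

	/* O_EXCL makes the temporary file double as a lock. */
	fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0666);
	if (fd < 0)
		return -1;

	while (len) {
		ssize_t written = write(fd, data, len);
		if (written < 0) {
			if (errno == EINTR)
				continue;
			goto err;
		}
		data += written;
		len -= (size_t)written;
	}
	if (fsync(fd) < 0)
		goto err;
	if (close(fd) < 0) {
		fd = -1;
		goto err;
	}
	fd = -1;
	if (rename(tmp, path) < 0)
		goto err;
	return 0;
err:
	if (fd >= 0)
		close(fd);
	unlink(tmp);
	return -1;
}
```

Adding a single entry this way means copying the existing contents plus
the new line into the temporary file, which ties back into the
scalability question raised further up.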

> +	adjust_shared_perm(path.buf);
> +	rollback_lock_file(&lock);
> +	strbuf_release(&buf);
> +	strbuf_release(&path);
> +	return 0;
> +errout:
> +	error_errno(_("failed to write loose object index %s\n"), path.buf);
> +	close(fd);
> +	rollback_lock_file(&lock);
> +	strbuf_release(&buf);
> +	strbuf_release(&path);
> +	return -1;

Same.

> +}
> +
> +int repo_add_loose_object_map(struct repository *repo, const struct object_id *oid,
> +			      const struct object_id *compat_oid)
> +{
> +	int inserted = 0;
> +
> +	if (!should_use_loose_object_map(repo))
> +		return 0;
> +
> +	inserted |= insert_oid_pair(repo->objects->odb->loose_map->to_compat, oid, compat_oid);
> +	inserted |= insert_oid_pair(repo->objects->odb->loose_map->to_storage, compat_oid, oid);
> +	if (inserted)
> +		return write_one_object(repo, oid, compat_oid);
> +	return 0;
> +}
> +
> +int repo_loose_object_map_oid(struct repository *repo,
> +			      const struct object_id *src,
> +			      const struct git_hash_algo *to,
> +			      struct object_id *dest)
> +{
> +	struct object_directory *dir;
> +	kh_oid_map_t *map;
> +	khiter_t pos;
> +
> +	for (dir = repo->objects->odb; dir; dir = dir->next) {
> +		struct loose_object_map *loose_map = dir->loose_map;
> +		if (!loose_map)
> +			continue;
> +		map = (to == repo->compat_hash_algo) ?
> +			loose_map->to_compat :
> +			loose_map->to_storage;
> +		pos = kh_get_oid_map(map, *src);
> +		if (pos < kh_end(map)) {
> +			oidcpy(dest, kh_value(map, pos));
> +			return 0;
> +		}
> +	}
> +	return -1;
> +}
> +
> +void loose_object_map_clear(struct loose_object_map **map)

Nit: I'd rather call it `loose_object_map_release()`. `clear` typically
indicates that we clear contents, but do not end up freeing the
containing structure.

> +{
> +	struct loose_object_map *m = *map;
> +	struct object_id *oid;
> +
> +	if (!m)
> +		return;
> +
> +	kh_foreach_value(m->to_compat, oid, free(oid));
> +	kh_foreach_value(m->to_storage, oid, free(oid));
> +	kh_destroy_oid_map(m->to_compat);
> +	kh_destroy_oid_map(m->to_storage);
> +	free(m);
> +	*map = NULL;
> +}
> diff --git a/loose.h b/loose.h
> new file mode 100644
> index 000000000000..2c2957072c5f
> --- /dev/null
> +++ b/loose.h
> @@ -0,0 +1,22 @@
> +#ifndef LOOSE_H
> +#define LOOSE_H
> +
> +#include "khash.h"
> +
> +struct loose_object_map {
> +	kh_oid_map_t *to_compat;
> +	kh_oid_map_t *to_storage;
> +};

Any specific reason why you don't use `struct oidmap` here?

Patrick

> +void loose_object_map_init(struct loose_object_map **map);
> +void loose_object_map_clear(struct loose_object_map **map);
> +int repo_loose_object_map_oid(struct repository *repo,
> +			      const struct object_id *src,
> +			      const struct git_hash_algo *dest_algo,
> +			      struct object_id *dest);
> +int repo_add_loose_object_map(struct repository *repo, const struct object_id *oid,
> +			      const struct object_id *compat_oid);
> +int repo_read_loose_object_map(struct repository *repo);
> +int repo_write_loose_object_map(struct repository *repo);
> +
> +#endif
> diff --git a/object-file-convert.c b/object-file-convert.c
> index 4777aba83636..1ec945eaa17f 100644
> --- a/object-file-convert.c
> +++ b/object-file-convert.c
> @@ -4,6 +4,7 @@
>  #include "repository.h"
>  #include "hash-ll.h"
>  #include "object.h"
> +#include "loose.h"
>  #include "object-file-convert.h"
>  
>  int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
> @@ -21,7 +22,18 @@ int repo_oid_to_algop(struct repository *repo, const struct object_id *src,
>  			oidcpy(dest, src);
>  		return 0;
>  	}
> -	return -1;
> +	if (repo_loose_object_map_oid(repo, src, to, dest)) {
> +		/*
> +		 * We may have loaded the object map at repo initialization but
> +		 * another process (perhaps upstream of a pipe from us) may have
> +		 * written a new object into the map.  If the object is missing,
> +		 * let's reload the map to see if the object has appeared.
> +		 */
> +		repo_read_loose_object_map(repo);
> +		if (repo_loose_object_map_oid(repo, src, to, dest))
> +			return -1;
> +	}
> +	return 0;
>  }
>  
>  int convert_object_file(struct strbuf *outbuf,
> diff --git a/object-store-ll.h b/object-store-ll.h
> index 26a3895c821c..bc76d6bec80d 100644
> --- a/object-store-ll.h
> +++ b/object-store-ll.h
> @@ -26,6 +26,9 @@ struct object_directory {
>  	uint32_t loose_objects_subdir_seen[8]; /* 256 bits */
>  	struct oidtree *loose_objects_cache;
>  
> +	/* Map between object IDs for loose objects. */
> +	struct loose_object_map *loose_map;
> +
>  	/*
>  	 * This is a temporary object store created by the tmp_objdir
>  	 * facility. Disable ref updates since the objects in the store
> diff --git a/object.c b/object.c
> index 2c61e4c86217..186a0a47c0fb 100644
> --- a/object.c
> +++ b/object.c
> @@ -13,6 +13,7 @@
>  #include "alloc.h"
>  #include "packfile.h"
>  #include "commit-graph.h"
> +#include "loose.h"
>  
>  unsigned int get_max_object_index(void)
>  {
> @@ -540,6 +541,7 @@ void free_object_directory(struct object_directory *odb)
>  {
>  	free(odb->path);
>  	odb_clear_loose_cache(odb);
> +	loose_object_map_clear(&odb->loose_map);
>  	free(odb);
>  }
>  
> diff --git a/repository.c b/repository.c
> index 80252b79e93e..6214f61cf4e7 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -14,6 +14,7 @@
>  #include "read-cache-ll.h"
>  #include "remote.h"
>  #include "setup.h"
> +#include "loose.h"
>  #include "submodule-config.h"
>  #include "sparse-index.h"
>  #include "trace2.h"
> @@ -109,6 +110,8 @@ void repo_set_compat_hash_algo(struct repository *repo, int algo)
>  	if (hash_algo_by_ptr(repo->hash_algo) == algo)
>  		BUG("hash_algo and compat_hash_algo match");
>  	repo->compat_hash_algo = algo ? &hash_algos[algo] : NULL;
> +	if (repo->compat_hash_algo)
> +		repo_read_loose_object_map(repo);
>  }
>  
>  /*
> @@ -201,6 +204,9 @@ int repo_init(struct repository *repo,
>  	if (worktree)
>  		repo_set_worktree(repo, worktree);
>  
> +	if (repo->compat_hash_algo)
> +		repo_read_loose_object_map(repo);
> +
>  	clear_repository_format(&format);
>  	return 0;
>  
> -- 
> 2.41.0
> 
