On Tue, Jul 8, 2008 at 10:14 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote: > "Geoffrey Irving" <irving@xxxxxxx> writes: > >> From a3afd1455d215a541e1481e0f064df743d9219cc Mon Sep 17 00:00:00 2001 > > Please drop this line. > >> From: Geoffrey Irving <irving@xxxxxxx> >> Date: Sat, 7 Jun 2008 16:03:49 -0700 >> Subject: [PATCH 1/3] cherry: cache patch-ids to avoid repeating work > > These are Ok _if_ the difference between them and what you have in your > e-mail header really matter (e.g. you are forwarding somebody else's > patch). I don't think it is in this case, though. > >> Added cached-sha-map.[hc] implementing a persistent hash map from sha1 to >> sha1. > > "Add cached-sha1-map.[ch]" (imperative mood), not "(here is what I) _did_". > >> diff --git a/cached-sha1-map.c b/cached-sha1-map.c >> new file mode 100644 >> index 0000000..e363745 >> --- /dev/null >> +++ b/cached-sha1-map.c >> @@ -0,0 +1,182 @@ >> +#include "cached-sha1-map.h" >> + >> +union cached_sha1_map_header { >> + struct { >> + char signature[4]; /* HASH */ >> + off_t count, size; >> + }; >> + struct cached_sha1_entry padding; /* pad header out to 40 bytes */ >> +}; >> + >> +static const char *signature = "HASH"; > > That sounds a bit too generic, doesn't it, to protect ourselves from > getting confused by some other filetype? > >> +static void init_cached_sha1_map(struct cached_sha1_map *cache) >> +{ >> + int fd; >> + union cached_sha1_map_header header; >> + >> + if (cache->initialized) >> + return; >> + >> + fd = open(git_path(cache->filename), O_RDONLY); >> + if (fd < 0) { >> + init_empty_map(cache, 64); >> + return; > > Check errno and do this only when ENOENT. Other errors should be caught > and reported. > >> + } >> + >> + if (read_in_full(fd, &header, sizeof(header)) != sizeof(header)) >> + die("cannot read %s header", cache->filename); >> + >> + if (memcmp(header.signature, signature, 4)) >> + die("%s has invalid header", cache->filename); >> + >> + if (header.size & (header.size-1)) >> + die("%s size %lld is not a power of two", cache->filename, >> + (long long)header.size); > > Two issues and a half: > > - Isn't it gcc extension to be able to say header.signature, bypassing > the anonymous structure inside the union that the "header" itself is? > > - The signature header (count and size) is defined to be off_t, which > makes the cached file unusable across architectures. The map header > structure should be specified with explicit size: > > union { > struct { > char sig[4]; > uint32_t version; > uint32_t count; > unit32_t size; > } u; > struct cached_sha1_entry pad; > }; > > the uint32_t fields should be treated as network byte order integers, > e.g. > > cache->count = ntohl(header.u.count); > header.u.size = htonl(cache->size); > > - If this file is truly intended as "cache", shouldn't corruption of it > be detected, reported but otherwise ignored, so that the lookup would > continue in degraded uncached mode? > >> + /* check off_t to size_t conversion */ >> + if (cache->count != header.count || cache->size != header.size) >> + die("%s is too large to hold in memory", cache->filename); > > This does not make sense to me. What you are checking does not match the > error message. > > If you are making the file format architecture dependent (which I would > suggest strongly against), you can use the same type and be done with it. > Otherwise, if you are making the format portable across architectures, > then you would know how large the on-disk integer will be, so as long as > you use appropriate type that is large enough for cache->count you should > be Ok. > > What you may want to check is that (header.u.size + 1) * sizeof(entry) > does not wrap around, but you don't. > >> + /* mmap entire file so that file / memory blocks are aligned */ >> + cache->entries = xmmap(NULL, >> + sizeof(struct cached_sha1_entry) * (header.size + 1), >> + PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0); > > I think this will die() if the file is too large to map. Ideally you > would want to allow this mmap to fail if the cache is too large, in which > case you can gracefully degrade to cacheless mode of operation, but that > can probably be left to 47th round. > >> +int write_cached_sha1_map(struct cached_sha1_map *cache) >> +{ >> + union cached_sha1_map_header header; >> + struct lock_file update_lock; >> + int fd; >> + size_t entry_size; >> + >> + if (!cache->initialized || !cache->dirty) >> + return 0; >> + >> + fd = hold_lock_file_for_update(&update_lock, >> + git_path(cache->filename), 0); >> + >> + if (fd < 0) >> + return error("could not construct %s", cache->filename); > > Use a "const char *" to hold git_path(cache->filename) upfront in the > function, use it to obtain lock _and_ for reporting. > >> + memcpy(header.signature, signature, 4); >> + header.count = cache->count; >> + header.size = cache->size; > > And here will be htonl(). > >> + entry_size = sizeof(struct cached_sha1_entry) * cache->size; > > Typically "entry_size" means the size of individual entry; this is the > size of the whole thing. > >> + if (write_in_full(fd, &header, sizeof(header)) != sizeof(header) >> + || write_in_full(fd, cache->entries, entry_size) != entry_size) >> + return error("could not write %s", cache->filename); >> + >> + if (commit_lock_file(&update_lock) < 0) >> + return error("could not write %s", cache->filename); >> + >> + cache->dirty = 0; >> + return 0; >> +} > > But it is good that you used this intermediate variable; the above > write_in_full() is much easier to read than the xmmap() above at the end > of init_cached_sha1_map() function. > >> +static size_t get_hash_index(const unsigned char *sha1) >> +{ >> + return ntohl(*(size_t*)sha1); >> +} > > Two issues: > > - I do not see any guarantee that sha1 is suitably aligned for reading > size_t bytes off of; > > - size_t is architecture dependent, so you would get different hash value > depending on the architecture, which again makes this file format > unportable. > >> diff --git a/patch-ids.c b/patch-ids.c >> index 3be5d31..36332f3 100644 >> --- a/patch-ids.c >> +++ b/patch-ids.c >> @@ -2,17 +2,31 @@ >> #include "diff.h" >> #include "commit.h" >> #include "patch-ids.h" >> +#include "cached-sha1-map.h" >> + >> +struct cached_sha1_map patch_id_cache; > > Does this have to be extern? > >> static int commit_patch_id(struct commit *commit, struct diff_options *options, >> unsigned char *sha1) >> { >> + /* pull patch-id out of the cache if possible */ >> + patch_id_cache.filename = "patch-id-cache"; >> + if (!get_cached_sha1_entry(&patch_id_cache, commit->object.sha1, sha1)) >> + return 0; >> + >> if (commit->parents) >> diff_tree_sha1(commit->parents->item->object.sha1, >> commit->object.sha1, "", options); >> else >> diff_root_tree_sha1(commit->object.sha1, "", options); >> diffcore_std(options); >> - return diff_flush_patch_id(options, sha1); >> + int ret = diff_flush_patch_id(options, sha1); > > Decl-after-statement. > >> + if (ret) >> + return ret; >> + >> + /* record commit, patch-id pair in cache */ >> + set_cached_sha1_entry(&patch_id_cache, commit->object.sha1, sha1); >> + return 0; >> } Thanks. I'll fix these in the next few days. Should I rewrite the patch sequence to incorporate these changes into the first commit, or add them as a forth commit off the end? Geoffrey -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html