On Tue, Sep 20, 2022 at 03:27:29PM -0400, John Cai wrote: > One observation is that `git-fsck` seems to use up alot more memory in calling > unpack_compressed_entry() than `git-rev-list` does. That just means it's data coming from a pack. The interesting parts are further up the stack, I think: > | | ->40.48% (826,445,820B) 0x2B2447: unpack_compressed_entry (packfile.c:1601) > | | | ->40.48% (826,445,820B) 0x2B46AB: unpack_entry (packfile.c:1768) > | | | | ->40.48% (826,445,820B) 0x2B4A23: cache_or_unpack_entry (packfile.c:1438) > | | | | ->40.48% (826,445,820B) 0x2B4A23: packed_object_info (packfile.c:1516) > | | | | ->40.48% (826,445,820B) 0x29EA83: do_oid_object_info_extended (object-file.c:1620) > | | | | ->40.48% (826,445,820B) 0x29EBDB: oid_object_info_extended (object-file.c:1639) > | | | | ->40.48% (826,445,820B) 0x29EDC3: read_object (object-file.c:1671) > | | | | ->40.48% (826,445,820B) 0x29EE5B: read_object_file_extended (object-file.c:1714) > | | | | ->40.12% (819,056,147B) 0x2143BB: repo_read_object_file (object-store.h:253) > | | | | | ->40.12% (819,056,147B) 0x2143BB: repo_parse_commit_internal (commit.c:511) > | | | | | ->40.12% (819,056,147B) 0x2143BB: repo_parse_commit_internal (commit.c:495) > | | | | | ->40.12% (819,056,147B) 0x25A747: repo_parse_commit (commit.h:90) > | | | | | ->40.12% (819,056,147B) 0x25A747: fsck_walk_commit (fsck.c:355) OK, so we're holding onto commit buffers after parsing them. We probably should tell the parser not to do that. It's useful for git-log, etc, which are going to pretty-print the buffer, but not for this use. You can set the global save_commit_buffer to do that. > | | ->08.61% (175,870,114B) 0x2B9DC3: patch_delta (patch-delta.c:36) > | | | ->08.61% (175,870,114B) 0x2B42EB: unpack_entry (packfile.c:1829) > | | | ->08.61% (175,870,114B) 0x2B4A23: cache_or_unpack_entry (packfile.c:1438) > | | | ->08.61% (175,870,114B) 0x2B4A23: packed_object_info (packfile.c:1516) > | | | ->08.61% (175,870,114B) 0x29EA83: do_oid_object_info_extended (object-file.c:1620) > | | | ->08.61% (175,870,114B) 0x29EBDB: oid_object_info_extended (object-file.c:1639) > | | | ->08.61% (175,870,114B) 0x29EDC3: read_object (object-file.c:1671) > | | | ->08.61% (175,870,114B) 0x29EE5B: read_object_file_extended (object-file.c:1714) > | | | ->04.52% (92,219,588B) 0x3461CB: repo_read_object_file (object-store.h:253) > | | | | ->04.52% (92,219,588B) 0x3461CB: parse_tree_gently.part.0 (tree.c:132) > | | | | ->04.52% (92,219,588B) 0x25A4EB: parse_tree (tree.h:24) > | | | | ->04.52% (92,219,588B) 0x25A4EB: fsck_walk_tree (fsck.c:307) And this is data we've allocated for trees. It kind of looks like fsck_walk_tree() never bothers to clean up the trees it parses, leaving the buffers attached to the tree structs. But that can't be the case, because linux.git has something like 16GB of trees. These may be entries we keep in the internal delta cache, though it should be a bit smaller than what you have here (the default is 96MB; you can drop it with core.deltaBaseCacheLimit, but runtime may suffer). There's a call to free_tree_buffer() in builtin/fsck.c:traverse_one(); that may be what ends up freeing things. It's been a while since I've traced through the call paths for fsck. > But, not having delved too deeply into the code, I wanted to ask the list if > anything jumps out as to why `git-fsck` consumes so much more memory than > `git-rev-list`--perhaps there is opportunity for improvement/optimization? The patch below improves my peak heap (as reported by massif) for "git fsck --connectivity-only" on linux.git from ~2GB to ~1GB. But in general, I think "rev-list" is a better tool for connectivity checks. One problem with fsck is that it tries to read the set of available objects up front, making it racy in a repository which is receiving pushes, or which may be repacked (e.g., if somebody makes a new object and references it, "fsck --connectivity-only" may fail to notice the new object but will see the updated ref, and think the ref is corrupt). diff --git a/builtin/fsck.c b/builtin/fsck.c index f7916f06ed..949073a00d 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -853,6 +853,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) errors_found = 0; read_replace_refs = 0; + save_commit_buffer = 0; argc = parse_options(argc, argv, prefix, fsck_opts, fsck_usage, 0); -Peff