From: Lars Schneider <larsxschneider@xxxxxxxxx>

Hi,

I found a way to make Git LFS up to 100x faster in repositories with a
large number of Git LFS files. I am looking for comments on whether this
approach would be acceptable to the Git community.


## What is Git LFS?

Git LFS [1] is an extension to Git that handles large files for Git
repositories. The project has gained quite some momentum, as almost all
major Git hosting services support it (GitHub [2], Atlassian
Bitbucket [3], GitLab [4]).


## What is the problem with Git LFS?

Git LFS is an application that is invoked via Git's clean/smudge filter
mechanism. The process invocation of these filters takes noticeable time
(especially on Windows), even if the Git LFS executable only accesses its
local cache.

Based on my previous findings [5], Steve Streeting (@sinbad) improved the
clone times of Git LFS repositories with many files by a factor of 10 or
more [6][7]. Unfortunately, that fix only helps with cloning. Any local
Git operation that invokes the clean/smudge filter (e.g. switching
branches) is still slow. A user reported this issue on the Git mailing
list as well [8].
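For context, Git LFS 1.x plugs into the clean/smudge mechanism roughly
like this (a sketch of the usual setup, not part of the patch below; the
*.psd pattern is just an example):

    # .gitattributes -- route matching paths through the "lfs" filter
    *.psd filter=lfs diff=lfs merge=lfs -text

    # .git/config -- the commands Git spawns for every clean/smudge;
    # these are the command strings the patch matches against
    [filter "lfs"]
            clean = git-lfs clean %f
            smudge = git-lfs smudge %f

Every checkout or add of an LFS-tracked file therefore spawns a separate
git-lfs process, which is where the per-file overhead comes from.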
## Proposed solution

Git LFS caches its objects under .git/lfs/objects. Most of the time the
Git LFS objects are already available in that cache (e.g. if you switch
branches back and forth). I implemented these "cache hits" natively in
Git.

Please note that this implementation is just a quick and dirty proof of
concept. If the Git community agrees that this kind of approach would be
acceptable, then I will start to work on a proper patch series with
cross-platform support and unit tests.
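To make the mechanism concrete: a Git LFS pointer file stored in the
repository looks roughly like the following (the OID and size are made up
for illustration), and the patch derives the local cache path from the
first two byte pairs of the OID:

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345

    -> .git/lfs/objects/4d/7a/4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393

On smudge, the patched apply_filter() parses the OID out of such a
pointer and reads the file content directly from that cache path if it
exists. On clean, it hashes the work-tree file with SHA-256 and, on a
cache hit, regenerates the pointer itself. Only on a cache miss does Git
fall back to spawning the git-lfs filter as before.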
## Performance tests

I executed both test runs on a 2.5 GHz Intel Core i7 with an SSD running
OS X. A test run is the consecutive execution of four Git commands:

1. clone the repo
2. check out the "removed-files" branch
3. timed: check out the "master" branch
4. timed: check out the "removed-files" branch

Test command:

    set -x; git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo; cd repo; git checkout removed-files; time git checkout master; time git checkout removed-files

I compiled Git with the following flags:

    NO_GETTEXT=YesPlease NEEDS_SSL_WITH_CRYPTO=YesPlease make -j 8 CFLAGS="-I/usr/local/opt/openssl/include" LDFLAGS="-L/usr/local/opt/openssl/lib"


### TEST RUN A -- Default Git 2.9 (ab7797d) and Git LFS 1.2.1

+ git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo
Cloning into 'repo'...
warning: templates not found /Users/lars/share/git-core/templates
remote: Counting objects: 15012, done.
remote: Total 15012 (delta 0), reused 0 (delta 0), pack-reused 15012
Receiving objects: 100% (15012/15012), 2.02 MiB | 1.77 MiB/s, done.
Checking connectivity... done.
Checking out files: 100% (15001/15001), done.
Git LFS: (15000 of 15000 files) 0 B / 77.04 KB
+ cd repo
+ git checkout removed-files
Branch removed-files set up to track remote branch removed-files from origin.
Switched to a new branch 'removed-files'
+ git checkout master
Checking out files: 100% (12000/12000), done.
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

real    6m2.979s
user    2m39.066s
sys     2m41.610s

+ git checkout removed-files
Switched to branch 'removed-files'
Your branch is up-to-date with 'origin/removed-files'.

real    0m1.310s
user    0m0.385s
sys     0m0.881s


### TEST RUN B -- Default Git 2.9 with native LFS cache and Git LFS 1.2.1
    (https://github.com/larsxschneider/git/tree/lfs-cache)

+ git lfs clone https://github.com/larsxschneider/lfstest-manyfiles.git repo
Cloning into 'repo'...
warning: templates not found /Users/lars/share/git-core/templates
remote: Counting objects: 15012, done.
remote: Total 15012 (delta 0), reused 0 (delta 0), pack-reused 15012
Receiving objects: 100% (15012/15012), 2.02 MiB | 1.44 MiB/s, done.
Checking connectivity... done.
Git LFS: (15001 of 15000 files) 0 B / 77.04 KB
+ cd repo
+ git checkout removed-files
Branch removed-files set up to track remote branch removed-files from origin.
Switched to a new branch 'removed-files'
+ git checkout master
Checking out files: 100% (12000/12000), done.
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

real    0m2.267s
user    0m0.295s
sys     0m1.948s

+ git checkout removed-files
Switched to branch 'removed-files'
Your branch is up-to-date with 'origin/removed-files'.

real    0m0.715s
user    0m0.072s
sys     0m0.672s


### Results

Default Git:                      6m2.979s + 0m1.310s = 364s
Git with native LFS cache access: 0m2.267s + 0m0.715s =   4s

The native cache solution is almost 100x faster when switching branches
on my local machine with a test repository containing 15,000 Git LFS
files. Based on my previous experience with Git LFS clone, I expect even
more dramatic results on Windows.

Thanks,
Lars

[1] https://git-lfs.github.com/
[2] https://github.com/blog/1986-announcing-git-large-file-storage-lfs
[3] http://blogs.atlassian.com/2016/02/git-lfs-for-designers-game-developers-architects/
[4] https://about.gitlab.com/2015/11/23/announcing-git-lfs-support-in-gitlab/
[5] https://github.com/github/git-lfs/issues/931#issuecomment-172939381
[6] https://github.com/github/git-lfs/pull/988
[7] https://developer.atlassian.com/blog/2016/04/git-lfs-12-clone-faster/
[8] http://article.gmane.org/gmane.comp.version-control.git/297809
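One note on the TODOs in the patch: the new code paths are meant to be
guarded by a configuration option. A minimal sketch of such a guard,
assuming the option is called "lfs.native-cache" (the name used in the
TODO comments) and using Git's existing git_config_get_bool() API; the
helper below is hypothetical and not part of the patch:

    static int lfs_native_cache_enabled(void)
    {
            /* read "lfs.native-cache" once; default to off when unset */
            static int enabled = -1;
            if (enabled < 0) {
                    int value = 0;
                    if (git_config_get_bool("lfs.native-cache", &value))
                            value = 0; /* key not set: keep feature disabled */
                    enabled = value;
            }
            return enabled;
    }

apply_filter() could then consult this helper before taking the
cached_lfs_smudge()/cached_lfs_clean() shortcuts.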
---
 cache.h     |   2 +
 convert.c   | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
 csum-file.c |  31 +++++++++++++
 csum-file.h |   2 +
 hex.c       |  16 +++++++
 5 files changed, 172 insertions(+), 24 deletions(-)

diff --git a/cache.h b/cache.h
index 6049f86..57fdb18 100644
--- a/cache.h
+++ b/cache.h
@@ -1196,6 +1196,8 @@ extern char *sha1_to_hex_r(char *out, const unsigned char *sha1);
 extern char *sha1_to_hex(const unsigned char *sha1);	/* static buffer result! */
 extern char *oid_to_hex(const struct object_id *oid);	/* same static buffer as sha1_to_hex */
 
+extern char *sha256_to_hex_r(char *out, const unsigned char *sha1);
+
 extern int interpret_branch_name(const char *str, int len, struct strbuf *);
 extern int get_sha1_mb(const char *str, unsigned char *sha1);
 
diff --git a/convert.c b/convert.c
index b1614bf..006e9c4 100644
--- a/convert.c
+++ b/convert.c
@@ -18,6 +18,10 @@
 #define CONVERT_STAT_BITS_TXT_CRLF 0x2
 #define CONVERT_STAT_BITS_BIN 0x4
 
+const char *LFS_VERSION_MARKER = "version https://git-lfs.github.com/spec/v1\n";
+const char *LFS_OID_MARKER = "oid sha256:";
+
+
 enum crlf_action {
 	CRLF_UNDEFINED,
 	CRLF_BINARY,
@@ -427,6 +431,79 @@ static int filter_buffer_or_fd(int in, int out, void *data)
 	return (write_err || status);
 }
 
+static int cached_lfs_smudge(const char *src, const char *cmd,
+			     struct strbuf *lfsbuf)
+{
+	int ret = 0;
+	if (src &&
+	    strlen(src) > strlen(LFS_VERSION_MARKER) &&
+	    !strncmp(LFS_VERSION_MARKER, src, strlen(LFS_VERSION_MARKER)) &&
+	    !strcmp("git-lfs smudge %f", cmd)
+	) {
+		const char *lfs_oid_found = strstr(src, LFS_OID_MARKER);
+		if (lfs_oid_found) {
+			const char *lfs_oid = lfs_oid_found + strlen(LFS_OID_MARKER);
+
+			// Construct path to LFS object
+			strbuf_reset(lfsbuf);
+			strbuf_addstr(lfsbuf, git_pathdup("lfs/objects/"));
+			strbuf_add(lfsbuf, lfs_oid, 2);
+			strbuf_addch(lfsbuf, '/');
+			strbuf_add(lfsbuf, lfs_oid+2, 2);
+			strbuf_addch(lfsbuf, '/');
+			strbuf_add(lfsbuf, lfs_oid, 64);
+
+			if (access(lfsbuf->buf, F_OK) != -1) {
+				// LFS object found in local LFS cache
+				ret = 1;
+			}
+		}
+	}
+	return ret;
+}
+
+static int cached_lfs_clean(const char *path, const char *cmd,
+			    struct strbuf *lfsbuf)
+{
+	int ret = 0;
+	if (path && !strcmp("git-lfs clean %f", cmd)) {
+
+		// TODO: Is there an easy way to access the content of the last
+		// known committed state of this file in the Git repo? If yes,
+		// then we could read the last known Git LFS OID, construct a
+		// path in the Git LFS cache and compare this file against "path".
+		// If both files are equal then we can use the last known committed
+		// state as "clean" and we could get rid of the SHA256 dependency
+		// here.
+		ssize_t file_size;
+		unsigned char sha256[32];
+		sha256fd(path, sha256, &file_size);
+
+		char lfs_oid[65];
+		sha256_to_hex_r(lfs_oid, sha256);
+
+		// Construct path to LFS object
+		strbuf_reset(lfsbuf);
+		strbuf_addstr(lfsbuf, git_pathdup("lfs/objects/"));
+		strbuf_add(lfsbuf, lfs_oid, 2);
+		strbuf_addch(lfsbuf, '/');
+		strbuf_add(lfsbuf, lfs_oid+2, 2);
+		strbuf_addch(lfsbuf, '/');
+		strbuf_add(lfsbuf, lfs_oid, 64);
+
+		if (access(lfsbuf->buf, F_OK) != -1) {
+			// LFS object found in local LFS cache
+			strbuf_reset(lfsbuf);
+			strbuf_addstr(lfsbuf, LFS_VERSION_MARKER);
+			strbuf_addstr(lfsbuf, LFS_OID_MARKER);
+			strbuf_add(lfsbuf, lfs_oid, 64);
+			strbuf_addf(lfsbuf, "\nsize %"PRIuMAX"\n", (uintmax_t)file_size);
+			ret = 1;
+		}
+	}
+	return ret;
+}
+
 static int apply_filter(const char *path, const char *src, size_t len, int fd,
                         struct strbuf *dst, const char *cmd)
 {
@@ -437,6 +514,7 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	 * (child --> cmd) --> us
 	 */
 	int ret = 1;
+	struct strbuf lfsbuf = STRBUF_INIT;
 	struct strbuf nbuf = STRBUF_INIT;
 	struct async async;
 	struct filter_params params;
@@ -447,37 +525,56 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,
 	if (!dst)
 		return 1;
 
-	memset(&async, 0, sizeof(async));
-	async.proc = filter_buffer_or_fd;
-	async.data = &params;
-	async.out = -1;
-	params.src = src;
-	params.size = len;
-	params.fd = fd;
-	params.cmd = cmd;
-	params.path = path;
-
-	fflush(NULL);
-	if (start_async(&async))
-		return 0;	/* error was already reported */
-
-	if (strbuf_read(&nbuf, async.out, len) < 0) {
-		error("read from external filter %s failed", cmd);
-		ret = 0;
-	}
-	if (close(async.out)) {
-		error("read from external filter %s failed", cmd);
-		ret = 0;
+	// TODO: check if git config "lfs.native-cache" is true
+	if (cached_lfs_smudge(src, cmd, &lfsbuf)) {
+		fd = open(lfsbuf.buf, O_RDONLY);
+		if (strbuf_read(&nbuf, fd, len) < 0) {
+			error("reading from cached LFS object '%s' failed", lfsbuf.buf);
+			ret = 0;
+		}
+		if (close(fd)) {
+			error("closing cached LFS object '%s' failed", lfsbuf.buf);
+			ret = 0;
+		}
 	}
-	if (finish_async(&async)) {
-		error("external filter %s failed", cmd);
-		ret = 0;
+	// TODO: check if git config "lfs.native-cache" is true
+	else if (cached_lfs_clean(path, cmd, &lfsbuf)) {
+		strbuf_reset(&nbuf);
+		strbuf_addstr(&nbuf, lfsbuf.buf);
+	} else {
+		memset(&async, 0, sizeof(async));
+		async.proc = filter_buffer_or_fd;
+		async.data = &params;
+		async.out = -1;
+		params.src = src;
+		params.size = len;
+		params.fd = fd;
+		params.cmd = cmd;
+		params.path = path;
+
+		fflush(NULL);
+		if (start_async(&async))
+			return 0;	/* error was already reported */
+
+		if (strbuf_read(&nbuf, async.out, len) < 0) {
+			error("read from external filter %s failed", cmd);
+			ret = 0;
+		}
+		if (close(async.out)) {
+			error("read from external filter %s failed", cmd);
+			ret = 0;
+		}
+		if (finish_async(&async)) {
+			error("external filter %s failed", cmd);
+			ret = 0;
+		}
 	}
 	if (ret) {
 		strbuf_swap(dst, &nbuf);
 	}
 	strbuf_release(&nbuf);
+	strbuf_release(&lfsbuf);
 	return ret;
 }
diff --git a/csum-file.c b/csum-file.c
index a172199..2ca4d7f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -11,6 +11,37 @@
 #include "progress.h"
 #include "csum-file.h"
 
+void sha256fd(const char *name, unsigned char *sha256, ssize_t *file_size)
+{
+	int fd;
+	struct stat st;
+	fd = open(name, O_RDONLY);
+	if (fd < 0)
+		die_errno("unable to open '%s'", name);
+	fstat(fd, &st);
+	size_t size = xsize_t(st.st_size);
+	*file_size = size;
+
+	SHA256_CTX ctx;
+	SHA256_Init(&ctx);
+	unsigned char fd_buffer[8192];
+
+	while (size > 0) {
+		ssize_t rsize = size < sizeof(fd_buffer) ? size : sizeof(fd_buffer);
+		ssize_t ret = read_in_full(fd, fd_buffer, rsize);
+
+		if (ret < 0)
+			die_errno("%s: sha256 file read error", name);
+		if (ret != rsize)
+			die("failed to read %d bytes from '%s'", (int)rsize, name);
+		SHA256_Update(&ctx, fd_buffer, rsize);
+		size -= rsize;
+	}
+
+	SHA256_Final(sha256, &ctx);
+	close(fd);
+}
+
 static void flush(struct sha1file *f, const void *buf, unsigned int count)
 {
 	if (0 <= f->check_fd && count) {
diff --git a/csum-file.h b/csum-file.h
index 7530927..bad9262 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -39,6 +39,8 @@ extern void sha1flush(struct sha1file *f);
 extern void crc32_begin(struct sha1file *);
 extern uint32_t crc32_end(struct sha1file *);
 
+extern void sha256fd(const char *name, unsigned char *sha256, ssize_t *file_size);
+
 static inline void sha1write_u8(struct sha1file *f, uint8_t data)
 {
 	sha1write(f, &data, sizeof(data));
diff --git a/hex.c b/hex.c
index 0519f85..73e2077 100644
--- a/hex.c
+++ b/hex.c
@@ -88,3 +88,19 @@ char *oid_to_hex(const struct object_id *oid)
 {
 	return sha1_to_hex(oid->hash);
 }
+
+char *sha256_to_hex_r(char *buffer, const unsigned char *sha1)
+{
+	static const char hex[] = "0123456789abcdef";
+	char *buf = buffer;
+	int i;
+
+	for (i = 0; i < 32; i++) {
+		unsigned int val = *sha1++;
+		*buf++ = hex[val >> 4];
+		*buf++ = hex[val & 0xf];
+	}
+	*buf = '\0';
+
+	return buffer;
+}
--
2.5.1
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html