Hello, Git contributors!

Today, we (the Git Ecosystem team at Microsoft) announced our latest project: Scalar [1]. Scalar is a .NET Core application with installers available for Windows and macOS. We are considering a Linux port [2].

[1] https://github.com/microsoft/scalar
[2] https://github.com/microsoft/scalar/issues/323

Scalar helps manage Git repositories from the client side. It enables optional features like `fsmonitor` using some config settings and hooks, then performs background operations to reduce foreground computations. These work on any Git repository, but we also have a mode to create a repository using the GVFS protocol which only works with Azure Repos. That's how we plan to support the Microsoft Office monorepo.

If you want to read more about Scalar, we put out a blog post about it [3].

[3] https://devblogs.microsoft.com/devops/introducing-scalar/

However, we think that there are more repositories that just need an extra boost from Scalar. Running "scalar register" in a Git repository will alert Scalar to start watching it and keep it maintained. We hope that this mode of Scalar is short-lived: there are a few features that we could contribute to Git to make this use of Scalar somewhat obsolete.

In this message, I want to share the important ways we are using existing Git features and hope to contribute more to Git in the future. Most of these will be suggested as topics for the contributors summit on March 5th.

Current Git features used by Scalar
-----------------------------------

We've been using the following features as critical components.

Sparse-checkout: The "scalar clone" command creates a repo using the GVFS protocol, but also uses "git sparse-checkout init --cone" to initialize the repo in cone mode. Users then set the directories they need, which triggers pack-file downloads of the missing blobs at HEAD. This was the major motivation for our interest in the sparse-checkout builtin and performance recently.
Before cone mode, the Office repository could not function with its 3 million index entries and ~1,200 sparse-checkout patterns. Without cone mode, it takes 2,800 seconds to update the cache entries compared to 1-2 seconds in cone mode.

Filesystem monitor: Kevin Willford has been contributing stability updates to the fsmonitor feature due in part to how we now depend on it. VFS for Git used its own version of the fsmonitor hook, and it has a connection to a filesystem driver to get every filesystem event. Tools like Watchman perform similar tasks, but not quite to the same precision. Kevin's work to update the hook to v2 helps eliminate race conditions by using Watchman's token instead of a numerical timestamp.

Partial clone: The Git client can now handle missing reachable objects! This is huge. The service-side support is still lacking, especially for Azure Repos, so we still need to use the GVFS protocol for now. However, Jeff Hostetler created the `git-gvfs-helper` [4] which is a native process that speaks the GVFS protocol (a set of REST APIs with `gvfs/` in the route). He inserted logic to use that instead of the HTTPS transport in Git, and hooked into the logic for partial clone. The logic for batching the missing objects into a single pack download when using partial clone can now speak the GVFS protocol to find those objects. While we don't expect the Git community to be interested in such a tool upstream, we do expect that we will find ways to improve our use of it by modifying the partial clone logic, and those we can contribute upstream to help everyone!

[4] https://github.com/microsoft/git/pull/191

Potential Git features to replace Scalar
----------------------------------------

Parallel checkout: VFS for Git could get away with a lot by faking that the working directory was updated. When using actual file updates, we cannot rely on that behavior, so we need to improve the throughput of workdir updates.
Jeff Hostetler is currently building a parallel version of 'git checkout' and will have more concrete things to say about it at the summit.

Git-aware/Git-native filesystem monitor: We are using the 'fsmonitor' hook with Facebook's Watchman [5] tool. While this is mostly serving our needs, it is slow to start up and slow after Git changes lots of files in the working directory. It also fails to work efficiently when directories are excluded by the `.gitignore` file. Kevin Willford and Johannes Schindelin are working on building a filesystem watcher as part of Git, or at least very close to Git. They are working right now on a Windows version since we believe the largest gains can be found there.

[5] https://github.com/facebook/watchman

`fetch-object` URLs: VFS for Git and the GVFS protocol have a notion of "cache server" that provides a different place to acquire objects than the origin remote. This allows geo-distributed servers to provide Git objects at lower latency and higher throughput than any one server could do. I believe that we could provide this functionality in Git by extending the existing `fetch` and `push` URLs for remotes with a new `fetch-object` URL. We would still use the `fetch` URL to acquire refs (there are too many race conditions for the cache servers to duplicate that endpoint) but the `upload-pack` request would go to the `fetch-object` URL. I hope to discuss this at the summit, and I will be working on a prototype before then.

Background maintenance: Git relies on auto-GC for most of its maintenance. An expert user could determine what alternate maintenance they want and create cron jobs or schedule background maintenance by other means. We need this to be as painless as possible for users who don't want to design their own system. We built Scalar with our opinionated mechanisms for repository maintenance:

* Fetch in the background to reduce object transfer in a foreground fetch. Background fetch was recently discussed on-list [6].
  We use an alternate refspec to create refs in refs/scalar/hidden/<remote>/.

  [6] https://lore.kernel.org/git/pull.532.v2.git.1579570692766.gitgitgadget@xxxxxxxxx/

* We disable 'fetch.writeCommitGraph' so foreground fetches do not write a commit-graph file, but instead update the commit-graph in the background. By using the --reachable option, we gather the latest commits from the background fetch. We use the incremental commit-graph using --size-multiple=4.

* Clean up loose objects non-destructively. This is less of an issue on non-Windows platforms, but it is important that we do not try to delete a file that a concurrent Git process could have a handle to, or would try to open. For that reason, we perform the following steps to clean up loose objects:

    1. git prune-packed
    2. git pack-objects <loose-objects-batch

  where "loose-objects-batch" is a stream of loose objects we find from scanning the loose object directories. By running in this order, we only delete loose objects that were previously in a pack-file, but not the objects that we just put in a pack-file. We run this pair of steps once every 24 hours, which is enough time to expect all Git processes started before one step to end before the start of the next.

* Clean up pack-files non-destructively using a similar pair of steps:

    1. git multi-pack-index expire
    2. git multi-pack-index repack --batch-size=X

  By using the multi-pack-index to expire packs with no referenced objects, we can be sure that no Git process will attempt to read that pack. By repacking a batch of pack-files (our default --batch-size is 2GB), we can still collect a large number of small packs into a small number of larger packs. While this is overall less space-efficient than repacking carefully into a single pack-file, the enormous repositories would take too long to repack that way and take too many user resources in the process.
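
To make the two non-destructive cleanup pairs above concrete, here is a minimal Python sketch. It is not Scalar's implementation (Scalar is a .NET Core application). The function names, the `pack/loose` output prefix, the 50,000-object batch limit, the non-bare `.git/objects` layout, and the SHA-1 object-name length are all assumptions made for illustration. The 2GB batch size is the default mentioned above, spelled out in bytes. The pack-file pair also assumes a multi-pack-index already exists in the repository.

```python
import os
import re
import subprocess

FANOUT = re.compile(r"^[0-9a-f]{2}$")   # objects/xx/ fan-out directory
REST = re.compile(r"^[0-9a-f]{38}$")    # remaining hex digits (assumes SHA-1)

def loose_object_ids(objects_dir, limit=50000):
    """Scan the loose object directories; return up to `limit` object IDs."""
    ids = []
    for fanout in sorted(os.listdir(objects_dir)):
        subdir = os.path.join(objects_dir, fanout)
        if not FANOUT.match(fanout) or not os.path.isdir(subdir):
            continue  # skips pack/, info/, and anything else
        for name in sorted(os.listdir(subdir)):
            if REST.match(name):
                ids.append(fanout + name)
                if len(ids) >= limit:
                    return ids
    return ids

def loose_object_commands(objects_dir):
    """Step pair for loose objects: delete already-packed ones, then pack a batch."""
    pack_base = os.path.join(objects_dir, "pack", "loose")  # hypothetical prefix
    return [
        ["git", "prune-packed"],
        ["git", "pack-objects", "-q", pack_base],  # object IDs arrive on stdin
    ]

def pack_file_commands(batch_size=2 * 1024**3):
    """Step pair for pack-files: expire unreferenced packs, then repack a batch."""
    return [
        ["git", "multi-pack-index", "expire"],
        ["git", "multi-pack-index", "repack", f"--batch-size={batch_size}"],
    ]

def run_loose_object_step(repo):
    """Run the loose-object pair once in a non-bare repository at `repo`."""
    repo = os.path.abspath(repo)
    objects_dir = os.path.join(repo, ".git", "objects")
    prune, pack = loose_object_commands(objects_dir)
    # Prune first: only objects packed by an earlier run are deleted.
    subprocess.run(prune, cwd=repo, check=True)
    ids = loose_object_ids(objects_dir)
    if ids:
        subprocess.run(pack, cwd=repo, check=True, text=True,
                       input="\n".join(ids) + "\n", stdout=subprocess.DEVNULL)

def run_pack_file_step(repo):
    """Run the pack-file pair once; requires an existing multi-pack-index."""
    for cmd in pack_file_commands():
        subprocess.run(cmd, cwd=os.path.abspath(repo), check=True)
```

A scheduler (cron, a service, or similar) would call `run_loose_object_step` and `run_pack_file_step` on whatever cadence it chooses, such as the 24-hour interval described above. The ordering inside each pair is the point: the destructive command only ever removes data that a previous run already made redundant.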

While I expect that the Git community will have different opinions on what background maintenance steps we would find valuable, I do think it is worth considering if background maintenance is at all viable. This is the part that seems so much different than the expected model for Git, so I'd be interested in how we could achieve similar results in core Git. Of course, such maintenance would be highly configurable in both type of maintenance and frequency.

Thanks for your time. I look forward to any discussions this may start.

Thanks,
-Stolee