[ANNOUNCE] Scalar

Derrick Stolee <stolee@xxxxxxxxx> · Wed, 12 Feb 2020 08:51:34 -0500

Hello, Git contributors!

Today, we (the Git Ecosystem team at Microsoft) announced our latest
project: Scalar [1]. Scalar is a .NET Core application with installers
available for Windows and macOS. We are considering a Linux port [2].

[1] https://github.com/microsoft/scalar
[2] https://github.com/microsoft/scalar/issues/323

Scalar helps manage Git repositories from the client side. It enables
optional features like `fsmonitor` using some config settings and hooks,
then performs background operations to reduce foreground computations.
These work on any Git repository, but we also have a mode to create a
repository using the GVFS protocol which only works with Azure Repos.
That's how we plan to support the Microsoft Office monorepo. If you want
to read more about Scalar, we put out a blog post about it [3].

[3] https://devblogs.microsoft.com/devops/introducing-scalar/

However, we think that there are more repositories that just need an extra
boost from Scalar. Running "scalar register" in a Git repository will
alert Scalar to start watching it and keep it maintained. We hope that
this mode of Scalar is short-lived: there are a few features that we could
contribute to Git to make this use of Scalar somewhat obsolete.

In this message, I want to share the important ways we are using existing
Git features and hope to contribute more to Git in the future. Most of
these will be suggested as topics for the contributors summit on March 5th.

Current Git features used by Scalar
-----------------------------------

We've been using the following features as critical components.

Sparse-checkout:
The "scalar clone" command creates a repo using the GVFS protocol, but
also uses "git sparse-checkout init --cone" to initialize the repo in cone
mode. Users then set the directories they need, which triggers
pack-file downloads of the missing blobs at HEAD. This was the major
motivation for our recent work on the sparse-checkout builtin and its
performance. Before cone mode, the Office repository could not function with
its 3 million index entries and ~1,200 sparse-checkout patterns. Without
cone mode, it takes 2,800 seconds to update the cache entries compared to
1-2 seconds in cone mode.
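
A minimal sketch of the cone-mode setup described above. The function
name and the example directories are made up for illustration; this is
not Scalar's own code, only the two sparse-checkout commands it relies on.

```shell
# Hypothetical helper: switch the current repository to cone mode and
# restrict the working directory to the directories given as arguments.
enable_cone_checkout() {
    # Cone mode matches whole directories instead of arbitrary patterns.
    git sparse-checkout init --cone &&
    # Each argument is a directory path relative to the repository root.
    git sparse-checkout set "$@"
}
```

For example, "enable_cone_checkout client/app shared/libs" would leave
only those two directories (plus files at the root) populated.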

Filesystem monitor:
Kevin Willford has been contributing stability updates to the fsmonitor
feature due in part to how we now depend on it. VFS for Git used its own
version of the fsmonitor hook, and it has a connection to a filesystem
driver to get every filesystem event. Tools like Watchman perform similar
tasks, but not quite to the same precision. Kevin's work to update the
hook to v2 helps eliminate race conditions by using Watchman's token
instead of a numerical timestamp.
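
As a rough sketch, enabling a hook-based filesystem monitor comes down
to two config settings. This assumes a Git version that includes the v2
hook work (which adds `core.fsmonitorHookVersion`); the function name
and the hook location in the comment are illustrative assumptions.

```shell
# Hypothetical helper: point Git at an fsmonitor hook and request the
# v2 interface, which hands the hook an opaque token rather than a
# numerical timestamp.
enable_fsmonitor_hook() {
    hook_path=$1    # e.g. .git/hooks/query-watchman (assumed location)
    git config core.fsmonitor "$hook_path" &&
    git config core.fsmonitorHookVersion 2
}
```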

Partial clone:
The Git client can now handle missing reachable objects! This is huge. The
server-side support is still lacking, especially for Azure Repos, so we
still need to use the GVFS protocol for now. However, Jeff Hostetler
created the `git-gvfs-helper` [4] which is a native process that speaks
the GVFS protocol (a set of REST APIs with `gvfs/` in the route). He
inserted logic to use that instead of the HTTPS transport in Git, and
hooked into the logic for partial clone. The logic for batching the
missing objects into a single pack download when using partial clone can
now speak the GVFS protocol to find those objects. While we don't expect
the Git community to be interested in such a tool upstream, we do expect
that we will find ways to improve our use of it by modifying the partial
clone logic, and those we can contribute upstream to help everyone!

[4] https://github.com/microsoft/git/pull/191

Potential Git features to replace Scalar
----------------------------------------

Parallel checkout:
VFS for Git could get away with a lot by faking that the working directory
was updated. When using actual file updates, we cannot rely on that
behavior, so we need to improve the throughput of workdir updates. Jeff
Hostetler is currently building a parallel version of 'git checkout' and
will have more concrete things to say about it at the summit.

Git-aware/Git-native filesystem monitor:
We are using the 'fsmonitor' hook with Facebook's Watchman [5] tool. While
this is mostly serving our needs, it is slow to start up and slow after
Git changes lots of files in the working directory. It also fails to work
efficiently when directories are excluded by the `.gitignore` file. Kevin
Willford and Johannes Schindelin are working on building a filesystem
watcher as part of Git, or at least very close to Git. They are working
right now on a Windows version since we believe the largest gains can be
found there.

[5] https://github.com/facebook/watchman

`fetch-object` URLs:
VFS for Git and the GVFS protocol have a notion of "cache server" that
provides a different place to acquire objects than the origin remote. This
allows geo-distributed servers to provide Git objects at lower latency and
higher throughput than any one server could do. I believe that we could
provide this functionality in Git by extending the existing `fetch` and
`push` URLs for remotes with a new `fetch-object` URL. We would still use
the `fetch` URL to acquire refs (there are too many race conditions for
the cache servers to duplicate that endpoint) but the `upload-pack`
request would go to the `fetch-object` URL. I hope to discuss this at the
summit, and I will be working on a prototype before then.
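
To make the idea concrete, here is a sketch of how the object request
could be routed. The config key `remote.<name>.fetchObjectUrl` is purely
hypothetical (nothing like it exists in Git today), as is the function
name; only `git config --get` is real.

```shell
# Hypothetical: choose where the object request for a remote would go.
# Refs would still be read from remote.<name>.url as they are today.
pick_object_url() {
    remote=$1
    # Prefer the (hypothetical) cache-server URL and fall back to the
    # ordinary remote URL when none is configured.
    git config --get "remote.$remote.fetchObjectUrl" ||
    git config --get "remote.$remote.url"
}
```

The fallback means a repository without a cache server would behave
exactly as it does now.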

Background maintenance:
Git relies on auto-GC for most of its maintenance. An expert user could
determine what alternate maintenance they want and create cron jobs or
schedule background maintenance by other means. We need this to be as
painless as possible for users who don't want to design their own system.
We built Scalar with our opinionated mechanisms for repository
maintenance:

 * Fetch in the background to reduce object transfer in a foreground
   fetch. Background fetch was recently discussed on-list [6]. We use an
   alternate refspec to create refs in refs/scalar/hidden/<remote>/.

[6] https://lore.kernel.org/git/pull.532.v2.git.1579570692766.gitgitgadget@xxxxxxxxx/

 * We disable 'fetch.writeCommitGraph' so foreground fetches do not write
   a commit-graph file, but instead update the commit-graph in the
   background. By using the --reachable option, we gather the latest
   commits from the background fetch. We write the commit-graph
   incrementally using --size-multiple=4.

 * Clean up loose objects non-destructively. This is less of an
   issue on non-Windows platforms, but it is important that we do not try
   to delete a file that a concurrent Git process could have a handle to,
   or would try to open. For that reason, we perform the following steps
   to clean up loose objects:

   1. git prune-packed
   2. git pack-objects <loose-objects-batch

   where "loose-objects-batch" is a stream of loose objects we find from
   scanning the loose object directories. By running in this order, we
   only delete loose objects that were previously in a pack-file, but not
   the objects that we just put in a pack-file. We run this pair of steps
   once every 24 hours, which is enough time to expect all Git processes
   started before one step to end before the start of the next.

 * Clean up pack-files non-destructively using a similar pair of steps:

   1. git multi-pack-index expire
   2. git multi-pack-index repack --batch-size=X

   By using the multi-pack-index to expire packs with no referenced
   objects, we can be sure that no Git process will attempt to read that
   pack. By repacking a batch of pack-files (our default --batch-size is
   2GB), we can still collect a large number of small packs into a small
   number of larger packs. While this is overall less space-efficient than
   repacking carefully into a single pack-file, the enormous repositories
   would take too long to repack that way and take too many user resources
   in the process.
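
The four steps above can be sketched as shell functions. This is an
approximation under stated assumptions, not Scalar's implementation:
the function names, the "origin" default, the empty --refmap, the
--split flag (Git's way of writing an incremental commit-graph), the
extra "multi-pack-index write", and SHA-1 object IDs are all choices
made for this example.

```shell
# Sketch only: function names, the "origin" default and the "pack" base
# name are illustrative assumptions, not Scalar's implementation.

# Fetch into hidden refs. The empty --refmap stops the configured
# refspec from also updating refs/remotes/<remote>/.
background_fetch() {
    remote=${1:-origin}
    git fetch --refmap= --no-tags "$remote" \
        "+refs/heads/*:refs/scalar/hidden/$remote/*"
}

# Stop foreground fetches from writing the commit-graph, then extend it
# incrementally (--split) from every ref, including the hidden ones.
update_commit_graph() {
    git config fetch.writeCommitGraph false &&
    git commit-graph write --reachable --split --size-multiple=4
}

# Step 1 deletes loose objects that already exist in a pack; step 2 packs
# whatever loose objects remain and leaves their files for the next run.
pack_loose_objects() {
    objdir=$(git rev-parse --git-path objects) || return
    git prune-packed || return
    # A loose object is stored as <objdir>/xx/<38 hex digits> (SHA-1
    # assumed); joining both parts yields the object ID.
    (cd "$objdir" &&
        find . -type f -path './[0-9a-f][0-9a-f]/*' |
        sed -e 's|^\./||' -e 's|/||' |
        grep -E '^[0-9a-f]{40}$') |
        git pack-objects -q "$objdir/pack/pack"
}

# Expire packs the multi-pack-index no longer references, then combine
# small packs. The "write" step is an added assumption so that a
# multi-pack-index exists; 2147483648 bytes is the 2GB default above.
repack_packs() {
    batch_size=${1:-2147483648}
    git multi-pack-index write &&
    git multi-pack-index expire &&
    git multi-pack-index repack --batch-size="$batch_size"
}
```

Nothing here deletes a file in the same run that replaces it, which is
the property the 24-hour spacing depends on.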

While I expect that the Git community will have different opinions on which
background maintenance steps are valuable, I do think it is worth
considering whether background maintenance is viable at all. This is the
part that seems so different from the expected model for Git, so I'd
be interested in how we could achieve similar results in core Git. Of
course, such maintenance would be highly configurable in both type of
maintenance and frequency.

Thanks for your time. I look forward to any discussions this may start.

Thanks,
-Stolee