This session was led by Derrick Stolee. Supporting cast: Jonathan "jrnieder" Nieder, Elijah Newren, Jeff Hostetler, Jeff "Peff" King, Johannes "Dscho" Schindelin, Ævar Arnfjörð Bjarmason, Emily Shaffer, Victoria Dye, brian m. carlson, and CB Bailey.

Notes:

1. Cone mode has stabilized
2. jrnieder: would sparse index without cone mode support be welcome?
    1. Stolee: you’re welcome to try ;-)
    2. Elijah: main theme: performance. Cone mode allows reasonable performance due to fewer rules to check
    3. Stolee: directory-level lookups mean lookups can have sublinear cost, since you can skip sparse rules (no need to check them in order to figure out whether a file is excluded)
3. Elijah: interested in “sparse clones”, i.e. clones that download everything related to a specified cone
    1. Would be nice not having to download extra objects when already having specified a cone of interest
    2. Jeff: the original partial clone had code to restrict to a cone
    3. Peff: we still have the code, but turned it off, you can have bitmaps with that (too heavy on the server)
    4. Stolee: also, how can the cone be updated if things change? Never solved that problem
    5. Stolee: but the extra blob downloads turned out not to be too big of a problem
    6. Stolee: got a feature request to restrict git log to the current cone, git grep already does that (thanks Matheus)
    7. Elijah: “git grep” without revision arguments is restricted to worktree, so it respects the sparse checkout. When you pass a revision, though, it searches the whole tree
    8. Many commands want to examine the whole tree, makes sense to figure out the UX (configuration, etc) of them together
    9. Peff: Is diff code on someone’s radar?
    10. Stolee: I’d view that as part of the same story as “git log”, “git log -p”.
    11. Sparse index means we can avoid faulting in trees outside of HEAD, so it helps unlock this
4. Sparse index: Victoria and Lessley are taking lead on the number of commands supporting sparse index
    1. update-index, diff, blame, clean, stash, sparse-checkout itself so far supported only in the Microsoft fork of Git
    2. Enabled by default internally so helps us gather data
    3. Elijah: awesome that you’re working on this, sorry I haven’t been as responsive as I’d like on reviews
    4. I’m interested in “clean” in particular --- isn’t that about untracked files?
    5. Stolee: It uses the index to find what is tracked, want to avoid expanding the in-memory index. If there are files outside the sparse checkout area then it does expand.
5. jrnieder: question about failure modes
    1. When I convert a command, I make sure my code path doesn’t assume the cache array contains all entries. Then I turn off command_requires_full_index. What happens if I missed a spot?
    2. Stolee: I put ensure_full_index() in front of everything that assumes a full index, but if there’s a loop that we missed, there’s no extra protection.
    3. Example: cache-tree was calling itself, invalidating points, segfaulted.
    4. More worrying failure mode would be if commands proceed with bad data. Segfaulting is the good case!
    5. jrnieder is not too worried since we’re pretty far along and soon enough we’ll have converted all commands and these questions would be moot
        1. Stolee: goal isn’t to get 100% coverage, so point of questions being moot isn’t coming soon
        2. jrnieder: Thanks! Okay, I’ll take a look.
    6. http://sweng.the-davies.net/Home/rustys-api-design-manifesto
    7. Stolee is less worried because we have sufficient ensure_full_index calls.
6. One optimization we’re considering: not expanding the full index when anything outside the cone is needed (we’d like to maybe expand just the part that needs expanding)
    1. Elijah: we would still keep cone mode, but it’s a bit weird because the cone mode does not match what we have in the index
    2. Stolee: we might actually not need this
7. Stolee: in the process of this work, found D/F conflict issue, made a test illustrating it
8. Elijah: atomicity
    1. checkout is a non-atomic operation. ^C makes a mess
    2. “git sparse-checkout disable” is non-atomic. Takes a while, people ^C, and the very last step is updating the sparsity files. Leaves the worktree with a bunch of files they don’t need but commands ignore them
    3. We run into problems because then they can check out a different branch, do a bunch of other work, then update the sparse-checkout and it will see these precious files it doesn’t want to overwrite
    4. Should “git status” show them?
    5. Dscho: We could set a flag on disk when you’re about to disable, then if we were interrupted print an error message to get the user to sort things out
    6. Peff: I was going to suggest something similar. FS doesn’t make transactions easy, but we can at least do a rollback (signal handler), not foolproof, but it works pretty well and covers your ^C case.
    7. Stolee: coming in 2.34: sparse-checkout reapply will delete ignored (and tracked?) files. Helps with these leftover files.
    8. Elijah: no current way to get out of that state, thank you for making sparse-checkout reapply do that
    9. Stolee: noticed during experimental release to people from Office. Everything was slow because they had run build and left behind ignored files
    10. jrnieder: Piggy-backing on Dscho’s comment, there’s a database analogy: record intent (in the database case, that’s a transaction journal) before the non-atomic steps that act on that intent. Suggests maybe we should be updating the sparsity pattern before the checkout step
9. That’s it, that’s the status update on what’s currently on the list.
10. We have more plans, though.
11. Idea: use git.git itself
    1. Tried it, but had to have 97% of the files to still be workable
    2. Could change the Makefile to accept that, say, po/ is missing
    3. Ævar: creates a lot of complexity for the build
    4. jrnieder: as VCS provider, what is our recommendation to build authors? Do we want them querying sparse checkout, do we want builds that Just Work in cone mode, do we want to treat sparse checkout as a thing that builds don’t need to support?
    5. Stolee: want build system to be able to tell Git about what needs to be checked out. “In-tree sparse checkout” (see below)
12. Emily: we’re interested in sparse-checkout affecting the set of active submodules, just mentioning this as a heads-up
13. [PATCH 00/10] [RFC] In-tree sparse-checkout definitions - Derrick Stolee via GitGitGadget (https://lore.kernel.org/git/pull.627.git.1588857462.gitgitgadget@xxxxxxxxx/)
14. Victoria: today when you switch gears and work on something else you have to update the sparse checkout pattern
15. Proposal here is to have in-tree sparse checkout definitions, e.g. a .gitdependencies file that lists, for the directories you’re working with, what other subdirectories they depend on
16. That way, you get exactly the folders you need
17. Stolee: Office has their own tool “scoper” that figures out dependencies and runs “git sparse-checkout set” for the user. Is confusing when you rebase and need to remember to run it
18. Currently lives in a hook, custom and built for one engineering system, want to generalize and make a standard feature
19. Victoria: being built into Git would make sense because it’s general enough to work in most monorepo environments.
20. Involves two pieces: having git understand the dependencies and assemble your sparse checkout cone using them, and having the build system maintain and use sparse checkout correctly.
21. Some build setups tolerate missing directories reasonably well. If we make .gitdependencies more of a first-class concept then we could go further and make build systems handle missing directories as something that would be expected
22. C# .proj files link to dependencies on other .proj files with relative path. But in a solution file collecting all .proj files, it lists all of them and you need to have them all present. If a subdirectory isn’t present, proposal is to build what is there instead of everything.
23. Tried another prototype on how to do this in Bazel. It has a rigorous definition of inputs and outputs, and based on that you could translate to a .gitdependencies file or sparse-checkout pattern.
24. Microsoft’s BuildXL has similar properties
25. Victoria asks: how general is the above?
26. brian: Many monorepos have multiple microservices. A cone can represent what a particular service needs to run.
27. If you’re building one coherent product like Windows, you’re going to need some prebuilt artifacts that you pull down.
28. jrnieder: Large monorepos often have strong remote build. Not everything you depend on is things that you need to have in source form locally
29. CB: My team at Bloomberg has a teamwide “monorepo” (not Bloomberg-wide). We’re cmake based. Sparse checkout would be interesting for us. We’re experimenting with what’s called workspace builds: you have a thing you can build (a subdirectory), that you pull into the toplevel CMakeLists.txt as a single thing.
30. With cmake you can declare a dependency with target_link_libraries. A dependency name can either be a cmake defined target in the codebase you’re building, or it can be a pre-built library pulled in another way, e.g. importing via a pkg-config file.
31. At build time if I decide I want to change that library, I’ll expand my sparse-checkout region, and rerun cmake to have it understand the newly available source.
32. Optionality: I don’t have to have that source checked out, but when it’s present I want to use it.
33. Victoria: sounds like in-tree sparse checkout is more of an intermediate step. Sometimes you want the source, sometimes you want to pull in an external artifact.
34. Elijah: we have a monorepo, about the size of the Linux kernel. Multiple separate services, interconnected pieces. Using sparse-checkout required some code changes, refactoring that wasn’t just around the build system. We created a tool before the sparse-checkout command existed, using older mechanisms, and then switched to sparse-checkout when it came out. We track our dependencies ourselves --- you need this set of modules (3 or 4) or the modules relevant to a particular team, and it then computes the relevant directories to get. We had to make some changes to adopt cone mode but I like it and the changes it led to. Then you run the build system --- you have files that declare the dependencies, are they newer than .git/info/sparse-checkout? If not then recompute them again.
35. Potentially would want to rerun the dependency generation after you run a rebase as well…
36. If we track it in-tree, there are some interesting cases we’ll run into (merge conflicts on this generated file).
37. Also, tracking dependencies in two places can result in difficulty, skew. Maybe can generate one from the other.
38. Our sparse checkout tends to be build oriented “what do I need for this build”. But testing inverts the dependency graph, want to see what tests depend on this code. We encourage them to test in the cloud but not everyone does that, leads fewer people to use sparse checkout.
39. There’s some remote build, mixing-and-matching pieces built remotely and locally.
40. Part of working in a monorepo is you need strong tool hygiene enforcement. Without that, you get a ball of mud of dependencies. Adopting sparse checkout drove modularity.
41. Ævar: I’d be interested in a summary
42. Git’s lack of support for sparse checkout was unusual, so I think this topic is well explored by previous version control systems
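
Editorial illustration (not from the session): the in-tree definitions idea, and the dependency-tracking tools Stolee and Elijah describe, come down to expanding a set of wanted directories into everything they transitively depend on, then handing that list to "git sparse-checkout set". The notes do not define a .gitdependencies format, so the sketch below assumes the dependency data has already been parsed into a plain mapping. The function name resolve_cone, the DEPS table, and all directory names are hypothetical.

```python
from collections import deque


def resolve_cone(wanted, deps):
    """Return the sorted directories reachable from `wanted` via `deps`.

    `deps` maps a directory to the directories it depends on. This is an
    assumed in-memory form, not a real .gitdependencies parser.
    """
    seen = set()
    queue = deque(wanted)
    while queue:
        directory = queue.popleft()
        if directory in seen:
            continue  # already expanded; also guards against cycles
        seen.add(directory)
        queue.extend(deps.get(directory, ()))
    return sorted(seen)


# Hypothetical dependency data for a small monorepo.
DEPS = {
    "services/web": ["libs/http", "libs/auth"],
    "services/batch": ["libs/crypto"],
    "libs/auth": ["libs/crypto"],
    "libs/http": [],
    "libs/crypto": [],
}

cone = resolve_cone(["services/web"], DEPS)
# Print the command rather than running it, so the sketch has no side effects.
print("git sparse-checkout set " + " ".join(cone))
# prints: git sparse-checkout set libs/auth libs/crypto libs/http services/web
```

A tool along these lines would have to rerun after anything that changes the dependency data, such as a rebase, which is the pain point raised in items 17 and 35.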