[Summit topic] Sparse checkout behavior and plans

This session was led by Derrick Stolee. Supporting cast: Jonathan
"jrnieder" Nieder, Elijah Newren, Jeff Hostetler, Jeff "Peff" King,
Johannes "Dscho" Schindelin, Ævar Arnfjörð Bjarmason, Emily Shaffer,
Victoria Dye, brian m. carlson, and CB Bailey.

Notes:

 1.  Cone mode has stabilized

 2.  jrnieder: would sparse index without cone mode support be welcome?

     1. Stolee: you’re welcome to try ;-)

     2. Elijah: main theme: performance. Cone mode allows reasonable
        performance due to fewer rules to check

     3. Stolee: directory-level lookups mean lookups can have sublinear cost,
        since you can skip sparse rules (no need to check them in order to
        figure out whether a file is excluded)
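The directory-level lookup Stolee describes can be illustrated with a small model. This is a sketch under simplifying assumptions, not Git's implementation: the helper names `cone_sets` and `in_cone` are hypothetical, and the cone is modeled as two sets (directories included recursively, and ancestor directories whose immediate files are included). A path is then classified with one set lookup per path component, regardless of how many directories the cone lists.

```python
from pathlib import PurePosixPath


def cone_sets(directories):
    """Build two lookup sets from a list of cone directories (hypothetical model)."""
    recursive = set()  # directories included with everything beneath them
    parents = {"."}    # ancestors whose immediate files are included; "." is the root
    for d in directories:
        p = PurePosixPath(d)
        recursive.add(str(p))
        for ancestor in p.parents:
            parents.add(str(ancestor))
    return recursive, parents


def in_cone(path, recursive, parents):
    """Classify a file path using set lookups only; no ordered rule scan."""
    p = PurePosixPath(path)
    if str(p.parent) in parents:
        return True  # file sits directly inside an ancestor of a cone directory
    # Otherwise the file is included only if some ancestor is a cone directory.
    return any(str(a) in recursive for a in p.parents)


if __name__ == "__main__":
    recursive, parents = cone_sets(["src/lib"])
    for path in ["README", "src/main.c", "src/lib/x/y.c", "docs/a.txt"]:
        print(path, in_cone(path, recursive, parents))
```

The cost of `in_cone` depends on the depth of the path, not on the number of rules, which is the property that a general ordered pattern list cannot offer.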

 3.  Elijah: interested in “sparse clones”, i.e. clones that download
     everything related to a specified cone

     1.  Would be nice not having to download extra objects when already having
         specified a cone of interest

     2.  Jeff: the original partial clone had code to restrict to a cone

     3.  Peff: we still have the code, but turned it off, you can’t have bitmaps
         with that (too heavy on the server)

     4.  Stolee: also, how can the cone be updated if things change? Never
         solved that problem

     5.  Stolee: but the extra blob downloads turned out not to be too big of a
         problem

     6.  Stolee: got a feature request to restrict git log to the current cone,
         git grep already does that (thanks Matheus)

     7.  Elijah: “git grep” without revision arguments is restricted to
         worktree, so it respects the sparse checkout. When you pass a
         revision, though, it searches the whole tree

     8.  Many commands want to examine the whole tree, makes sense to figure
         out the UX (configuration, etc) of them together

     9.  Peff: Is diff code on someone’s radar?

     10. Stolee: I’d view that as part of the same story as “git log”, “git log
         -p”.

     11. Sparse index means we can avoid faulting in trees outside of HEAD, so
         it helps unlock this

 4.  Sparse index: Victoria and Lessley are taking the lead on increasing the
     number of commands that support the sparse index

     1. update-index, diff, blame, clean, stash, and sparse-checkout itself are
        so far supported only in the Microsoft fork of Git

     2. Enabled by default internally so helps us gather data

     3. Elijah: awesome that you’re working on this, sorry I haven’t been as
        responsive as I’d like on reviews

     4. I’m interested in “clean” in particular --- isn’t that about untracked
        files?

     5. Stolee: It uses the index to find what is tracked, want to avoid
        expanding the in-memory index. If there are files outside the sparse
        checkout area then it does expand.

 5.  jrnieder: question about failure modes

     1. When I convert a command, I make sure my code path doesn’t assume the
        cache array contains all entries. Then I turn off
        command_requires_full_index. What happens if I missed a spot?

     2. Stolee: I put ensure_full_index() in front of everything that assumes a
        full index, but if there’s a loop that we missed, there’s no extra
        protection.

     3. Example: cache-tree was calling itself, invalidating points,
        segfaulted.

     4. More worrying failure mode would be if commands proceed with bad data.
        Segfaulting is the good case!

     5. jrnieder is not too worried since we’re pretty far along and soon
        enough we’ll have converted all commands and these questions would be
        moot

        1. Stolee: the goal isn’t to get 100% coverage, so the point where
           these questions become moot isn’t coming soon

        2. jrnieder: Thanks! Okay, I’ll take a look.

     6. http://sweng.the-davies.net/Home/rustys-api-design-manifesto

     7. Stolee is less worried because we have sufficient ensure_full_index
        calls.

 6.  One optimization we’re considering: not expanding the full index when
     anything outside the cone is needed (we’d like to maybe expand just the
     part that needs expanding)

     1. Elijah: we would still keep cone mode, but it’s a bit weird because the
        cone mode does not match what we have in the index

     2. Stolee: we might actually not need this

 7.  Stolee: in the process of this work, found D/F conflict issue, made a test
     illustrating it

 8.  Elijah: atomicity

     1.  checkout is a non-atomic operation. ^C makes a mess

     2.  “git sparse-checkout disable” is non-atomic. Takes a while, people ^C,
         and the very last step is updating the sparsity files. Leaves the
         worktree with a bunch of files they don’t need but commands ignore
         them

     3.  We run into problems because then they can check out a different
         branch, do a bunch of other work, then update the sparse-checkout and
         it will see these precious files it doesn’t want to overwrite

     4.  Should “git status” show them?

     5.  Dscho: We could set a flag on disk when you’re about to disable, then
         if we were interrupted print an error message to get the user to sort
         things out

     6.  Peff: I was going to suggest something similar. FS doesn’t make
         transactions easy, but we can at least do a rollback (signal handler),
         not foolproof, but it works pretty well and covers your ^C case.

     7.  Stolee: coming in 2.34: sparse-checkout reapply will delete ignored
         (and tracked?) files. Helps with these leftover files.

     8.  Elijah: no current way to get out of that state, thank you for making
         sparse-checkout reapply do that

     9.  Stolee: noticed during experimental release to people from Office.
         Everything was slow because they had run build and left behind ignored
         files

     10. jrnieder: Piggy-backing on Dscho’s comment, there’s a database
         analogy: record intent (in the database case, that’s a transaction
         journal) before the non-atomic steps that act on that intent. Suggests
         maybe we should be updating the sparsity pattern before the checkout
         step

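The "record intent first" idea raised by Dscho and jrnieder can be sketched as follows. This is an illustrative model only, not how Git behaves today: the marker file name `pending-intent.json` and the helpers `run_with_intent` and `pending_intent` are hypothetical. The intent is written durably before any non-atomic step runs and is removed only after all steps finish, so a later command can detect an interrupted operation and tell the user.

```python
import json
import os
import tempfile

MARKER = "pending-intent.json"  # hypothetical file name, chosen for this sketch


def run_with_intent(state_dir, intent, steps):
    """Record `intent` on disk, run the non-atomic steps, then clear the record."""
    marker = os.path.join(state_dir, MARKER)
    with open(marker, "w") as f:
        json.dump(intent, f)
        f.flush()
        os.fsync(f.fileno())  # make the intent durable before acting on it
    for step in steps:
        step()  # an interruption here leaves the marker behind
    os.remove(marker)


def pending_intent(state_dir):
    """Return the intent of an interrupted run, or None if nothing is pending."""
    try:
        with open(os.path.join(state_dir, MARKER)) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    state = tempfile.mkdtemp()

    def interrupted():
        raise KeyboardInterrupt  # stands in for the user pressing ^C

    try:
        run_with_intent(state, {"op": "disable"}, [interrupted])
    except KeyboardInterrupt:
        pass
    print("pending:", pending_intent(state))
```

A signal-handler rollback, as Peff suggests, could be layered on top of this; the marker covers the cases where the handler never gets to run.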
 9.  That’s it, that’s the status update on what’s currently on the list.

 10. We have more plans, though.

 11. Idea: use git.git itself

     1. Tried it, but had to keep 97% of the files for it to still be workable

     2. Could change the Makefile to accept that, say, po/ is missing

     3. Ævar: creates a lot of complexity for the build

     4. jrnieder: as VCS provider, what is our recommendation to build authors?
        Do we want them querying sparse checkout, do we want builds that Just
        Work in cone mode, do we want to treat sparse checkout as a thing that
        builds don’t need to support?

     5. Stolee: want build system to be able to tell Git about what needs to be
        checked out. “In-tree sparse checkout” (see below)

 12. Emily: we’re interested in sparse-checkout affecting the set of active
     submodules, just mentioning this as a heads-up

 13. [PATCH 00/10] [RFC] In-tree sparse-checkout definitions - Derrick Stolee
     via GitGitGadget
     (https://lore.kernel.org/git/pull.627.git.1588857462.gitgitgadget@xxxxxxxxx/)

 14. Victoria: today when you switch gears and work on something else you have
     to update the sparse checkout pattern

 15. Proposal here is to have in-tree sparse checkout definitions, e.g. a
     .gitdependencies file that lists, for the directories you’re working with,
     what other subdirectories they depend on

 16. That way, you get exactly the folders you need

 17. Stolee: Office has their own tool “scoper” that figures out dependencies
     and runs “git sparse-checkout set” for the user. It is confusing when you
     rebase and need to remember to run it

 18. Currently lives in a hook, custom and built for one engineering system,
     want to generalize and make a standard feature

 19. Victoria: being built in to Git would make sense because it’s general
     enough to work in most monorepo environments.

 20. Involves two pieces: having git understand the dependencies and assemble
     your sparse checkout cone using them, and having the build system maintain
     and use sparse checkout correctly.

 21. Some build setups tolerate missing directories reasonably well. If we make
     .gitdependencies more of a first-class concept then we could go further
     and make build systems handle missing directories as something that would
     be expected

 22. C# .proj files link to dependencies on other .proj files with relative
     path. But in a solution file collecting all .proj files, it lists all of
     them and you need to have them all present. If a subdirectory isn’t
     present, proposal is to build what is there instead of everything.

 23. Tried another prototype on how to do this in Bazel. It has a rigorous
     definition of inputs and outputs, and based on that you could translate to
     a .gitdependencies file or sparse-checkout pattern.

 24. Microsoft’s BuildXL has similar properties

 25. Victoria asks: how general is the above?

 26. brian: Many monorepos have multiple microservices. A cone can represent
     what a particular service needs to run.

 27. If you’re building one coherent product like Windows, you’re going to need
     some prebuilt artifacts that you pull down.

 28. jrnieder: Large monorepos often have strong remote build. Not everything
     you depend on is things that you need to have in source form locally

 29. CB: My team at Bloomberg has a teamwide “monorepo” (not Bloomberg-wide).
     We’re cmake based. Sparse checkout would be interesting for us. We’re
     experimenting with what’s called workspace builds: you have a thing you
     can build (a subdirectory), that you pull into the toplevel CMakeLists.txt
     as a single thing.

 30. With cmake you can declare a dependency with target_link_libraries. A
     dependency name can either be a cmake-defined target in the codebase
     you’re building, or it can be a pre-built library pulled in another
     way, e.g. importing via a pkg-config file.

 31. At build time if I decide I want to change that library, I’ll expand my
     sparse-checkout region, and rerun cmake to have it understand the newly
     available source.

 32. Optionality: I don’t have to have that source checked out, but when it’s
     present I want to use it.

 33. Victoria: sounds like in-tree sparse checkout is more of an intermediate
     step. Sometimes you want the source, sometimes you want to pull in an
     external artifact.

 34. Elijah: we have a monorepo, about the size of the Linux kernel. Multiple
     separate services, interconnected pieces. Using sparse-checkout required
     some code changes, refactoring that wasn’t just around the build system.
     We created a tool before the sparse-checkout command existed, using older
     mechanisms, and then switched to sparse-checkout when it came out. We
     track our dependencies ourselves --- you need this set of modules (3 or 4)
     or the modules relevant to a particular team, and it then computes the
     relevant directories to get. We had to make some changes to adopt cone
     mode but I like it and the changes it led to. Then you run the build
     system --- you have files that declare the dependencies, are they newer
     than .git/info/sparse-checkout? If not then recompute them again.

 35. Potentially would want to rerun the dependency generation after you run a
     rebase as well…

 36. If we track it in-tree, there are some interesting cases we’ll run into
     (merge conflicts on this generated file).

 37. Also, tracking dependencies in two places can result in difficulty, skew.
     Maybe can generate one from the other.

 38. Our sparse checkout tends to be build oriented: “what do I need for this
     build”. But testing inverts the dependency graph: you want to see what
     tests depend on this code. We encourage people to test in the cloud, but
     not everyone does that, which leads to fewer people using sparse checkout.

 39. There’s some remote build, mixing-and-matching pieces built remotely and
     locally.

 40. Part of working in a monorepo is you need strong tool hygiene enforcement.
     Without that, you get a ball of mud of dependencies. Adopting sparse
     checkout drove modularity.

 41. Ævar: I’d be interested in a summary

 42. Git’s lack of support for sparse checkout was unusual, so I think this
     topic is well explored by previous version control systems
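The in-tree definitions idea from items 15 and 20 (and the dependency tracking Elijah describes in item 34) amounts to taking a transitive closure over a directory dependency graph. The sketch below shows that step only. It assumes the dependency data has already been parsed into a dictionary; the notes do not specify a .gitdependencies format, so the `deps` mapping, the directory names, and the `sparse_cone` helper are all hypothetical.

```python
from collections import deque


def sparse_cone(deps, working_on):
    """Return the sorted set of directories needed for `working_on`.

    `deps` maps a directory to the directories it depends on directly.
    Dependencies are followed transitively; cycles are handled because
    each directory is queued at most once.
    """
    needed = set(working_on)
    queue = deque(working_on)
    while queue:
        current = queue.popleft()
        for dep in deps.get(current, ()):
            if dep not in needed:
                needed.add(dep)
                queue.append(dep)
    return sorted(needed)


if __name__ == "__main__":
    # Hypothetical dependency data for illustration only.
    deps = {
        "services/web": ["libs/http", "libs/auth"],
        "libs/auth": ["libs/crypto"],
        "libs/http": [],
        "services/batch": ["libs/crypto"],
    }
    print(sparse_cone(deps, ["services/web"]))
    # prints ['libs/auth', 'libs/crypto', 'libs/http', 'services/web']
```

The resulting directory list is the kind of input a tool like the one Stolee describes could hand to “git sparse-checkout set”; rerunning the computation after a rebase would pick up dependency changes.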
