Re: Git Merge contributor summit notes

On Sat, Mar 10 2018, Alex Vandiver jotted:

> It was great to meet some of you in person!  Some notes from the
> Contributor Summit at Git Merge are below.  Taken in haste, so
> my apologies if there are any mis-statements.

Thanks a lot for taking these notes. I've read them over and they're all
accurate per my wetware recollection. Adding some things I remember
about various discussions below where I think it may help to clarify
things a bit.

>  - Alex
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>   "Does anyone think there's a compelling reason for git to exist?"
>     - peff
>
>
> Partial clone (Jeff Hostetler / Jonathan Tan)
> ---------------------------------------------
>  - Request that the server not send everything
>  - Motivated by getting Windows into git
>  - Also by not having to fetch large blobs that are in-tree
>  - Allows client to request a clone that excludes some set of objects, with incomplete packfiles
>  - Decoration on objects that include promise for later on-demand backfill
>  - In `master`, have a way of:
>    - omitting all blobs
>    - omitting large blobs
>    - sparse checkout specification stored on server
>  - Hook in read_object to fetch objects in bulk
>
>  - Future work:
>    - A way to fetch blobsizes for virtual checkouts
>    - Give me new blobs that this tree references relative to now
>    - Omit some subset of trees
>    - Modify other commits to exclude omitted blobs
>    - Protocol v2 may have better verbs for sparse specification, etc
>
> Questions:
>  - Reference server implementation?
>    - In git itself
>    - VSTS does not support
>  - What happens if a commit becomes unreachable?  Does promise still apply?
>    - Probably yes?
>    - If the promise is broken, probably crashes
>    - Can differentiate between promise that was made, and one that wasn't
>    => Demanding commitment from server to never GC seems like a strong promise
>  - Interactions with external object db
>    - promises include bulk fetches, as opposed to external db, which is one-at-a-time
>    - dry-run semantics to determine which objects will be needed
>    - very important for small objects, like commits/trees (not yet in `master`, which only omits blobs)
>    - perhaps for protocol V2
>  - server has to promise more, requires some level of online operation
>    - annotate that only some refs are forever?
>    - requires enabling the "fetch any SHA" flags
>    - rebasing might require now-missing objects?
>      - No, to build on them you must have fetched them
>      - Well, building on someone else's work may mean you don't have all of them
>    - server is less aggressive about GC'ing by keeping "weak references" when there are promises?
>    - hosting requires that you be able to forcibly remove information
>  - being able to know where a reference came from?
>    - as being able to know why an object was needed, for more advanced logic
>  - Does `git grep` attempt to fetch blobs that are deferred?
>    - will always attempt to fetch
>    - one fetch per object, even!
>    - might not be true for sparse checkouts
>    - Maybe limit to skipping "binary files"?
>    - Currently sparse checkout grep "works" because grep defaults to looking at the index, not the commit
>    - Does the above behavior for grepping revisions
>    - Don't yet have a flag to exclude grep on non-fetched objects
>    - Should `git grep -L` die if it can't fetch the file?
>    - Need a config option for "should we die, or try to move on"?
>  - What's the endgame?  Only a few codepaths that are aware, or threaded through everywhere?
>    - Fallback to fetch on demand means there's an almost-reasonable fallback
>    - Better prediction with bulk fetching
>    - Are most commands going to _need_ to be sensitive to it?
>    - GVFS has a caching server in the building
>    - A few git commands have been disabled (see recent mail from Stolee); those are likely candidates for code that needs to be aware of de-hydrated objects
>  - Is there an API to know what objects are actually local?
>    - No external API
>    - GVFS has a REST API
>  - Some way to later ask about files?
>    - "virtualized filesystem"?
>    - hook to say "focus on this world of files"
>    - GVFS writes out your index currently
>  - Will this always require turning off reachability checks?
>    - Possibly
>  - Shallow clones, instead of partial?
>    - Don't download the history, just the objects
>    - More of a protocol V2 property
>    - Having all of the trees/commits make this reasonable
>  - GVFS vs this?
>    - GVFS was a first pass
>    - Now trying to mainstream productize that
>    - Goal is to remove features from GVFS and replace with this

As I understood it, Microsoft deploys this in a mode where they're not
vulnerable to the caveats noted above, i.e. the server serving this up
only has branches that are fast-forwarded (and never deleted).

However, if you were to build history on a server where you're counting
on lazily getting a blob later, and the server breaks that promise,
you're left with a corrupted local repo (most git commands will just
fail).

Some sub-mode where you can declare that only some branches implicitly
promise that they have lazy blobs would be useful, but it wasn't clear
to me how hard such a thing would be to implement.

In any case, this is something that needs active server cooperation,
and is very unlikely to be deployed by people who don't know the
caveats involved, so I for one am all for getting this in even if there
are some significant caveats like that.
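
For anyone who wants to play with this, here's a minimal sketch of how
I understand the interface in the series that's in `master` (the
--filter values and uploadpack.* config keys are from my reading of the
patches, so double-check them against the docs before relying on this):

    # Server side: allow filtered fetches, plus the "fetch any SHA"
    # flag mentioned above so omitted objects can be backfilled later
    git -C /srv/git/project.git config uploadpack.allowFilter true
    git -C /srv/git/project.git config uploadpack.allowAnySHA1InWant true

    # Client side: clone with all blobs omitted; they get fetched on
    # demand when the checkout (or a later command) needs them
    git clone --filter=blob:none ssh://host/srv/git/project.git

    # Or only omit blobs above a size threshold
    git clone --filter=blob:limit=1m ssh://host/srv/git/project.git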

> Protocol V2 (Brandon)
> [...]
>  - (peff) Time to deprecate the git anonymous protocol?
>    - Biggest pain to sneak information into
>    - Shawn/Johannes added in additional parameters after a null byte
>    - Bug; if there's anything other than host, then die
>    - But it doesn't check anything after _two_ null bytes.
>    - "Two null bytes, for protocol V2"
>    - Only in use by github and individual users
>    - Would not be too sad if daemon went away
>    - Git for Windows has interest in daemon
>    - Still interested in simple HTTP wrapper?
>    - HTTP deployment could be made easier
>    - Useful for unauthenticated internal push
>    - Perhaps make the daemon use HTTPS?  Because it needs to be _simple_
>    - Currently run out of inittab

I think the conclusion was that nobody cares about the git:// protocol
as such, but people do care about it being super easy to spin up a
server, and right now git:// is the easiest thing to spin up. We could
get around that by shipping some git-daemon mode that comes with a
stand-alone webserver (or ssh server).
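
Concretely, the "easiest to spin up" case today is something like this
(a sketch using the stock git-daemon flags; adjust paths to taste):

    # Serve everything under /srv/git read-only over git:// (port 9418)
    git daemon --reuseaddr --base-path=/srv/git --export-all /srv/git

    # The "unauthenticated internal push" case mentioned above
    git daemon --reuseaddr --base-path=/srv/git --export-all \
        --enable=receive-pack /srv/git

Getting the same one-liner experience over HTTP means wiring up
git-http-backend behind a real webserver, which is the "could be made
easier" part.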

>  - Series as currently out
>    - Only used for local operations
>    - Not confident on remote CURL
>    - Once jgit implementation is done, should be more confident
>    - e.g. authentication may be messed up
>    - only file:// is currently in production
>    - test scripts to exercise HTTP, so only thing unknown is auth
>    - May need interop tests? there is one, but not part as standard tests
>    - Dscho can set up something in VSTS infra to allow these two versions to be tested
>    - Tests should specify their versions; might be as simple as `cd ...; make` and maybe they should be in Travis

FWIW "local operations" here refers to `git clone file://` and the like
which Google apparently does a lot of with git, and is stess testing the
v2 protocol.
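
If you want to poke at it yourself, the file:// transport is the easy
one to exercise (assuming the protocol.version knob from Brandon's
series; GIT_TRACE_PACKET dumps the pkt-line exchange so you can see the
new capability advertisement):

    # Clone over file:// speaking v2 and trace the wire protocol
    GIT_TRACE_PACKET=1 git -c protocol.version=2 \
        clone file:///path/to/repo.git /tmp/v2-test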

> [...]
>  - some hash functions are in silicon (e.g. microsoft cares)

FWIW this refers to https://en.wikipedia.org/wiki/Intel_SHA_extensions &
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500e/CJHDEBAF.html
among others. Previous on-list discussion at
https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@xxxxxxxxxxxxxx/
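
On Linux you can check whether your CPU has the Intel variant with
something like this (IIRC the cpuinfo flag is "sha_ni"):

    # Prints "sha_ni" once if the SHA extensions are available
    grep -o -m1 sha_ni /proc/cpuinfo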


