Hi all,

Sorry, life got in the way at an unfortunate moment. And it should very much be tagged "RFC": thanks Ævar and Bagas for reading. Here's the additional background you could have used earlier on. I've bundled it together, but I'll happily follow up specific questions individually. I've CCed a couple of other people who might find it interesting too.

Andrew's and my motivation here is to provide some specialised filtering at clone/fetch time. In Kart [1], datasets are organised (simplistically) by primary key, but for spatial data we want to provide an orthogonal spatial-extent filter which isn't part of the tree path, so we can't reuse the work done in the sparse filters. For a fetch, the server side will obviously need to support any indexing, and it ultimately decides whether a particular blob should be part of the tree or not.

In the original filter implementation [2], various "profiles" were alluded to as a case where the server operator might know a lot more than the client does about how a developer would want to use the repository, and a named profile for the server to interpret would be a reasonably clean approach. It's referred to again in [3]. Sparse filters, assuming their performance issues are improved by the cone-mode changes, cater to a lot of those use cases. The existing built-in filters are fairly simple, and there's a relatively simple interface for them to implement, so extending that mechanism seems like a reasonable approach to me, potentially allowing people doing interesting things with partial clones to take it and run with it in a general way without too much effort.

So the key element to clarify/understand for this proposal is that the main change to Git is the ability to use `--filter=extension:<name>[=<param>]`. This passes through to git-upload-pack on the server side, and then to rev-list, which looks up and validates the filter name/parameter and applies it.
So if you want to offer a custom filter, you build and set it up on the server, and any Git client (if this is merged) can make use of it without any additional code.

Wrt IPC: my very first proof of concept used an external process that rev-list launched, passing a series of oids/types via stdin and receiving yes/no responses via stdout. Even after quite a lot of OS-specific effort to optimise the data flow across the pipes, it was slow for non-trivially-sized repositories (where it matters), essentially boiling down to too much context switching between processes. Reorganising the existing filtering approach to do batching with deferred responses, and parallelising the filtering into threads, seemed like an awful lot of effort for potentially little gain in a niche use case. Moving it in-process made it perform well: CPU use moves into the "deciding whether this object is in or out" phase rather than being burnt on IPC and context switching.

I did build up a basic runtime-loadable plugin approach, but there was a reasonable amount of the internal Git API that the filters needed/touched (even things like hash sizes add a pile of complexity) unless it was reduced back to passing oids+types. My benchmark for plugins was basically "could I potentially implement the existing filters?", and without more of the Git API I don't think that would be feasible. Plus Git would have to agree on and support a public ABI going forward, which for a potentially niche use case didn't seem reasonable to propose.

Hence compile time: it's simpler; there are no ABI issues; the internal API doesn't change that much wrt the things filters are likely to do; if someone creates a plugin then it's on them to keep it building across Git upgrades on their server; platform support is simpler; and if others find exciting uses for it then a runtime-loadable plugin API is always possible in future. And only the server ever needs any custom binaries.
Licensing: yes, any filters would need to be GPL-licensed since they're compiled into Git. Only the server operator needs to concern themselves with complying with this (and with the associated licensing of any external libraries a plugin might need), since that's where the plugin code is linked and runs. The usual caveat applies that internal use within an organisation doesn't qualify as "distribution" under the GPL. FWIW, for Kart we'll be GPL-licensing the server-side spatial filter plugin code for anyone who's interested.

Hope this clarifies a bit.

Rob :)

[1] https://kartproject.org (building on Git to version geospatial datasets). Not sure if the videos ever got released (thanks, Covid), but I did a talk at Git Merge 2020 on it when we released the first alpha.
[2] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@xxxxxxxxxxxxxxxxx/
[3] https://public-inbox.org/git/79b06312-75ca-5a50-c337-dc6715305edb@xxxxxxxxxxxxxxxxx/