Re: [PATCH v1] read-cache: speed up index load through parallelization

Ben Peart <peartben@xxxxxxxxx> · Fri, 24 Aug 2018 15:57:40 -0400

On 8/24/2018 3:00 PM, Duy Nguyen wrote:
On Fri, Aug 24, 2018 at 8:40 PM Ben Peart <peartben@xxxxxxxxx> wrote:

On 8/24/2018 2:20 PM, Duy Nguyen wrote:
On Fri, Aug 24, 2018 at 5:37 PM Duy Nguyen <pclouds@xxxxxxxxx> wrote:
On Thu, Aug 23, 2018 at 10:36 PM Ben Peart <peartben@xxxxxxxxx> wrote:
Nice to see this done without a new index extension that records
offsets, so that we can load existing index files in parallel.

Yes, I prefer this simpler model as well.  I wasn't sure it would
produce a significant improvement given the primary thread still has to
run through the variable length cache entries but was pleasantly surprised.

Out of curiosity, how much time saving could we gain by recording
offsets as an extension (I assume we need, like 4 offsets if the
system has 4 cores)? Much much more than this simpler model (which may
justify the complexity) or just "meh" compared to this?

To answer my own question, I ran a patched git to precalculate
individual thread parameters, removed the scheduler code and hard
coded these parameters (I ran just 4 threads, one per core). I got
0m2.949s (webkit.git, 275k files, 100 read-cache runs). Compared to
0m4.996s from Ben's patch (same test settings of course) I think it's
definitely worth adding some extra complexity.

I took a run at doing that last year [1] but that was before the
mem_pool work that allowed us to avoid the thread contention on the heap
so the numbers aren't an apples to apples comparison (they would be
better today).

Ah.. sorry I was not aware. A big chunk of 2017 is blank to me when it
comes to git.

The trade-off is the additional complexity to be able to load the index
extension without having to parse through all the variable length cache
entries.  My patch worked but there was feedback requested to make it
more generic and robust that I haven't gotten around to yet.

One more comment. Instead of forcing this special index at the bottom,
add a generic one that gives positions of all extensions and put that
one at the bottom. Then you can still quickly locate your offset table
extension, and you could load UNTR and TREE extensions in parallel too
(those scale up to worktree size)

That is pretty much what Junio's feedback was and what I was referring 
to as making it "more generic."  The "more robust" was the request to 
add a SHA to the extension ensure it wasn't corrupt and was a valid 
extension.

This patch series went for simplicity over absolutely the best possible
performance.

Well, you know my stance on this now :) Not that it really matters.

[1]
https://public-inbox.org/git/20171109141737.47976-1-benpeart@xxxxxxxxxxxxx/

PS. I still think it's worth bring v4's performance back to v2. It's
low hanging fruit because I'm pretty sure Junio did not add v4 code
with cpu performance in mind. It was about file size at that time and
cpu consumption was still dwarfed by hashing.

I see that as a nice follow up patch.  If the extension exists, use it 
and jump directly to the blocks and spin up threads.  If it doesn't 
exist, fall back to the code in this patch that has to find/compute the 
blocks on the fly.