Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading


 





On 10/19/2017 1:22 AM, Junio C Hamano wrote:
> Ben Peart <benpeart@xxxxxxxxxxxxx> writes:
>
>> There is code in post_read_index_from() to catch out of order entries
>> when reading an index file.  This order verification is ~13% of the cost
>> of every call to read_index_from().
>
> I find this a bit over-generalized claim---wouldn't the overhead
> depend on various conditions, e.g. the size of the index and if
> split-index is in effect?


I tested it against indexes of 10K, 100K, and 1,000K files.  The
absolute time varies with the size of the index, but the percentage is
relatively consistent (after doing multiple runs and averaging the
results to reduce noise).  I didn't measure it with split-index, so I
can't say how that would affect performance.
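
For reference, the check being timed is a single linear pass over the
just-loaded cache array, comparing each entry's name with the one
before it; this is paraphrased from check_ce_order() in read-cache.c,
so details may differ slightly from the current tree:

    static void check_ce_order(struct index_state *istate)
    {
        unsigned int i;

        for (i = 1; i < istate->cache_nr; i++) {
            struct cache_entry *ce = istate->cache[i - 1];
            struct cache_entry *next_ce = istate->cache[i];
            int cmp = strcmp(ce->name, next_ce->name);

            if (cmp > 0)
                die("unordered stage entries in index");
            if (!cmp) {
                if (!ce_stage(ce))
                    die("multiple stage entries for merged file '%s'",
                        ce->name);
                if (ce_stage(ce) > ce_stage(next_ce))
                    die("unordered stage entries for '%s'",
                        ce->name);
            }
        }
    }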

> In general, I get very skeptical towards any change that makes the
> integrity of the data less certain based only on microbenchmarks,
> and prefer to see a solution that can absorb the overhead in some
> other way.
>
> When we are using split-index, the current code is not validating
> the two input files from the disk. Because merge_base_index()
> depends on the base to be properly sorted before the overriding
> entries are added into it, if the input from disk is in a wrong
> order, we are screwed already, and the order check in post
> processing is pointless.

The original commit message doesn't say *why* this test was added, so I
have to make some educated guesses.  Given that no attempt is made to
recover or continue and we simply die when an out-of-order entry is
detected, I'm assuming check_ce_order() is protecting against buggy
code that incorrectly wrote out an invalid index (the SHA-1 checksum
would already have detected file corruption or torn writes).

If we are guarding against "git" writing out an invalid index, we can move this into an assert so that only git developers pay the cost of validating they haven't created a new bug. I think this is better than just adding a new test case as a new test case would not achieve the same coverage. This is my preferred solution.

If we are guarding against "some other application" writing out an invalid index, then everyone will have to pay the cost as we can't insert the test into "some other applications." Without user reports of it happening or any telemetry saying it has happened I really have no idea if it every actually happens in the wild anymore and whether the cost on every index load is still justified.

> If we want to do this order validation, I
> think we should be doing it in do_read_index() where it does
> create_from_disk() and the set_index_entry(), instead of having it
> as a separate phase that scans a potentially large index array one
> more time.  And doing so will not penalize the case where we do not
> use split-index, either.
>
> So, I think I like the direction of getting rid of the order
> validation in post_read_index_from(), not only during the normal
> operation but also in fsck.  I think it makes more sense to do so
> incrementally inside do_read_index() all the time and see how fast
> we can make it do so.


Unfortunately, all the cost is in the strcmp(); the other checks are
negligible.  Moving it to run incrementally inside do_read_index()
won't reduce the cost, it just relocates it and makes it harder to
identify.
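
For what it's worth, the incremental form would be roughly the fragment
below, dropped into the per-entry loop of do_read_index(); the prev_ce
bookkeeping is hypothetical, and the one strcmp() per entry is the same
work the separate pass does today, which is why it only moves the cost:

    /* sketch: inside the entry-reading loop of do_read_index() */
    set_index_entry(istate, i, ce);
    if (prev_ce && strcmp(prev_ce->name, ce->name) > 0)
        die("unordered stage entries in index");
    prev_ce = ce;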

The only idea I could come up with for reducing the cost to end users
is to keep the check separate and split it across multiple threads
above some minimum index-size threshold, as we have done elsewhere.
That adds code and complexity we'll have to maintain forever, so it is
less preferred than making it an assert.
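
For completeness, the threaded variant I'm picturing is roughly the
following.  It is illustrative only: the names, the thread count, and
the 100K-entry threshold are all made up, and the stage-related checks
are omitted to keep the sketch short:

    #include <pthread.h>

    #define CHECK_ORDER_MIN_ENTRIES 100000  /* hypothetical threshold */
    #define CHECK_ORDER_THREADS 4           /* hypothetical thread count */

    struct order_slice {
        struct index_state *istate;
        unsigned int begin, end;   /* compare [begin, end) to predecessor */
        int bad;
    };

    static void *check_order_slice(void *arg)
    {
        struct order_slice *s = arg;
        unsigned int i;

        for (i = s->begin; i < s->end; i++)
            if (strcmp(s->istate->cache[i - 1]->name,
                       s->istate->cache[i]->name) > 0)
                s->bad = 1;
        return NULL;
    }

    static void check_ce_order_threaded(struct index_state *istate)
    {
        struct order_slice slices[CHECK_ORDER_THREADS];
        pthread_t threads[CHECK_ORDER_THREADS];
        unsigned int per_thread, i;

        if (istate->cache_nr < CHECK_ORDER_MIN_ENTRIES) {
            check_ce_order(istate);  /* small index: single pass as today */
            return;
        }

        per_thread = (istate->cache_nr + CHECK_ORDER_THREADS - 1)
                     / CHECK_ORDER_THREADS;
        for (i = 0; i < CHECK_ORDER_THREADS; i++) {
            slices[i].istate = istate;
            slices[i].begin = i ? i * per_thread : 1;
            slices[i].end = (i + 1) * per_thread;
            if (slices[i].end > istate->cache_nr)
                slices[i].end = istate->cache_nr;
            slices[i].bad = 0;
            pthread_create(&threads[i], NULL, check_order_slice, &slices[i]);
        }
        for (i = 0; i < CHECK_ORDER_THREADS; i++) {
            pthread_join(threads[i], NULL);
            if (slices[i].bad)
                die("unordered stage entries in index");
        }
    }

Each slice compares its entries against their immediate predecessor, so
slice boundaries are still covered and every adjacent pair is checked
exactly once; the die() is deferred until all threads have joined.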



