Re: [PATCH 09/16] documentation: add documentation for the bitmap format

Vicent Martí <tanoku@xxxxxxxxx> · Thu, 27 Jun 2013 04:36:54 +0200

That was a very rude reply. :(

Please refrain from interacting with me in the ML in the future. I'l
do accordingly.

Thanks!
vmg

On Thu, Jun 27, 2013 at 3:11 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> On Tue, Jun 25, 2013 at 4:08 PM, Vicent Martí <tanoku@xxxxxxxxx> wrote:
>> On Tue, Jun 25, 2013 at 11:17 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>>> What case are you talking about?
>>>
>>> The n-th object must be one of these four types and can never be of
>>> more than one type at the same time, so a natural expectation from
>>> the reader is "If you OR them together, you will get the same set".
>>> If you say "If you XOR them", that forces the reader to wonder when
>>> these bitmaps ever can overlap at the same bit position.
>>
>> I guess this is just wording. I don't particularly care about the
>> distinction, but I'll change it to OR.
>
> Hmm, OK. If you think XOR and OR are the same operation, I also have a
> bridge to sell you. Its in Brooklyn. Its a great value.
>
> The correct operation is OR. Not XOR. OR. Drop the X.
>
>> It cannot be mmapped not particularly because of endianness issues,
>> but because the original format is not indexed and requires a full
>> parse of the whole index before it can be accessed programatically.
>> The wrong endianness just increases the parse time.
>
> Wrong endianness has nothing to do with the parse time. Modern CPUs
> can flip a word around very quickly. In JGit we chose to parse the
> file at load time because its simpler than having an additional index
> segment, and we do what you did which is to toss the object SHA-1s
> into a hashtable for fast lookup. By the time we look for the SHA-1s
> and toss them into a hashtable we can stride through the file and find
> the bitmap regions. Simple.
>
> In other words, the least complex solution possible that still
> provides good performance. I'd say we have pretty good performance.
>
>>>> and I'm going to try to make it run fast enough in that
>>>> encoding.
>>>
>>> Hmph.  Is it an option to start from what JGit does, so that people
>>> can use both JGit and your code on the same repository?
>
> I'm afraid I agree here with Junio. The JGit format is already
> shipping in JGit 3.0, Gerrit Code Review 2.6, and in heavy production
> use for almost a year on android.googlesource.com, and Google's own
> internal Git trees.
>
> I would prefer to see a series adding bitmap support to C Git start
> with the existing format, make it run, taking advantage of the
> optimizations JGit uses (many of which you ignored and tried to "fix"
> in other ways), and then look at improving the file format itself if
> load time is still the largest low hanging fruit in upload-pack. I'm
> guessing its not. You futzed around with the object table, but JGit
> sped itself up considerably by simply not using the object table when
> the bitmap is used. I think there are several such optimizations you
> missed in your rush to redefine the file format.
>
>>>  And then if
>>> you do not succeed, after trying to optimize in-core processing
>>> using that on-disk format to make it fast enough, start thinking
>>> about tweaking the on-disk format?
>>
>> I'm afraid this is not an option. I have an old patchset that
>> implements JGit v1 bitmap loading (and in fact that's how I initially
>> developed these series -- by loading the bitmaps from JGit for
>> debugging), but I discarded it because it simply doesn't pan out in
>> production. ~3 seconds time to spawn `upload-pack` is not an option
>> for us. I did not develop a tweaked on-disk format out of boredom.
>
> I think your code or experiments are bogus. Even on our systems with
> JGit a cold start for the Linux kernel doesn't take 3s. And this is
> JGit where Java is slow because "Jesus it has a lot of factories", and
> without mmap'ing the file into the server's address space. Hell the
> file has to come over the network from a remote disk array.
>
>> I could dig up the patch if you're particularly interested in
>> backwards compatibility, but since it was several times slower than
>> the current iteration, I have no interest (time, actually) to maintain
>> it, brush it up, and so on. I have already offered myself to port the
>> v2 format to JGit as soon as it's settled. It sounds like a better
>> investment of all our times.
>
> Actually, I think the format you propose here is inferior to the JGit
> format. In particular the idx-ordering means the EWAH code is useless.
> You might as well not use the EWAH format and just store 2.6M bits per
> commit. The idx-ordering also makes *much* harder to emit a pack file
> a reasonable order for the client. Colby and I tried idx-ordering and
> discarded it when it didn't perform as well as the pack-ordering that
> JGit uses.
>
>> Following up on Shawn's comments, I removed the little-endian support
>> from the on-disk format and implemented lazy loading of the bitmaps to
>> make up for it. The result is decent (slowed down from 250ms to 300ms)
>> and it lets us keep the whole format as NWO on disk. I think it's a
>> good tradeback.
>
> The maintenance burden of two endian formats in a single file is too
> high to justify. I'm glad to see you saw that.
>
>> As it stands right now, the only two changes from v1 of the on-disk format are:
>>
>> - There is an index at the end. This is a good idea.
>
> I don't think the index is necessary if you plan to build a hashtable
> at runtime anyway. If you mmap the file you can quickly skip over a
> bitmap and find the next SHA-1 using this thing called "pointer
> arithmetic". I am not sure if you are familiar with the term, perhaps
> you could search the web for it.
>
>> - The bitmaps are sorted in packfile-index order, not in packfile
>> order. This is a good idea.
>
> As Colby and I have repeatedly tried to explain, this is not a good idea.
>
>> German kisses,
>
> Strawberry and now German kisses? What's next, Mango kisses?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html