Re: BlueStore Performance issue

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 10 Mar 2016 10:08:53 -0800



On Thu, Mar 10, 2016 at 8:12 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 10 Mar 2016, Allen Samuels wrote:
>> > > Another important benefit of this is that the WAL code need only kick
>> > > in for operations that are less than 4K rather than the current 64K.
>> > > This is a big reduction in write-amplification for these operations
>> > > which should translate directly into improved throughput, especially
>> > > for such benchmark critical areas as random 4K writes...
>> > >
>> > > While not terribly useful on hybrid or HDD systems, the bitmap based
>> > > code has MUCH shorter CPU paths than does the current code. On an all
>> > > flash OSD system this will directly translate into more performance
>> > > (quantity unknown of course).
>> > >
>> > > While the memory consumption reduction is nice, for me the significant
>> > > performance improvement implied by virtual elimination of WAL is the
>> > > compelling factor.
>> >
>> > I wouldn't conflate the allocation size and freelist representation; we get the
>> > same avoidance of the WAL path with the current code by changing
>> > min_alloc_size to 4k.  Making the freelist representation more efficient is
>> > important for small block sizes, but *any* memory-efficient strategy is fine
>> > (and even the current one is probably fine for most workloads).  For
>> > example, we could keep an extent-based representation and page in regions
>> > instead...
>>
>> Perhaps I inartfully made my point. I just wanted to say that if you set
>> min_alloc_size to 4K you avoid the WAL stuff but you spend more memory,
>> in the worst case the memory consumption would be 16 * 320MB => 5GB per
>> TB of storage. While I agree that the true worst-case pattern requires a
>> pathological use-case, I am concerned that normal use-cases will still
>> consume unreasonably large amounts of memory -- leading to unreliable
>> systems.
>>
>> I believe that we simply don't want to be using that much memory for
>> this part of the system. There are other tradeoffs (time, complexity,
>> latency, etc.) that could significantly reduce memory consumption. Let's
>> explore these.
>
> Agreed.  :)
>
>> > The other thing to keep in mind is that the freelist is represented
>> > twice: once in memory in indexed form in StupidAllocator, and once in
>> > FreelistManager just to ensure it's persisted properly.  In the second case, I
>> > suspect we could leverage a custom rocksdb merge operator to avoid that
>> > representation entirely so that adjacent extents are coalesced when the sst
>> > is generated (when writing to l0 or during compaction).  I'm guessing that
>> > functionality isn't present in ZS though?
>>
>> Not at present. But since I control the developers, that is something
>> that could be added. Clearly it's possible to address the global
>> serialization of KV commits if a different mechanism was available for
>> representing the allocation lists in the KV store. Having some kind of
>> merging primitive allows that to be done. I was going to raise exactly
>> this issue yesterday, but tabled it. Let's continue this conversation.
>
> I haven't really done my homework with the existing rocksdb merge
> operators to confirm that they would work in this way.  In particular, I
> think we want them to merge adjacent keys together (at least if we stick
> with the naive offset=length kv representation), and I'm afraid they might
> be intended to merge values with matching keys.

Yes, that's definitely my recollection of how those operators work.
It's a merge when sstables get put together — anything operating on
adjacent ranges would run into all kinds of problems across sstable
boundaries!
-Greg
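
To illustrate the distinction discussed above, here is a minimal toy sketch,
not RocksDB code. Plain Python dicts stand in for sstables, and the names
merge_same_key and coalesce_adjacent are hypothetical. It assumes the naive
offset=length representation and shows that a merge keyed on matching keys
leaves two adjacent free extents untouched, while coalescing them requires
looking at neighbouring keys.

```python
# Toy model only: dicts stand in for sstables; this is not the RocksDB API.

def merge_same_key(tables, combine):
    """Fold values that share a key, oldest table first (a per-key merge)."""
    out = {}
    for table in tables:
        for key, value in table.items():
            out[key] = combine(out[key], value) if key in out else value
    return out


def coalesce_adjacent(extents):
    """Join free extents whose ranges touch; needs a view of neighbouring keys."""
    merged = []
    for offset, length in sorted(extents.items()):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # Previous extent ends exactly where this one starts: extend it.
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((offset, length))
    return dict(merged)


# Two hypothetical sstables, each holding one offset -> length free extent.
sst_a = {0: 4096}
sst_b = {4096: 4096}

# The keys differ, so a per-key merge has nothing to combine.
per_key = merge_same_key([sst_a, sst_b], combine=lambda old, new: old + new)

# Coalescing works here only because every extent is visible at once.
joined = coalesce_adjacent(per_key)
```

In this sketch per_key still holds two separate extents, while joined holds a
single 8192-byte extent at offset 0. The second step only works because it
sees all keys together, which is the part that gets awkward when adjacent
extents live in different sstables.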


> In any case, though, my
> hope would be that we'd settle on a single strategy that would work across
> both ZS and rocksdb as it'd simplify our life a bit.
>
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html