RE: BlueStore Performance issue

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Thursday, March 10, 2016 5:45 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: BlueStore Performance issue
>
> On Thu, 10 Mar 2016, Allen Samuels wrote:
> > Thanks for the information on the allocator; many sections of code that we
> > didn't understand, or thought weren't relevant, are clearer now. I
> > believe we now understand the deeper coupling between the freespace
> > management and the transaction commit ordering logic.
> >
> > The root cause is that we missed the synchronous/asynchronous commits
> > in kv_sync_thread when this was mapped into ZetaScale's commit --
> > which only has a synchronous transaction commit. So in the short term,
> > we'll do some in-memory caching which will effectively create the
> > equivalent of sync and async transactions in ZetaScale. I'm not yet
> > convinced that this is the best long term solution, but it should get
> > us past this particular problem right now which is more important.
>
> Good to hear.  Let's see how it goes...
>
> > On another front, I've expressed concern about the memory consumption
> > associated with the current allocation implementation. Please verify
> > these computations...
> >
> > On a 64-bit x86 machine, I used the google btree_map code and
> > populated it with a worst-case allocation scheme -- which is every
> > other block being free and then measured the memory consumption. It
> > turns out that for large maps it consumes about 20 bytes per entry
> > (pretty good for an 8-byte key and an 8-byte value!).
> >
> > So for a worst-case allocation pattern and the default 64K allocation
> > block size, the current code will consume 40 bytes per 128KB of
> > storage
> > (2^17) [I'm assuming StupidAllocator also gets converted to btree_map].
> > This means that for each TB of storage you'll need 40*(2^40/2^17) =>
> > 320MB of memory per TB of storage. Current HW recommendations are
> > 1-2GB of DRAM for each TB of storage. This is a pretty big chunk of that
> > memory, which I'm certain could be put to better use.
> >
> > Alternatively, if you use a simple bit vector with a 4KB block size
> > (each bit represents 4KB), then you only need (2^40/2^12/2^3) which is
> > 2^25 or 32MB of DRAM per TB. Of course the current code uses two of
> > those vectors which would be 64MB total for each TB. (And yes, you'll
> > use more memory to allow efficient searching of those vectors, but
> > that memory is only an additional 3-4%).
>
> What would the search data structures look like in this case?

I'll write up a couple of possibilities on this later today.
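
To give a rough flavor of one possibility in the meantime (just a sketch, not proposed code): keep the flat per-4K-block bitmap, plus one summary bit per 64-bit word of the bitmap meaning "this word contains at least one free block." The summary is 1/64th the size of the bitmap (~0.5MB per TB), and a free-block search consults it to skip fully allocated words. Something like:

#include <cstdint>
#include <vector>

// Sketch: 1 bit per 4 KiB block (1 = free), plus a one-bit-per-word summary
// so searches can skip fully allocated words without touching them.
class BitmapFreeSpace {
  std::vector<uint64_t> bits;     // block bitmap
  std::vector<uint64_t> summary;  // bit i set => bits[i] has at least one free bit

public:
  explicit BitmapFreeSpace(uint64_t nblocks)
    : bits((nblocks + 63) / 64, 0),
      summary((bits.size() + 63) / 64, 0) {}

  void set_free(uint64_t block) {
    uint64_t w = block / 64;
    bits[w] |= 1ull << (block % 64);
    summary[w / 64] |= 1ull << (w % 64);
  }

  void set_used(uint64_t block) {
    uint64_t w = block / 64;
    bits[w] &= ~(1ull << (block % 64));
    if (bits[w] == 0)
      summary[w / 64] &= ~(1ull << (w % 64));
  }

  // First free block at or after 'start', or -1 if none.
  int64_t find_free(uint64_t start) const {
    uint64_t w = start / 64;
    if (w >= bits.size())
      return -1;
    uint64_t m = bits[w] >> (start % 64);        // free bits at/after 'start'
    if (m)
      return start + __builtin_ctzll(m);
    ++w;
    for (uint64_t s = w / 64; s < summary.size(); ++s) {
      uint64_t sm = summary[s];
      if (s == w / 64 && (w % 64))
        sm &= ~((1ull << (w % 64)) - 1);         // ignore words before 'w'
      if (!sm)
        continue;                                // skip 64*64 = 4096 blocks at once
      uint64_t word = s * 64 + __builtin_ctzll(sm);
      return word * 64 + __builtin_ctzll(bits[word]);
    }
    return -1;
  }
};

In practice you would probably add another summary level (or per-region free counters) on top to bound long scans and allow per-region locking; that is roughly where the additional few percent of memory mentioned above would go.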

>
> > Another important benefit of this is that the WAL code need only kick
> > in for operations that are less than 4K rather than the current 64K.
> > This is a big reduction in write-amplification for these operations
> > which should translate directly into improved throughput, especially
> > for such benchmark critical areas as random 4K writes...
> >
> > While not terribly useful on hybrid or HDD systems, the bitmap-based
> > code has MUCH shorter CPU paths than the current code. On an all-flash
> > OSD system this will directly translate into more performance
> > (quantity unknown of course).
> >
> > While the memory consumption reduction is nice, for me the significant
> > performance improvement implied by the virtual elimination of the WAL is the
> > compelling factor.
>
> I wouldn't conflate the allocation size and freelist representation; we get the
> same avoidance of the WAL path with the current code by changing
> min_alloc_size to 4k.  Making the freelist representation more efficient is
> important for small block sizes, but *any* memory-efficient strategy is fine
> (and even the current one is probably fine for most workloads).  For
> example, we could keep an extent-based representation and page in regions
> instead...

Perhaps I made my point inartfully. I just wanted to say that if you set min_alloc_size to 4K you avoid the WAL path, but you spend more memory: in the worst case the consumption grows 16x, i.e. 16 * 320MB => 5GB per TB of storage. While I agree that the true worst-case pattern requires a pathological use-case, I am concerned that normal use-cases will still consume unreasonably large amounts of memory -- leading to unreliable systems.
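
For reference, the arithmetic in compilable form (a sketch only; it assumes, as above, ~40 bytes of map overhead per worst-case free extent and one free extent per two allocation units):

#include <cstdint>

// Worst case: every other allocation unit free => one free extent per
// two units; ~40 bytes of btree overhead per free extent (figure above).
constexpr uint64_t TB          = 1ull << 40;
constexpr uint64_t bytes_per_e = 40;
constexpr uint64_t extents_64k = TB / (2 * 65536);   // 2^23 free extents
constexpr uint64_t extents_4k  = TB / (2 * 4096);    // 2^27 free extents
static_assert(bytes_per_e * extents_64k == 320ull << 20, "320 MB per TB at 64 KB");
static_assert(bytes_per_e * extents_4k  ==   5ull << 30, "5 GB per TB at 4 KB");
// Bitmap alternative: 1 bit per 4 KB block => 2^40/2^12/2^3 bytes = 32 MB per TB.
static_assert((TB / 4096) / 8 == 32ull << 20, "32 MB per TB bitmap");

The static_asserts check out, so at least the binary points are in the right places.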

I believe that we simply don't want to be using that much memory for this part of the system. There are other tradeoffs (time, complexity, latency, etc.) that could significantly reduce memory consumption. Let's explore these.

>
> The other thing to keep in mind is that the freelist is represented
> twice: once in memory in indexed form in StupidAllocator, and once in
> FreelistManager just to ensure it's persisted properly.  In the second case, I
> suspect we could leverage a custom rocksdb merge operator to avoid that
> representation entirely so that adjacent extents are coalesced when the sst
> is generated (when writing to l0 or during compaction).  I'm guessing that
> functionality isn't present in ZS though?

Not at present, but since I control the developers, that's something that could be added. Clearly the global serialization of KV commits could be addressed if a different mechanism were available for representing the allocation lists in the KV store, and some kind of merging primitive would allow exactly that. I was going to raise this issue yesterday but tabled it. Let's continue this conversation.
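
To make that concrete, here is roughly what such a primitive looks like on the RocksDB side. This is only a sketch, and it swaps the extent-coalescing idea for a simpler bitmap-delta scheme (key = region offset, value = that region's allocation bitmap, merge = XOR of per-transaction deltas), but the effect is similar: the commit path never has to read-modify-write the freelist, and the deltas get folded together at flush/compaction time.

#include <algorithm>
#include <string>
#include <rocksdb/merge_operator.h>
#include <rocksdb/slice.h>

// Each freelist key covers a fixed-size region of the device; values are
// equal-length bitmaps.  A transaction records only the bits it flipped,
// and RocksDB XORs the deltas together whenever it materializes the value.
class BitmapXorMergeOperator : public rocksdb::AssociativeMergeOperator {
public:
  bool Merge(const rocksdb::Slice& key,
             const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value,
             std::string* new_value,
             rocksdb::Logger* logger) const override {
    if (!existing_value) {
      // First delta for this region: the delta is the value.
      new_value->assign(value.data(), value.size());
      return true;
    }
    // XOR the new delta into the existing bitmap (same fixed size).
    new_value->assign(existing_value->data(), existing_value->size());
    size_t n = std::min(new_value->size(), value.size());
    for (size_t i = 0; i < n; ++i)
      (*new_value)[i] ^= value[i];
    return true;
  }

  const char* Name() const override { return "BitmapXorMergeOperator"; }
};

The DB would be opened with options.merge_operator set to this operator, and the commit path would issue db->Merge() calls carrying just the flipped bits rather than a Put of the whole region. ZS would need an equivalent merge/fold hook for us to do the same there.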

>
> sage
>
>
> > Of course, if I've misplaced a binary (or decimal) point in my
> > computation, then please ignore.
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> >
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Wednesday, March 09, 2016 7:38 AM
> > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > Subject: RE: BlueStore Performance issue
> >
> > On Wed, 9 Mar 2016, Allen Samuels wrote:
> > > > Stage 3 is a serious bottleneck in that it guarantees that you will
> > > > never exceed QD=1 for your logging device. We believe there is no
> > > > need to serialize the KV commit operations.
> > >
> > > It's potentially a bottleneck, yes, but it's also what keeps the
> > > commit rate self-throttling.  If we assume that there are generally
> > > lots of other IOs in flight because every op isn't metadata-only, the
> > > QD will be higher.
> > >
> > > If it's a separate log device, though, yes.. it will have QD=1.  In
> > > those situations, though, the log device is probably faster than the
> > > other devices, and a shallow QD probably isn't going to limit
> > > throughput--just marginally increase latency?
> > >
> > > [Allen] No, a shallow queue depth will directly impact BW on many
> > > (most?/all?) SSDs. I agree that in a hybrid model (DB on flash, data
> > > on
> > > HDD) that the delivered performance delta may not be large. As for
> > > the throttling, we haven't focused on that area yet (just enough to
> > > put it on the list of future things to investigate).
> >
> > FWIW in the single-device non-hybrid case, the QD=1 for *kv* IOs, but
> > there will generally be a whole bunch of non-kv reads and writes also
> > in flight.  I wouldn't expect us to ever actually have a QD of 1
> > unless
> > *every* operation is pure-kv (say, omap operations).
> >
> > For example, say we have 4 KB random writes, and the QD at the OSD level
> > is 64.  In that case, BlueStore should have up to 64 4 KB aio writes in
> > flight, plus 0 to 1 batched kv writes (of size 64*whatever).
> >
> > My intuition says that funneling the txn commits like this will in practice
> > maybe halve the effective QD at the device (compared to the QD at the
> > OSD)... does that seem about right?  Maybe it'd stay about the same, since 1
> > OSD IO is actually between 1 and 2 device IOs (the aio write + the [maybe
> > batched] txn commit).
> >
> > sage