On Thu, 10 Mar 2016, Allen Samuels wrote:
> > > Another important benefit of this is that the WAL code need only
> > > kick in for operations that are less than 4K rather than the
> > > current 64K. This is a big reduction in write-amplification for
> > > these operations, which should translate directly into improved
> > > throughput, especially for such benchmark-critical areas as random
> > > 4K writes...
> > >
> > > While not terribly useful on hybrid or HDD systems, the bitmap-based
> > > code has MUCH shorter CPU paths than does the current code. On an
> > > all-flash OSD system this will directly translate into more
> > > performance (quantity unknown, of course).
> > >
> > > While the memory consumption reduction is nice, for me the
> > > significant performance improvement implied by the virtual
> > > elimination of WAL is the compelling factor.
> >
> > I wouldn't conflate the allocation size and freelist representation;
> > we get the same avoidance of the WAL path with the current code by
> > changing min_alloc_size to 4k. Making the freelist representation
> > more efficient is important for small block sizes, but *any*
> > memory-efficient strategy is fine (and even the current one is
> > probably fine for most workloads). For example, we could keep an
> > extent-based representation and page in regions instead...
>
> Perhaps I inartfully made my point. I just wanted to say that if you
> set min_alloc_size to 4K you avoid the WAL stuff but you spend more
> memory; in the worst case the memory consumption would be 16 * 320MB
> => 5GB per TB of storage. While I agree that the true worst-case
> pattern requires a pathological use-case, I am concerned that normal
> use-cases will still consume unreasonably large amounts of memory --
> leading to unreliable systems.
>
> I believe that we simply don't want to be using that much memory for
> this part of the system. There are other tradeoffs (time, complexity,
> latency, etc.) that could significantly reduce memory consumption.
> Let's explore these.

Agreed. :)

> > The other thing to keep in mind is that the freelist is represented
> > twice: once in memory in indexed form in StupidAllocator, and once in
> > FreelistManager just to ensure it's persisted properly. In the
> > second case, I suspect we could leverage a custom rocksdb merge
> > operator to avoid that representation entirely so that adjacent
> > extents are coalesced when the sst is generated (when writing to L0
> > or during compaction). I'm guessing that functionality isn't present
> > in ZS, though?
>
> Not at present. But since I control the developers, that is something
> that could be added. Clearly it's possible to address the global
> serialization of KV commits if a different mechanism was available for
> representing the allocation lists in the KV store. Having some kind of
> merging primitive allows that to be done. I was going to raise exactly
> this issue yesterday, but tabled it.

Let's continue this conversation.

I haven't really done my homework with the existing rocksdb merge
operators to confirm that they would work in this way. In particular, I
think we want them to merge adjacent keys together (at least if we
stick with the naive offset=length kv representation), and I'm afraid
they might be intended to merge values with matching keys.

In any case, though, my hope would be that we'd settle on a single
strategy that would work across both ZS and rocksdb, as it'd simplify
our life a bit.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
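[Editor's note: Allen's "16 * 320MB => 5GB per TB" worst case can be sanity-checked with a quick sketch. The thread doesn't state the per-extent in-memory cost; the 20-byte figure below is an assumption chosen because it reproduces both of the quoted numbers (a fully fragmented freelist, one entry per allocation unit).]

```python
TB = 2 ** 40
ENTRY_BYTES = 20  # assumed in-memory cost per freelist extent entry (not stated in the thread)

def worst_case_mb(alloc_size):
    """Worst-case freelist memory per TB: every allocation unit is its own extent."""
    units = TB // alloc_size          # number of allocation units in 1 TB
    return units * ENTRY_BYTES / 2 ** 20

print(worst_case_mb(64 * 1024))  # 320.0 MB per TB at 64K min_alloc_size
print(worst_case_mb(4 * 1024))   # 5120.0 MB = 5 GB per TB at 4K (16x more units)
```

Shrinking min_alloc_size from 64K to 4K multiplies the unit count by 16, which is exactly where the 16 * 320MB in the quoted text comes from.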
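[Editor's note: the "merging primitive" being discussed would have to coalesce *adjacent* keys under the naive offset=length representation, rather than merge values sharing one key as rocksdb's merge operator does. A hypothetical sketch of that coalescing step, the logic an SST-generation or compaction hook would apply to a sorted run of free extents:]

```python
def coalesce(extents):
    """Merge adjacent free extents.

    extents: iterable of (offset, length) pairs, as in the naive
    offset=length kv representation. Returns a sorted list in which any
    extent that begins exactly where the previous one ends has been
    folded into it.
    """
    out = []
    for off, length in sorted(extents):
        if out and out[-1][0] + out[-1][1] == off:
            prev_off, prev_len = out[-1]
            out[-1] = (prev_off, prev_len + length)  # contiguous: extend previous
        else:
            out.append((off, length))
    return out

# Two contiguous 4K extents collapse; the third stands alone.
print(coalesce([(0, 4096), (4096, 4096), (16384, 4096)]))
# -> [(0, 8192), (16384, 4096)]
```

This is only an illustration of the required semantics, not anything present in rocksdb or ZS; a real implementation would run inside the KV store's compaction path so the coalesced form is what lands on disk.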