On Tue, Feb 19, 2019 at 5:11 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Fri, Feb 15, 2019 at 2:07 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
> >
> > Hey Gregory,
> >
> > first of all, please note that BlueStore doesn't allocate chunks smaller
> > than min_alloc_size, which is 16K for SSD and 64K for HDD by default.
>
> Oh, I thought it was 4KB for SSD. Maybe I confused it with the
> prefer_deferred_size below.
>
> > Depending on the incoming block size there are two different write
> > procedures:
> >
> > 1) big writes (block size is aligned with min_alloc_size)
> >
> > 2) small writes (block size is less than min_alloc_size). May be
> > performed via deferred write or directly, see below.
> >
> > Large but unaligned blocks are split and pass through a combination of
> > 1) and 2).
> >
> > 1) always triggers allocation of a new blob and writing to a different
> > location.
> >
> > 2) If we're overwriting existing data, the block passes through the
> > deferred write procedure. This prepares a 4K-aligned block (by merging
> > incoming data, padding, and reading non-overlapped data), puts it into
> > the DB, and then overwrites the disk block at exactly the same location.
> > This provides a sort of WAL (not to be confused with the one in RocksDB).
> >
> > If the write goes to an unused extent and the block size is <=
> > bluestore_prefer_deferred_size (16K for HDD and 0 for SSD), then the
> > deferred procedure is applied as well.
> >
> > Otherwise the block is written directly to disk, along with allocation
> > if needed.
> >
> > Hence in your case (4K overwrites) the scenario looks like:
> >
> > AAAA
> >
> > ABAA + B temporarily put into RocksDB
> >
> > ACAA + C temporarily put into RocksDB
> >
> > etc.
> >
> > And the proposed optimization makes no sense.
> >
> > Indeed, one might observe higher DB load this way.
> >
> > Your original scenario:
> >
> > AAAA
> > A[A]AA...B
> > A[A]AA...[B]...C
> > A[A]AA...[B]...[C]...D
> >
> > is rather about big (16K/64K) writes.
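[Editor's note: the branching Igor describes above can be restated as a small sketch. This is hypothetical Python, not BlueStore's actual code: classify_write and the returned labels are invented, and the constants use the SSD defaults quoted in the thread.]

```python
# Rough sketch of the write-path branching described above. Not actual
# BlueStore code; the function name and return labels are made up.

MIN_ALLOC_SIZE = 16 * 1024        # SSD default (64K for HDD)
PREFER_DEFERRED_SIZE = 0          # SSD default (16K for HDD)

def classify_write(offset, length, overwrites_existing):
    """Return which of the described write paths an IO would take."""
    if offset % MIN_ALLOC_SIZE == 0 and length % MIN_ALLOC_SIZE == 0:
        # 1) big write: allocate a new blob, write to a new location
        return "big"
    if length >= MIN_ALLOC_SIZE:
        # large but unaligned: split into big and small parts
        return "split"
    if overwrites_existing:
        # 2) small overwrite: merge/pad to a 4K-aligned block, stage it
        #    in the DB, then overwrite in place (the deferred "WAL")
        return "small-deferred"
    if length <= PREFER_DEFERRED_SIZE:
        # small write to an unused extent, below the deferred threshold
        return "small-deferred"
    # otherwise: write directly to disk, allocating if needed
    return "small-direct"
```

Under this sketch, a 4K overwrite always takes the deferred path, which is why the ABAA/ACAA scenario above stays in place on disk.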
> > Not sure if any optimization is required here either. Maybe except the
> > case when we want data to be less fragmented (for HDD?). But I doubt
> > this is feasible.
>
> Well, there are two reasons to prefer defragmented data:
> 1) sequential read speeds are higher than random, even on SSD.
> 2) defragmented data means we have to store fewer extents in RocksDB.
>
> I definitely don't know how expensive that is, though!
>
> Or indeed if we actually get IO patterns that would trigger this. I
> was thinking specifically of RBD, so if it's not going to trigger
> patterns like this it's not worth worrying about. Jason, do we have
> any data on what IO sizes tend to look like?

I only have IO data on fio workloads, mongodb, and a commercial VM
backup tool -- all of which were (unfortunately) running against the
Luminous BlueStore stupid allocator instead of the current bitmap
allocator. When paired with a BBU controller w/ writeback and readahead
capabilities, I think it's fair to say that <4MiB sequential IOs are
impacted. Even though librbd sends an allocation hint with each IO, that
is disregarded under BlueStore [1], so you can really only expect
<min-alloc-size> chunks of locality.

For a real-world workload, the mongodb IO sizes were averaging 8-16KiB,
I believe (unknown if a sequential or random pattern, but even if
random, the data would still be virtually nearby seek-wise). The backup
software issued around 130KiB sequential IOs, but from BlueStore's PoV
that translated to lots of 32KiB-64KiB random IOs with long seeks [2].

> -Greg
>
> > Hope this helps.
> >
> > Thanks,
> >
> > Igor
> >
> > On 2/15/2019 3:42 AM, Gregory Farnum wrote:
> > > Hey Igor,
> > > I don't know much about the BlueStore allocator pattern, so I don't
> > > have a clear idea how difficult this is.
> > > But I *believe* we have a common pattern in RBD that might be worth
> > > optimizing for: the repeated-overwrites case.
> > > Often this would be some kind of journal header — either for the FS
> > > stored on top, a database, or whatever — that results in the same 4KB
> > > logical block getting overwritten repeatedly.
> > >
> > > For instance, librbd might write out
> > > AAAA
> > > to an object, then do updates to the second block resulting in a logical
> > > ABAA
> > > ACAA
> > > ADAA
> > > etc.
> > >
> > > I think, from my very limited understanding and what I heard when I
> > > asked this in standup, that right now the layout in BlueStore for this
> > > will tend to be something like
> > > AAAA
> > > A[A]AA...B
> > > A[A]AA...[B]...C
> > > A[A]AA...[B]...[C]...D
> > > where the brackets indicate a deallocated [hole]. I expect that to
> > > happen (certainly for the first overwrite) as long as the incoming IO
> > > is large enough to trigger an immediate write to disk and then an
> > > update to the metadata, rather than stuffing the data in the WAL and
> > > then doing a write-in-place.
> > >
> > > So I wonder: is there any optimization to try to place incoming data
> > > so that it closes up holes and allows merging the extents/blobs
> > > (sorry, I forget the BlueStore internal terms)? If not, is this a
> > > feasible optimization to try to apply at some point?
> > > That way we could get an on-disk layout pattern more like
> > > AAAA
> > > A[A]AA...B
> > > ACAA...[B]
> > > A[C]AA...D
> > >
> > > I don't know what the full value of something like this would actually
> > > be, but I was in some discussion recently where it came up that RBD
> > > causes much larger RocksDB usage than RGW does, thanks to the
> > > fragmented layouts it provokes. Cutting that down might be very good
> > > for our long-term performance?
> > > -Greg

[1] I think it used to control the CRC size but now it doesn't (?)
[2] https://raw.githubusercontent.com/dillaman/public/master/bluestore/high%20latency%20IO.png
(stupid allocator-induced seeks)

--
Jason
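[Editor's note: the extent-count effect Greg points at can be shown with a toy model. This is hypothetical Python, not BlueStore's real extent map: each overwrite placed at a fresh location splits the object's logical-to-physical mapping into more extents, while reusing the hole from the previous overwrite merges it back together.]

```python
# Toy model of an object's logical->physical extent map under repeated
# overwrites of one block. Hypothetical, not BlueStore code -- it just
# illustrates why filling the hole left by the previous overwrite keeps
# the extent map (and the RocksDB metadata that stores it) smaller.

def count_extents(layout):
    """Count runs of physically contiguous blocks in the mapping."""
    extents = 1
    for prev, cur in zip(layout, layout[1:]):
        if cur != prev + 1:
            extents += 1
    return extents

def overwrite_append(layout, block, next_free):
    """Place the new data at a fresh physical location (the current
    behavior, leaving a deallocated [hole] behind)."""
    layout = list(layout)
    layout[block] = next_free
    return layout, next_free + 1

def overwrite_reuse_hole(layout, block, hole):
    """Place the new data into a previously freed hole (the proposed
    optimization), re-merging the object into fewer extents."""
    layout = list(layout)
    layout[block] = hole
    return layout

layout = [0, 1, 2, 3]                     # AAAA, fully contiguous
frag, _ = overwrite_append(layout, 1, 4)  # A[A]AA...B
print(count_extents(frag))                # 3 extents
merged = overwrite_reuse_hole(frag, 1, 1) # ACAA...[B]
print(count_extents(merged))              # 1 extent again
```

In the model, the fresh-placement overwrite leaves the object in three extents, while placing the next overwrite into the old hole collapses it back to one, which is the RocksDB-metadata saving Greg describes.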