On Fri, Feb 15, 2019 at 2:07 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>
> Hey Gregory,
>
> first of all please note that BlueStore doesn't allocate chunks less
> than min_alloc_size, which is 16K for SSD and 64K for HDD by default.

Oh, I thought it was 4KB for SSD. Maybe I confused it with the
prefer_deferred_size below.

> Depending on the incoming block size there are two different write
> procedures:
>
> 1) big writes (block size is aligned with min_alloc_size)
>
> 2) small writes (block size is less than min_alloc_size). May be
> performed via deferred write or directly, see below
>
> Large but unaligned blocks are split and pass through a combination
> of 1) and 2).
>
> 1) always triggers allocation of a new blob and writing to a
> different location.
>
> 2) If we're overwriting existing data then the block passes through
> the deferred write procedure, which prepares a 4K-aligned block (by
> merging incoming data, padding, and reading non-overlapped data),
> puts it into the DB, and then performs a disk block overwrite at
> exactly the same location. This way a sort of WAL is provided (do not
> confuse it with the one in RocksDB).
>
> If the write goes to an unused extent and block size <=
> bluestore_prefer_deferred_size (16K for HDD and 0 for SSD) then the
> deferred procedure is applied as well.
>
> Otherwise the block is written directly to disk, along with an
> allocation if needed.
>
> Hence in your case (4K overwrites) the scenario looks like:
>
> AAAA
>
> ABAA + B temporarily put into RocksDB
>
> ACAA + C temporarily put into RocksDB
>
> etc.
>
> And the proposed optimization makes no sense.
>
> Indeed one might observe higher DB load this way.
>
> Your original scenario:
>
> AAAA
> A[A]AA...B
> A[A]AA...[B]...C
> A[A]AA...[B]...[C]...D
>
> is rather about big (16K/64K) writes. Not sure if any optimization is
> required here either. Maybe except the case when we want data to be
> less fragmented (for HDD?). But I doubt this is feasible.

Well, there are two reasons to prefer defragmented data:
1) sequential read speeds are higher than random, even on SSD.
2) defragmented data means we have to store fewer extents in RocksDB.

I definitely don't know how expensive that is, though! Or indeed if we
actually get IO patterns that would trigger this. I was thinking
specifically of RBD, so if it's not going to trigger patterns like this
it's not worth worrying about. Jason, do we have any data on what IO
sizes tend to look like?
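To make sure I'm reading that decision logic correctly, here's a quick
sketch of how I understand it. This is purely illustrative Python, not
the actual BlueStore code; classify_write is a made-up helper and the
constants just stand in for the HDD defaults you quoted for
min_alloc_size and bluestore_prefer_deferred_size:

    # Illustrative sketch only -- not BlueStore code. It just models the
    # choice between "big" writes, deferred small writes, and direct
    # small writes as described in the mail above.

    MIN_ALLOC_SIZE = 64 * 1024        # HDD default quoted above (16K for SSD)
    PREFER_DEFERRED_SIZE = 16 * 1024  # HDD default quoted above (0 for SSD)

    def classify_write(offset, length, overwrites_existing):
        """Return which write path a request takes in this toy model."""
        aligned = offset % MIN_ALLOC_SIZE == 0 and length % MIN_ALLOC_SIZE == 0
        if aligned:
            # Big write: allocate a new blob and write to a new location.
            return "big write: new blob at a new location"
        if length > MIN_ALLOC_SIZE:
            # Large but unaligned: split into big + small parts.
            return "split into big + small writes"
        if overwrites_existing:
            # Small overwrite: merge/pad to 4K, stage in the DB, then
            # overwrite the same disk location (the WAL-like path).
            return "small write: deferred, overwrite in place"
        if length <= PREFER_DEFERRED_SIZE:
            # Small write to unused space, still small enough to defer.
            return "small write: deferred, to an unused extent"
        # Otherwise write directly to disk, allocating if needed.
        return "small write: direct"

    print(classify_write(4096, 4096, overwrites_existing=True))
    print(classify_write(0, 65536, overwrites_existing=False))

If that's right, then the 4K journal-header overwrite I described never
gets relocated at all, which matches your point that the proposed
optimization doesn't apply to that case.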
-Greg

> Hope this helps.
>
> Thanks,
>
> Igor
>
>
> On 2/15/2019 3:42 AM, Gregory Farnum wrote:
> > Hey Igor,
> > I don't know much about the BlueStore allocator pattern, so I don't
> > have a clear idea how difficult this is.
> > But I *believe* we have a common pattern in RBD that might be worth
> > optimizing for: the repeated-overwrites case. Often this would be some
> > kind of journal header — either for the FS stored on top, a database,
> > or whatever, that results in the same 4KB logical block getting
> > overwritten repeatedly.
> >
> > For instance, librbd might write out
> > AAAA
> > to an object, then do updates to the second block resulting in a logical
> > ABAA
> > ACAA
> > ADAA
> > etc.
> >
> > I think, from my very limited understanding and what I heard when I
> > asked this in standup, that right now the layout in BlueStore for this
> > will tend to be something like
> > AAAA
> > A[A]AA...B
> > A[A]AA...[B]...C
> > A[A]AA...[B]...[C]...D
> > where the brackets indicate a deallocated [hole]. I expect that to
> > happen (certainly for the first overwrite) as long as the incoming IO
> > is large enough to trigger an immediate write to disk and then an
> > update to the metadata, rather than stuffing the data in the WAL and
> > then doing a write-in-place.
> >
> > So I wonder: is there any optimization to try and place incoming data
> > so that it closes up holes and allows merging the extents/blobs
> > (sorry, I forget the BlueStore internal terms)? If not, is this a
> > feasible optimization to try and apply at some point?
> > That way we could get an on-disk layout pattern more like
> > AAAA
> > A[A]AA...B
> > ACAA...[B]
> > A[C]AA...D
> >
> > I don't know what the full value of something like this would actually
> > be, but I was in some discussion recently where it came up that RBD
> > causes much larger RocksDB usage than RGW does, thanks to the
> > fragmented layouts it provokes. Cutting that down might be very good
> > for our long-term performance?
> > -Greg
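P.S. For what it's worth, here's the kind of back-of-the-envelope model
I had in mind for the fragmentation / RocksDB-size concern. It's a toy
simulation, not based on the real extent-map code; count_extents, the
chunk counts and the random overwrite pattern are all made up. It just
counts how many contiguous runs an object's chunks occupy if every
min_alloc_size-granularity overwrite lands in freshly allocated space:

    # Toy model only -- not the real BlueStore extent map. It counts how
    # many physically contiguous runs ("extents") an object spans when
    # every chunk-sized overwrite goes to a new location and the old
    # chunk becomes a hole.

    import random

    CHUNK_COUNT = 64   # e.g. a 4 MB RBD object at 64K granularity

    def count_extents(layout):
        """Count runs of physically adjacent chunks in the layout."""
        extents = 1
        for prev, cur in zip(layout, layout[1:]):
            if cur != prev + 1:     # physical discontinuity
                extents += 1
        return extents

    # Start with a perfectly contiguous object: chunk i lives at block i.
    layout = list(range(CHUNK_COUNT))
    next_free = CHUNK_COUNT

    random.seed(0)
    for n in range(1, 17):
        chunk = random.randrange(CHUNK_COUNT)
        layout[chunk] = next_free   # the rewrite lands somewhere new
        next_free += 1
        print(f"after {n:2d} overwrites: {count_extents(layout)} extents")

In this toy model each scattered rewrite adds up to two extents that the
object metadata has to track, which is roughly the RocksDB growth I was
hand-waving about; whether real IO sizes make that matter is exactly the
question for Jason's data.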