bluestore: extend/reuse blob on write rather than create a new one

Hi All,

Let me share some thoughts on another approach that might help reduce the BlueStore Onode size (both in-memory and serialized).

First, a brief example showing how the current write algorithm behaves:
- min_alloc_size = 0x10000

1) write 0x0~10000
At this step we have the following Onode/extent/blob structure:
  Extent1 0x0~10000 -> Blob1(size=0x10000) -> 1 * 0x10000 pextent
2) Append: write 0x10000~20000
Onode/extent/blob structure:
  Extent1 0x0~10000 -> Blob1(size=0x10000) -> 1 * 0x10000 pextent
  Extent2 0x10000~20000 -> Blob2(size=0x20000) -> 2 * 0x10000 pextents

3) Partially overwrite Extent2: write 0x10000~10000
Onode/extent/blob structure:
  Extent1 0x0~10000 -> Blob1(size=0x10000) -> 1 * 0x10000 pextent
  Extent2 0x10000~10000 -> Blob3(size=0x10000) -> 1 * 0x10000 pextent
  Extent3 0x20000~10000 -> Blob2(size=0x10000) -> 1 * 0x10000 pextent
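
The three steps above can be replayed with a small toy model. This is only a sketch of the behavior described in this mail, not BlueStore code: the names ToyOnode, Blob and Extent are made up for illustration, and it assumes every write is min_alloc_size aligned and always gets a brand new blob.

```python
from dataclasses import dataclass

MIN_ALLOC_SIZE = 0x10000  # same value as in the example above


@dataclass
class Blob:
    name: str
    pextents: int  # number of MIN_ALLOC_SIZE-sized physical extents still in use


@dataclass
class Extent:
    offset: int  # logical offset within the object
    length: int
    blob: Blob


class ToyOnode:
    """Hypothetical model: each aligned write allocates a new blob."""

    def __init__(self):
        self.extents = []
        self._next_blob = 1

    def write(self, offset, length):
        assert offset % MIN_ALLOC_SIZE == 0 and length % MIN_ALLOC_SIZE == 0
        end = offset + length
        kept = []
        for e in self.extents:
            e_end = e.offset + e.length
            if e_end <= offset or e.offset >= end:
                kept.append(e)  # not touched by this write
                continue
            # Release the overwritten part of the old blob.
            overlap = min(e_end, end) - max(e.offset, offset)
            e.blob.pextents -= overlap // MIN_ALLOC_SIZE
            if e.offset < offset:  # surviving head of the old extent
                kept.append(Extent(e.offset, offset - e.offset, e.blob))
            if e_end > end:  # surviving tail of the old extent
                kept.append(Extent(end, e_end - end, e.blob))
        blob = Blob(f"Blob{self._next_blob}", length // MIN_ALLOC_SIZE)
        self._next_blob += 1
        kept.append(Extent(offset, length, blob))
        self.extents = sorted(kept, key=lambda e: e.offset)

    def layout(self):
        return [(hex(e.offset), hex(e.length), e.blob.name, e.blob.pextents)
                for e in self.extents]


onode = ToyOnode()
onode.write(0x0, 0x10000)      # step 1
onode.write(0x10000, 0x20000)  # step 2: append
onode.write(0x10000, 0x10000)  # step 3: partial overwrite of Extent2
for row in onode.layout():
    print(row)
```

The model ends with three extents and three blobs for a contiguous 0x0~30000 object, which is the growth the example is meant to show.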

So one can see that the Onode structure has grown at step 3 even though the object content layout is unchanged: 0x0~30000. Moreover, at step 2 we could have avoided creating Extent2 and extended Extent1/Blob1 instead.

Hence my suggestion is to consider an approach that reuses existing extents/blobs whenever possible (i.e. for big aligned writes/appends) and updates only the blob internals (the pextent and csum vectors). As a result the Onode structure becomes more static, and the Blob's pextent vector grows rather than the ExtentMap. Of course, some limit on Blob growth would have to be introduced.
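
A rough sketch of the suggested behavior, under loud assumptions: ReusingOnode and MAX_BLOB_SIZE are hypothetical names (the growth limit is not an existing BlueStore option), pextents are modeled as plain allocation ids, and writes that straddle several extents are deliberately left out.

```python
MIN_ALLOC_SIZE = 0x10000
MAX_BLOB_SIZE = 0x80000  # hypothetical growth limit, not a real BlueStore option


class ReusingOnode:
    """Hypothetical model: aligned appends/overwrites reuse the existing blob."""

    def __init__(self):
        # Each entry is [logical_offset, pextents]; pextents holds one
        # allocation id per MIN_ALLOC_SIZE unit of the blob.
        self.extents = []
        self._next_alloc = 0

    def _alloc(self, units):
        ids = list(range(self._next_alloc + 1, self._next_alloc + 1 + units))
        self._next_alloc += units
        return ids

    @staticmethod
    def _end(ext):
        return ext[0] + len(ext[1]) * MIN_ALLOC_SIZE

    def write(self, offset, length):
        assert length > 0
        assert offset % MIN_ALLOC_SIZE == 0 and length % MIN_ALLOC_SIZE == 0
        end = offset + length
        units = length // MIN_ALLOC_SIZE
        hit = [e for e in self.extents if e[0] < end and offset < self._end(e)]
        if len(hit) == 1 and hit[0][0] <= offset and end <= self._end(hit[0]):
            # Aligned overwrite inside one blob: swap pextents in place,
            # the extent map itself does not change.
            first = (offset - hit[0][0]) // MIN_ALLOC_SIZE
            hit[0][1][first:first + units] = self._alloc(units)
            return
        if hit:
            raise NotImplementedError("straddling writes are out of scope here")
        for e in self.extents:
            size = self._end(e) - e[0]
            if self._end(e) == offset and size + length <= MAX_BLOB_SIZE:
                # Append: grow the blob's pextent vector instead of adding an extent.
                e[1].extend(self._alloc(units))
                return
        # No reusable blob (or the growth limit was hit): create a new extent.
        self.extents.append([offset, self._alloc(units)])
        self.extents.sort(key=lambda e: e[0])

    def layout(self):
        return [(hex(e[0]), hex(len(e[1]) * MIN_ALLOC_SIZE), len(e[1]))
                for e in self.extents]


onode = ReusingOnode()
onode.write(0x0, 0x10000)
onode.write(0x10000, 0x20000)  # append: extends the first blob
onode.write(0x10000, 0x10000)  # overwrite: replaces one pextent in place
print(onode.layout())
```

With the same three writes as in the example, this model keeps a single extent covering 0x0~30000 backed by three pextents; only the pextent vector changed.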



IMHO, since this reduces Onode metadata, the approach might improve BlueStore performance to a degree comparable to what we achieved in the cases where min_alloc_size was increased. AFAIR Mark did some tests with a 16K min_alloc_size and they showed better results. But this way we wouldn't suffer the storage space loss caused by coarser allocation granularity. The in-memory Onode representation shrinks too, so BlueStore caching becomes more effective.

The cost of this change is the need for more dynamic behavior in the Blob's pextent and csum containers: currently their sizes are 'semi-dynamic' (the entry count can sometimes decrease, but it can never increase).
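
To make the container requirement concrete, here is a minimal sketch of a blob whose pextent and csum vectors grow together on append and are patched in place on overwrite. Everything in it is an assumption for illustration: GrowableBlob is a made-up name, CSUM_CHUNK is an arbitrary checksum granularity, disk offsets are invented, and zlib.crc32 is only a stand-in for whatever checksum the blob actually uses.

```python
import zlib

MIN_ALLOC_SIZE = 0x10000
CSUM_CHUNK = 0x1000  # assumed checksum granularity for this sketch


class GrowableBlob:
    """Hypothetical blob with fully dynamic pextent and csum vectors."""

    def __init__(self):
        self.pextents = []  # one disk offset per MIN_ALLOC_SIZE unit
        self.csums = []     # one checksum per CSUM_CHUNK of blob data

    @staticmethod
    def _csums(data):
        return [zlib.crc32(data[i:i + CSUM_CHUNK])
                for i in range(0, len(data), CSUM_CHUNK)]

    @staticmethod
    def _pextents(disk_offset, length):
        return [disk_offset + i for i in range(0, length, MIN_ALLOC_SIZE)]

    def append(self, data, disk_offset):
        # Both vectors gain entries: the "increase" case that is not possible today.
        assert len(data) % MIN_ALLOC_SIZE == 0
        self.pextents.extend(self._pextents(disk_offset, len(data)))
        self.csums.extend(self._csums(data))

    def overwrite(self, blob_offset, data, disk_offset):
        # Entry counts stay the same; only the affected slots are replaced.
        assert blob_offset % MIN_ALLOC_SIZE == 0 and len(data) % MIN_ALLOC_SIZE == 0
        assert blob_offset + len(data) <= len(self.pextents) * MIN_ALLOC_SIZE
        p = blob_offset // MIN_ALLOC_SIZE
        c = blob_offset // CSUM_CHUNK
        new_p = self._pextents(disk_offset, len(data))
        new_c = self._csums(data)
        self.pextents[p:p + len(new_p)] = new_p
        self.csums[c:c + len(new_c)] = new_c


blob = GrowableBlob()
blob.append(bytes(0x10000), 0x100000)             # initial write
blob.append(b"\x01" * 0x20000, 0x200000)          # append grows both vectors
blob.overwrite(0x10000, b"\x02" * 0x10000, 0x400000)  # in-place update
print(len(blob.pextents), len(blob.csums))
```

The point of the sketch is only that both vectors have to support growth and in-place replacement in lockstep; how that maps onto the real containers is an open question.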

Any comments/thoughts?

Thanks,
Igor