Hi All!
Preface:
Recently we received a customer report about a discrepancy between the
USED and RAW USED columns in the 'ceph df' report for a specific pool:
RAW USED was approx. 100% higher than expected. The pool in question
uses EC 6+3 and keeps RBD images.
Other relevant cluster/OSD information: Luminous v12.2.12, BlueStore,
HDDs as main devices, EC stripe width = 24K.
Preliminary investigation showed a significant difference between the
bluestore_stored and bluestore_allocated performance counters at all
involved OSDs, with much the same ~100% increase ratio. According to
the perf counters, the majority of writes are 'small' ones.
Hence allocation overhead caused by small/fragmented writes was named
as the intermediate cause.
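For reference, here is one way to read those counters via the OSD admin
socket (osd.0 is just an example id, run this on the OSD host):

  # compare logically stored vs. physically allocated bytes on one OSD
  ceph daemon osd.0 perf dump | grep -E '"bluestore_(stored|allocated)"'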
But why does it happen?
Now I'd like to share a deeper analysis of what happens to objects in
BlueStore when RBD writes to the above-mentioned EC pool.
Let me narrow the scope to a single 64K RBD data object at a single
BlueStore instance. With a 24K stripe width over k=6 data chunks, each
stripe puts 4K on every shard, so this object holds one EC shard of a
384K RBD data span (16 stripes * 24K, i.e. 16 * 4K = 64K per shard).
Initially let's do a 384K write to RBD using 'dd if=./tmp of=/dev/rbd1
count=96 bs=4096 seek=0'.
At BlueStore this dd write results in a single 64K write (append) which
lands as a single blob containing a single 64K pextent (allocation).
Then do a second 360K write to the RBD image at a 24K offset, which in
fact results in a 0x1000~f000 write to the same object. In an ideal
world this would reuse the existing blob: the data would be merged (via
some BlueStore magic) and would take a single 64K pextent again. But in
reality this doesn't happen, and both a new blob and a new 64K
allocation are made.
As a result one has 64K stored in BlueStore and 128K allocated.
A third write at a 48K offset with 336K of data results in a third
blob/allocation and 192K allocated for the same 64K of stored data.
The same behavior repeats while the target dd offset stays below 384K,
resulting in up to 16x space overhead.
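For those who'd like to reproduce this, here is a minimal sketch of the
whole sequence, assuming /dev/rbd1 maps an image in the EC 6+3 pool and
./tmp holds at least 384K of data:

  # 16 partially overlapping writes: offsets step by 24K (6 x 4K blocks),
  # each write covering the rest of the 384K span
  for seek in $(seq 0 6 90); do
    dd if=./tmp of=/dev/rbd1 bs=4096 seek=$seek count=$((96 - seek))
    sync   # keep the writes distinct; oflag=direct should work too
  done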
Here is a log snippet for the handling of one of the intermediate write
requests in this sequence:
==============================================
2020-01-22 23:41:57.808471 7f6190c55700 1 -- 10.100.2.124:6802/38298
<== osd.6 10.100.2.124:6826/39718 125 ==== MOSDECSubOpRead(3.24s1 32/26
ECSubRead(tid=30,
to_read={3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head=36864,4096,0},
attrs_to_read=)) v3 ==== 192+0+0 (1980749420 0 0) 0x55db8187af00 con
0x55db817c7000
2020-01-22 23:41:57.808669 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000
2020-01-22 23:41:57.808707 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read 0x9000~1000 size
0x10000 (65536)
2020-01-22 23:41:57.808712 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read defaulting to
buffered read
2020-01-22 23:41:57.808723 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read blob
Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused crc32c/0x1000
unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid
0x0)) need 0x9000~1000 cache has 0x[9000~1000]
2020-01-22 23:41:57.808745 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 = 4096
2020-01-22 23:41:57.808766 7f6179f6a700 1 -- 10.100.2.124:6802/38298
--> 10.100.2.124:6826/39718 -- MOSDECSubOpReadReply(3.24s0 32/26
ECSubReadReply(tid=30, attrs_read=0)) v2 -- 0x55db821e8580 con 0
2020-01-22 23:41:57.810251 7f6190c55700 1 -- 10.100.2.124:6802/38298
<== osd.6 10.100.2.124:6826/39718 126 ==== MOSDECSubOpWrite(3.24s1 32/26
ECSubWrite(tid=29, reqid=client.4215.0:113, at_version=32'11,
trim_to=0'0, roll_forward_to=32'10)) v2 ==== 6671+0+0 (3945069128 0 0)
0x55db81d45800 con 0x55db817c7000
2020-01-22 23:41:57.810524 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) queue_transactions existing
0x55db81bc7dc0 osr(3.24s1 0x55db81b30800)
2020-01-22 23:41:57.810540 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _txc_create osr
0x55db81bc7dc0 = 0x55db821dd200 seq 28
2020-01-22 23:41:57.810563 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys
2020-01-22 23:41:57.810583 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys = 0
2020-01-22 23:41:57.810591 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
object_size 700416 write_size 700416 flags -
2020-01-22 23:41:57.810598 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
object_size 700416 write_size 700416 flags - = 0
2020-01-22 23:41:57.810610 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) get_onode oid
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b key
0x81800000000000000325900c85217262'd_data.4.10716b8b4567.0000000000000000!='0xfffffffffffffffe000000000000000b'o'
2020-01-22 23:41:57.810678 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) r -2 v.len 0
2020-01-22 23:41:57.810698 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
2020-01-22 23:41:57.810704 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _assign_nid 1193
2020-01-22 23:41:57.810706 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b = 0
2020-01-22 23:41:57.810713 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
0x9000~1000 to offset 0x9000
2020-01-22 23:41:57.810721 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b 0x9000~1000
2020-01-22 23:41:57.810728 7f6179f6a700 20
bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810732 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_zero extending size to 40960
2020-01-22 23:41:57.810734 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
0x9000~1000 = 0
2020-01-22 23:41:57.810740 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
0x9000~1000 -> 0x9000~1000
2020-01-22 23:41:57.810750 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range src
0x8000~8000: 0x8000~8000 Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
SharedBlob(0x55db82050fc0 sbid 0x0))
2020-01-22 23:41:57.810761 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _assign_blobid 10250
2020-01-22 23:41:57.810764 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) make_blob_shared Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid 0x0))
2020-01-22 23:41:57.810773 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) make_blob_shared now Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid
0x280a ref_map(0x540000~10000=1))))
2020-01-22 23:41:57.810787 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range new
Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xff) use_tracker(0x0 0x0)
SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810795 7f6179f6a700 20
bluestore.blob(0x55db82051030) get_ref 0x9000~1000 Blob(0x55db82051030
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
use_tracker(0x0 0x0) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810802 7f6179f6a700 20
bluestore.blob(0x55db82051030) get_ref init 0x10000, 10000
2020-01-22 23:41:57.810805 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range dst
0x9000~1000: 0x9000~1000 Blob(0x55db82051030 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x1000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810813 7f6179f6a700 20
bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810817 7f6179f6a700 20
bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810819 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
0x9000~1000 to offset 0x9000 = 0
2020-01-22 23:41:57.810829 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000
2020-01-22 23:41:57.810838 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 - have 0x10000 (65536) bytes fadvise_flags 0x0
2020-01-22 23:41:57.810844 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _choose_write_options prefer
csum_order 12 target_blob_size 0x80000 compress=0 buffered=0
2020-01-22 23:41:57.810849 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small 0x9000~1000
2020-01-22 23:41:57.810853 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db820511f0 blob([0x530000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x7f) use_tracker(0x10000 0x1000)
SharedBlob(0x55db820510a0 loaded (sbid 0x2809
ref_map(0x530000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810861 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
immutable Blob(0x55db820511f0 blob([0x530000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0x7f) use_tracker(0x10000
0x1000) SharedBlob(0x55db820510a0 loaded (sbid 0x2809
ref_map(0x530000~10000=1))))
2020-01-22 23:41:57.810868 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db820512d0 blob([0x520000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x3f) use_tracker(0x10000 0x1000)
SharedBlob(0x55db82051180 loaded (sbid 0x2808
ref_map(0x520000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810876 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2)))) bstart 0x0
2020-01-22 23:41:57.810883 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
immutable Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810888 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db82050f50 blob([0x510000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x1f) use_tracker(0x10000 0x1000)
SharedBlob(0x55db82051260 loaded (sbid 0x2807
ref_map(0x510000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810894 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db820513b0 blob([0x500000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xf) use_tracker(0x10000 0x1000)
SharedBlob(0x55db82051340 loaded (sbid 0x2806
ref_map(0x500000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810900 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db81e38850 blob([0x4f0000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x7) use_tracker(0x10000 0x1000)
SharedBlob(0x55db81db2690 loaded (sbid 0x2805
ref_map(0x4f0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810951 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db81eadc70 blob([0x4e0000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0x3) use_tracker(0x10000 0x1000)
SharedBlob(0x55db81db3500 loaded (sbid 0x2804
ref_map(0x4e0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810958 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering
Blob(0x55db81eadb90 blob([0x4d0000~10000] csum+shared crc32c/0x1000)
use_tracker(0x10000 0x2000) SharedBlob(0x55db81eadc00 loaded (sbid
0x2803 ref_map(0x4d0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810967 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _pad_zeros pad 0x0 + 0x0 on
front/back, now 0x9000~1000
2020-01-22 23:41:57.810974 7f6179f6a700 20
bluestore.blob(0x55db82051110) put_ref 0x9000~1000 Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid
0x280a ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810987 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write txc
0x55db821dd200 1 blobs
2020-01-22 23:41:57.810992 7f6179f6a700 10 stupidalloc 0x0x55db81107c80
allocate_int want_size 0x10000 alloc_unit 0x10000 hint 0x0
2020-01-22 23:41:57.811003 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write prealloc
[0x550000~10000]
2020-01-22 23:41:57.811006 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write forcing
csum_order to block_size_order 12
2020-01-22 23:41:57.811009 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write initialize
csum setting for new blob Blob(0x55db81eadb20 blob([]) use_tracker(0x0
0x0) SharedBlob(0x55db81db3650 sbid 0x0)) csum_type crc32c csum_order 12
csum_length 0x10000
2020-01-22 23:41:57.811019 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write blob
Blob(0x55db81eadb20 blob([0x550000~10000] csum crc32c/0x1000)
use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811029 7f6179f6a700 20
bluestore.blob(0x55db81eadb20) get_ref 0x9000~1000 Blob(0x55db81eadb20
blob([0x550000~10000] csum+has_unused crc32c/0x1000 unused=0xfdff)
use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811037 7f6179f6a700 20
bluestore.blob(0x55db81eadb20) get_ref init 0x10000, 10000
2020-01-22 23:41:57.811042 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write lex
0x9000~1000: 0x9000~1000 Blob(0x55db81eadb20 blob([0x550000~10000]
csum+has_unused crc32c/0x1000 unused=0xfdff) use_tracker(0x10000 0x1000)
SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811050 7f6179f6a700 20
bluestore.BufferSpace(0x55db81db3668 in 0x55db812fa2a0) _discard 0x9000~1000
2020-01-22 23:41:57.811056 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write deferring
small 0x1000 write via deferred
2020-01-22 23:41:57.811063 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _wctx_finish lex_old
0x9000~1000: 0x9000~1000 Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x7000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.811074 7f6179f6a700 20
bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
2020-01-22 23:41:57.811077 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 = 0
==============================================
One can see that the _do_write_small() function considers several blobs
for reuse but is unable to use any of them (presumably because they are
shared and/or the relevant unused bits are cleared) and ends up making a
new allocation.
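One can collect the same evidence on one's own OSDs by raising BlueStore
logging and counting the reuse rejections (osd id and log path are just
examples, adjust to your deployment):

  ceph daemon osd.0 config set debug_bluestore 20
  # ... apply some write load, then:
  grep -c 'ignoring immutable' /var/log/ceph/ceph-osd.0.log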
Actually the above write pattern is the simplest (though quite
artificial) scenario that exposes the issue. Generally speaking, the
same behavior is observed whenever we overwrite some data in an RBD
image and the overwrite maps to a previously used object. If the written
data partially overlaps an existing blob (including its unused part) and
that blob is prohibited from reuse (it's shared, which seems to be the
case for EC overwrites, or the relevant unused bits are cleared, i.e. it
has already been written at certain positions), BlueStore allocates a
new blob and preserves the previous ones (remember, we do a partial
overwrite). To some degree this resembles the behavior with compressed
blobs, where a stack of partially overlapping blobs might appear until
garbage collection cleans them up.
So, e.g., a full RBD image prefill followed by random small overwrites
will most probably result in some space overhead - up to
min_alloc_size / write_size = 64K / 4K = 16x in the worst (certainly
very rare) case.
Additional notes:
- This issue isn't present in master with the new
bluestore_min_alloc_size defaults (=4K); see the config sketch after
this list.
- In Nautilus (and Octopus with bluestore_min_alloc_size_hdd set back to
64K) this behavior is less visible due to the blob garbage collection we
introduced - see https://github.com/ceph/ceph/pull/30144
But an up-to-3x increase ratio is still observable.
- The issue isn't observed for replicated pools.
- Shared blobs created during EC overwrites seem to lack a rollback to
the non-shared state after op completion (and snapshot removal). Hence
most probably they pollute onodes and the DB (remember their persistence
mechanics) and negatively impact performance. This needs more
investigation/verification though.
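For completeness, a quick sketch of how to inspect/adjust the allocation
unit. Keep in mind that min_alloc_size is persisted at OSD mkfs time, so
changing the option affects newly deployed OSDs only:

  # what the runtime option is set to (the on-disk value is fixed at mkfs)
  ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
  # ceph.conf, [osd] section, applied to OSDs created afterwards:
  #   bluestore_min_alloc_size_hdd = 4096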
The above analysis has two goals:
1) Show a potential origin of space overhead for pre-Nautilus clusters.
2) Show the hidden danger of using allocation sizes higher than 4K
(i.e. the device block size?) for EC pools. Our research shows, however,
that a 4K alloc size is less efficient for spinner-backed pools;
https://github.com/ceph/ceph/pull/31867 suggests a 'partial' rollback in
this respect, at least for the default setup.
Thanks,
Igor