Hi Maged,
actually I expect no difference between various EC profiles in this
behavior.
Just verified EC42 against master branch:
Initial df report:
POOL ID STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP)
%USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
ec42 3 16 KiB 16 KiB 0 B 1 384 KiB 384 KiB 0
B 0 392 GiB N/A N/A 1 0 B 0 B
executed commands:
dd if=./tmp of=/dev/rbd1 count=64 bs=4096 seek=0
dd if=./tmp of=/dev/rbd1 count=60 bs=4096 seek=4
dd if=./tmp of=/dev/rbd1 count=56 bs=4096 seek=8
Final df report:
POOL ID STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP)
%USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
ec42 3 272 KiB 272 KiB 0 B 3 1.5 MiB 1.5 MiB 0
B 0 392 GiB N/A N/A 3 0 B 0 B
So the same significant ( (1.5 MIB - 384KiB) / 272 KiB = ~4.2x )
overhead as in my original report.
Thanks,
Igor
On 1/23/2020 10:34 AM, Maged Mokhtar wrote:
On 23/01/2020 01:20, Igor Fedotov wrote:
Hi All!
Preface:
Recently we've got a customer report about a discrepancy between USED
and RAW USED columns in df report for a specific pool.
Approx. 100% higher volume was reported for RAW USED. Pool in
question has EC 6+3 and keeps RBD images.
Other relevant cluster/OSD information: Luminous v12.2.12, BlueStore,
HDDs as main devices, EC stripe width = 24K.
Preliminary investigation showed significant difference between
bluestore_stored and bluestore_allocated performance counters at all
involved OSDs
with pretty the same 100% increase ratio. According to perf counters
majority of writes are 'small' ones.
Hence allocation overhead caused by small/fragmented writes has been
named as an intermediate cause.
But why it happens?
Now I'd like to share deeper analysis on what happens to objects at
BlueStore when RBD performs writes to above-mentioned EC pool.
And let me narrow the scope to a single 64K RBD data object at single
BlueStore instance which encompasses one of EC shard for 384K (16 *
24K => 16 * 4K) RBD data span.
Initially let's do a 384K write to RBD using 'dd if=./tmp
of=/dev/rbd1 count=96 bs=4096 seek=0'
At Bluestore this dd write results in a single 64K write(append)
which lands as a single blob containing single 64K pextent(allocation).
Then do second 360K write to RBD image at 24K offset which in fact
results in 0x1000~f000 write to the same object . In ideal world this
should
reuse existing blob and data would be merged (via some bluestore
magic) and would take single 64K pextent again. But in reality this
doesn't happen and both new blob and 64K allocation are made.
As a result one has 64K stored to BlueStore and 128K allocated.
Third write at 48K offset and 322K of data results in third
blob/allocation and 196K of allocated data for the same 64K of stored
one.
The same behavior lasts while target dd offset is below 384K
resulting in up to 16x space overhead.
Here is the log snippet for one of the intermediate write req
handling in this sequence:
==============================================
2020-01-22 23:41:57.808471 7f6190c55700 1 -- 10.100.2.124:6802/38298
<== osd.6 10.100.2.124:6826/39718 125 ==== MOSDECSubOpRead(3.24s1
32/26 ECSubRead(tid=30,
to_read={3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head=36864,4096,0},
attrs_to_read=)) v3 ==== 192+0+0 (1980749420 0 0) 0x55db8187af00 con
0x55db817c7000
2020-01-22 23:41:57.808669 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000
2020-01-22 23:41:57.808707 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read 0x9000~1000 size
0x10000 (65536)
2020-01-22 23:41:57.808712 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read defaulting to
buffered read
2020-01-22 23:41:57.808723 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_read blob
Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused
crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
SharedBlob(0x55db82050fc0 sbid 0x0)) need 0x9000~1000 cache has
0x[9000~1000]
2020-01-22 23:41:57.808745 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 = 4096
2020-01-22 23:41:57.808766 7f6179f6a700 1 -- 10.100.2.124:6802/38298
--> 10.100.2.124:6826/39718 -- MOSDECSubOpReadReply(3.24s0 32/26
ECSubReadReply(tid=30, attrs_read=0)) v2 -- 0x55db821e8580 con 0
2020-01-22 23:41:57.810251 7f6190c55700 1 -- 10.100.2.124:6802/38298
<== osd.6 10.100.2.124:6826/39718 126 ==== MOSDECSubOpWrite(3.24s1
32/26 ECSubWrite(tid=29, reqid=client.4215.0:113, at_version=32'11,
trim_to=0'0, roll_forward_to=32'10)) v2 ==== 6671+0+0 (3945069128 0
0) 0x55db81d45800 con 0x55db817c7000
2020-01-22 23:41:57.810524 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) queue_transactions
existing 0x55db81bc7dc0 osr(3.24s1 0x55db81b30800)
2020-01-22 23:41:57.810540 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _txc_create osr
0x55db81bc7dc0 = 0x55db821dd200 seq 28
2020-01-22 23:41:57.810563 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys
2020-01-22 23:41:57.810583 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys = 0
2020-01-22 23:41:57.810591 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint
3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
object_size 700416 write_size 700416 flags -
2020-01-22 23:41:57.810598 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint
3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
object_size 700416 write_size 700416 flags - = 0
2020-01-22 23:41:57.810610 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) get_onode oid
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b key
0x81800000000000000325900c85217262'd_data.4.10716b8b4567.0000000000000000!='0xfffffffffffffffe000000000000000b'o'
2020-01-22 23:41:57.810678 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) r -2 v.len 0
2020-01-22 23:41:57.810698 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
2020-01-22 23:41:57.810704 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _assign_nid 1193
2020-01-22 23:41:57.810706 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b = 0
2020-01-22 23:41:57.810713 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
0x9000~1000 to offset 0x9000
2020-01-22 23:41:57.810721 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
0x9000~1000
2020-01-22 23:41:57.810728 7f6179f6a700 20
bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810732 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_zero extending size
to 40960
2020-01-22 23:41:57.810734 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
0x9000~1000 = 0
2020-01-22 23:41:57.810740 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range
3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b
0x9000~1000 -> 0x9000~1000
2020-01-22 23:41:57.810750 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range src
0x8000~8000: 0x8000~8000 Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x8000) SharedBlob(0x55db82050fc0 sbid 0x0))
2020-01-22 23:41:57.810761 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _assign_blobid 10250
2020-01-22 23:41:57.810764 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) make_blob_shared Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused crc32c/0x1000 unused=0xff)
use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid 0x0))
2020-01-22 23:41:57.810773 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head
0x55db81cc0a00) make_blob_shared now Blob(0x55db82051110
blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000
unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0
loaded (sbid 0x280a ref_map(0x540000~10000=1))))
2020-01-22 23:41:57.810787 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range new
Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xff) use_tracker(0x0 0x0)
SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810795 7f6179f6a700 20
bluestore.blob(0x55db82051030) get_ref 0x9000~1000
Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xff) use_tracker(0x0 0x0)
SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810802 7f6179f6a700 20
bluestore.blob(0x55db82051030) get_ref init 0x10000, 10000
2020-01-22 23:41:57.810805 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range dst
0x9000~1000: 0x9000~1000 Blob(0x55db82051030 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x1000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810813 7f6179f6a700 20
bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810817 7f6179f6a700 20
bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty
2020-01-22 23:41:57.810819 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# ->
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from
0x9000~1000 to offset 0x9000 = 0
2020-01-22 23:41:57.810829 7f6179f6a700 15
bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000
2020-01-22 23:41:57.810838 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 - have 0x10000 (65536) bytes fadvise_flags 0x0
2020-01-22 23:41:57.810844 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _choose_write_options
prefer csum_order 12 target_blob_size 0x80000 compress=0 buffered=0
2020-01-22 23:41:57.810849 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small 0x9000~1000
2020-01-22 23:41:57.810853 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
considering Blob(0x55db820511f0 blob([0x530000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0x7f) use_tracker(0x10000
0x1000) SharedBlob(0x55db820510a0 loaded (sbid 0x2809
ref_map(0x530000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810861 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
immutable Blob(0x55db820511f0 blob([0x530000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0x7f) use_tracker(0x10000
0x1000) SharedBlob(0x55db820510a0 loaded (sbid 0x2809
ref_map(0x530000~10000=1))))
2020-01-22 23:41:57.810868 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
considering Blob(0x55db820512d0 blob([0x520000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0x3f) use_tracker(0x10000
0x1000) SharedBlob(0x55db82051180 loaded (sbid 0x2808
ref_map(0x520000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810876 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
considering Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2)))) bstart 0x0
2020-01-22 23:41:57.810883 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring
immutable Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810888 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
considering Blob(0x55db82050f50 blob([0x510000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0x1f) use_tracker(0x10000
0x1000) SharedBlob(0x55db82051260 loaded (sbid 0x2807
ref_map(0x510000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810894 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
considering Blob(0x55db820513b0 blob([0x500000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xf) use_tracker(0x10000
0x1000) SharedBlob(0x55db82051340 loaded (sbid 0x2806
ref_map(0x500000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810900 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
considering Blob(0x55db81e38850 blob([0x4f0000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0x7) use_tracker(0x10000
0x1000) SharedBlob(0x55db81db2690 loaded (sbid 0x2805
ref_map(0x4f0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810951 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
considering Blob(0x55db81eadc70 blob([0x4e0000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0x3) use_tracker(0x10000
0x1000) SharedBlob(0x55db81db3500 loaded (sbid 0x2804
ref_map(0x4e0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810958 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_write_small
considering Blob(0x55db81eadb90 blob([0x4d0000~10000] csum+shared
crc32c/0x1000) use_tracker(0x10000 0x2000) SharedBlob(0x55db81eadc00
loaded (sbid 0x2803 ref_map(0x4d0000~10000=1)))) bstart 0x0
2020-01-22 23:41:57.810967 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _pad_zeros pad 0x0 + 0x0
on front/back, now 0x9000~1000
2020-01-22 23:41:57.810974 7f6179f6a700 20
bluestore.blob(0x55db82051110) put_ref 0x9000~1000
Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared
crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000)
SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.810987 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write txc
0x55db821dd200 1 blobs
2020-01-22 23:41:57.810992 7f6179f6a700 10 stupidalloc
0x0x55db81107c80 allocate_int want_size 0x10000 alloc_unit 0x10000
hint 0x0
2020-01-22 23:41:57.811003 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write prealloc
[0x550000~10000]
2020-01-22 23:41:57.811006 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write forcing
csum_order to block_size_order 12
2020-01-22 23:41:57.811009 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write
initialize csum setting for new blob Blob(0x55db81eadb20 blob([])
use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0)) csum_type
crc32c csum_order 12 csum_length 0x10000
2020-01-22 23:41:57.811019 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write blob
Blob(0x55db81eadb20 blob([0x550000~10000] csum crc32c/0x1000)
use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811029 7f6179f6a700 20
bluestore.blob(0x55db81eadb20) get_ref 0x9000~1000
Blob(0x55db81eadb20 blob([0x550000~10000] csum+has_unused
crc32c/0x1000 unused=0xfdff) use_tracker(0x0 0x0)
SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811037 7f6179f6a700 20
bluestore.blob(0x55db81eadb20) get_ref init 0x10000, 10000
2020-01-22 23:41:57.811042 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write lex
0x9000~1000: 0x9000~1000 Blob(0x55db81eadb20 blob([0x550000~10000]
csum+has_unused crc32c/0x1000 unused=0xfdff) use_tracker(0x10000
0x1000) SharedBlob(0x55db81db3650 sbid 0x0))
2020-01-22 23:41:57.811050 7f6179f6a700 20
bluestore.BufferSpace(0x55db81db3668 in 0x55db812fa2a0) _discard
0x9000~1000
2020-01-22 23:41:57.811056 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write deferring
small 0x1000 write via deferred
2020-01-22 23:41:57.811063 7f6179f6a700 20
bluestore(/home/if/luminous/build/dev/osd0) _wctx_finish lex_old
0x9000~1000: 0x9000~1000 Blob(0x55db82051110 blob([0x540000~10000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000
0x7000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a
ref_map(0x540000~10000=2))))
2020-01-22 23:41:57.811074 7f6179f6a700 20
bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty
2020-01-22 23:41:57.811077 7f6179f6a700 10
bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#
0x9000~1000 = 0==============================================
One can see that _do_write_small() function considers some blobs for
reuse but is unable to do that (presumably because they are shared
and/or relevant unused bits are cleared) and ends up with new
allocation.
Actually the above writing pattern is the simplest (but quite
artificial) scenario to present the issue. Generally speaking the
same behavior is observed when we overwrite some data at RBD image
and this maps to a previously used object. If written data partially
overlaps existing blob (including its unused part) and this blob is
prohibited for reuse (it's shared which seems to be the case for EC
overwrites, or relevant unused bits are cleared, i.e. it has already
been written at certain positions) BlueStore allocates new and
preserves previous ones (remember we do partial overwrite). To some
degree it reminds the behavior with compressed blobs where a stack of
partially overlapped blobs might appear until garbage collection
cleans this up.
So, e.g. full RBD image prefill and subsequent random small
overwrites will most probably result in some space overhead - up to
16x times in the worst (certainly very seldom) case.
Additional notes:
- This issue isn't present in master with new
bluestore_min_alloc_size defaults (=4K).
- In nautilus (and octopus with bluestore_min_alloc_size_hdd set back
to 64K) this behavior is less visible due to blob garbage collection
we introduced - see https://github.com/ceph/ceph/pull/30144
But up to 3x increase ratio is still observable though.
- The issue isn't observed for replicated pools.
- Shared blobs created during EC overwrite seems to lack a rollback
to non-shared state after op completion (and snapshot removal). Hence
most probably they pollute onodes and DB (remember their persistence
mechanics) and negatively impact the performance. Needs more
investigation/verification though.
The above analysis has two goals:
1) Show potential origin of space overhead for pre-Nautilus clusters.
2) Show the hidden danger of using allocation sizes higher than 4K
(i.e. device block size?) for EC pools. But our research shows that
4K alloc size is less efficient for spinner-backed pools.
https://github.com/ceph/ceph/pull/31867 suggests 'partial' rollback
in this respect. At least for default setup.
Thanks,
Igor
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
a while back with Luminous we tested different ec profiles for rbd :
2+1,2+2,3+2,4+2,5+2,5+3,6+2: we found 5+3 stood out with significant
higher overhead. stripe width and min alloc size were left at default.
Tests were 4k/4m rand/seq as well as file copy io with both hdd and ssd.
as per your tests, it seems over-writes at different overlapping
offsets would cause this overhead.,,so maybe the io tests we had done
just happen to cause less offset overlaps with all but 5+3, maybe but
i am not sure. it would be interesting if you can run the same test on
say 4+2 and see if you still get high alloc overhead as with 6+3.
/Maged
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx