Re: Root cause analysis for space overhead with erasure coded pools.

Hi Maged,

Actually, I expect no difference between the various EC profiles in this behavior.

I just verified this with EC 4+2 against the master branch:

Initial df report:

POOL          ID STORED  (DATA)  (OMAP) OBJECTS USED    (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR

ec42           3  16 KiB  16 KiB    0 B       1 384 KiB 384 KiB    0 B     0   392 GiB N/A           N/A             1 0 B         0 B


executed commands:

dd if=./tmp of=/dev/rbd1 count=64 bs=4096 seek=0

dd if=./tmp of=/dev/rbd1 count=60 bs=4096 seek=4

dd if=./tmp of=/dev/rbd1 count=56 bs=4096 seek=8


Final df report:

POOL          ID STORED  (DATA)  (OMAP) OBJECTS USED    (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR

ec42           3 272 KiB 272 KiB    0 B       3 1.5 MiB 1.5 MiB    0 B     0   392 GiB N/A           N/A             3 0 B         0 B

So this shows the same significant overhead ((1.5 MiB - 384 KiB) / 272 KiB = ~4.2x) as in my original report.
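The ratio above can be reproduced with a few lines of arithmetic. This is only a sketch: the figures are copied from the two df reports, the helper name overhead_ratio is made up for illustration, and the nominal 1.5x figure assumes that the pool name ec42 means k=4, m=2.

```python
# Values copied from the two df reports above, converted to KiB.
KIB_PER_MIB = 1024

used_initial_kib = 384
used_final_kib = 1.5 * KIB_PER_MIB  # 1.5 MiB
stored_final_kib = 272


def overhead_ratio(used_before, used_after, stored):
    """Growth in USED divided by STORED, as in the estimate above."""
    return (used_after - used_before) / stored


ratio = overhead_ratio(used_initial_kib, used_final_kib, stored_final_kib)

# Assumption: ec42 is a k=4, m=2 profile, so the nominal raw/stored ratio
# without any allocation overhead would be (k + m) / k.
k, m = 4, 2
nominal = (k + m) / k

print(f"observed {ratio:.2f}x vs nominal {nominal:.2f}x")
```

The gap between the two numbers is the part that cannot be explained by the EC profile itself.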


Thanks,

Igor


On 1/23/2020 10:34 AM, Maged Mokhtar wrote:

On 23/01/2020 01:20, Igor Fedotov wrote:
Hi All!

Preface:

Recently we received a customer report about a discrepancy between the USED and RAW USED columns in the df report for a specific pool.

Approximately 100% higher volume was reported for RAW USED. The pool in question uses EC 6+3 and holds RBD images.

Other relevant cluster/OSD information: Luminous v12.2.12, BlueStore, HDDs as main devices, EC stripe width = 24K.

Preliminary investigation showed a significant difference between the bluestore_stored and bluestore_allocated performance counters on all involved OSDs, with roughly the same 100% increase ratio. According to the perf counters, the majority of writes are 'small' ones.

Hence allocation overhead caused by small/fragmented writes was identified as an intermediate cause.


But why does it happen?

Now I'd like to share a deeper analysis of what happens to objects in BlueStore when RBD performs writes to the above-mentioned EC pool.

Let me narrow the scope to a single 64K object at a single BlueStore instance, which holds one of the EC shards for a 384K (16 * 24K => 16 * 4K) RBD data span.

Initially let's do a 384K write to RBD using 'dd if=./tmp of=/dev/rbd1 count=96 bs=4096 seek=0'

In BlueStore this dd write results in a single 64K write (append), which lands as a single blob containing a single 64K pextent (allocation).

Then do a second 360K write to the RBD image at 24K offset, which in fact results in a 0x1000~f000 write to the same object. In an ideal world this would reuse the existing blob: the data would be merged (via some BlueStore magic) and would again take a single 64K pextent. In reality this doesn't happen, and both a new blob and a new 64K allocation are made.

As a result one has 64K stored to BlueStore and 128K allocated.
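The offset arithmetic behind the 0x1000~f000 extent can be sketched as follows. The function name shard_extent is hypothetical, and the sketch assumes k=6 data shards, a 24K stripe width (hence 4K chunks per shard) and stripe-aligned writes only.

```python
def shard_extent(rbd_offset, length, k=6, stripe_width=24 * 1024):
    """Map a stripe-aligned RBD write to the (offset, length) seen by one shard."""
    chunk = stripe_width // k  # 4K per shard for a 24K stripe and k=6
    if rbd_offset % stripe_width or length % stripe_width:
        raise ValueError("this sketch handles stripe-aligned writes only")
    return (rbd_offset // stripe_width) * chunk, (length // stripe_width) * chunk


# First write: 384K at offset 0; second write: 360K at offset 24K.
for rbd_off, rbd_len in ((0, 384 * 1024), (24 * 1024, 360 * 1024)):
    off, length = shard_extent(rbd_off, rbd_len)
    print(f"0x{off:x}~{length:x}")
```

For the second write this prints 0x1000~f000, matching the extent quoted above.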

The third write, at 48K offset with 336K of data, results in a third blob/allocation and 192K allocated for the same 64K stored.

The same behavior continues while the target dd offset is below 384K, resulting in up to 16x space overhead.
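The growth described above can be modelled with a toy simulation. This is not BlueStore code. It assumes a 64K allocation unit, 4K shard chunks, and that every write gets a brand-new blob while all earlier blobs stay allocated because each of them still holds at least one chunk that no later write covers.

```python
ALLOC_UNIT = 64 * 1024   # assumed min_alloc_size for the HDD case
CHUNK = 4 * 1024         # per-shard chunk: 24K stripe width / 6 data shards
OBJECT_SIZE = 64 * 1024  # size of the shard object


def simulate(num_writes):
    """Return (stored, allocated) bytes after num_writes shrinking overwrites."""
    stored = 0
    allocated = 0
    for i in range(num_writes):
        offset = i * CHUNK              # each write starts one chunk later
        length = OBJECT_SIZE - offset   # and runs to the end of the object
        stored = max(stored, offset + length)
        # Assumption: no blob reuse, so every write allocates a new blob
        # rounded up to the allocation unit; older blobs are never freed.
        allocated += -(-length // ALLOC_UNIT) * ALLOC_UNIT
    return stored, allocated


for n in (1, 2, 3, 16):
    stored, allocated = simulate(n)
    print(f"{n:2d} writes: stored {stored // 1024}K, "
          f"allocated {allocated // 1024}K, ratio {allocated // stored}x")
```

In this model three writes leave 192K allocated for 64K stored, and sixteen writes reach the 16x worst case.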

Here is a log snippet showing the handling of one of the intermediate write requests in this sequence:

==============================================
2020-01-22 23:41:57.808471 7f6190c55700  1 -- 10.100.2.124:6802/38298 <== osd.6 10.100.2.124:6826/39718 125 ==== MOSDECSubOpRead(3.24s1 32/26 ECSubRead(tid=30, to_read={3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head=36864,4096,0}, attrs_to_read=)) v3 ==== 192+0+0 (1980749420 0 0) 0x55db8187af00 con 0x55db817c7000 2020-01-22 23:41:57.808669 7f6179f6a700 15 bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000 2020-01-22 23:41:57.808707 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_read 0x9000~1000 size 0x10000 (65536) 2020-01-22 23:41:57.808712 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_read defaulting to buffered read 2020-01-22 23:41:57.808723 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_read  blob Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid 0x0)) need 0x9000~1000 cache has 0x[9000~1000] 2020-01-22 23:41:57.808745 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0) read 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000 = 4096 2020-01-22 23:41:57.808766 7f6179f6a700  1 -- 10.100.2.124:6802/38298 --> 10.100.2.124:6826/39718 -- MOSDECSubOpReadReply(3.24s0 32/26 ECSubReadReply(tid=30, attrs_read=0)) v2 -- 0x55db821e8580 con 0 2020-01-22 23:41:57.810251 7f6190c55700  1 -- 10.100.2.124:6802/38298 <== osd.6 10.100.2.124:6826/39718 126 ==== MOSDECSubOpWrite(3.24s1 32/26 ECSubWrite(tid=29, reqid=client.4215.0:113, at_version=32'11, trim_to=0'0, roll_forward_to=32'10)) v2 ==== 6671+0+0 (3945069128 0 0) 0x55db81d45800 con 0x55db817c7000 2020-01-22 23:41:57.810524 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0) queue_transactions existing 0x55db81bc7dc0 osr(3.24s1 0x55db81b30800) 2020-01-22 23:41:57.810540 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) 
_txc_create osr 0x55db81bc7dc0 = 0x55db821dd200 seq 28 2020-01-22 23:41:57.810563 7f6179f6a700 15 bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys 2020-01-22 23:41:57.810583 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0) _setattrs 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 2 keys = 0 2020-01-22 23:41:57.810591 7f6179f6a700 15 bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# object_size 700416 write_size 700416 flags - 2020-01-22 23:41:57.810598 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0) _set_alloc_hint 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# object_size 700416 write_size 700416 flags - = 0 2020-01-22 23:41:57.810610 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head 0x55db81cc0a00) get_onode oid 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b key 0x81800000000000000325900c85217262'd_data.4.10716b8b4567.0000000000000000!='0xfffffffffffffffe000000000000000b'o' 2020-01-22 23:41:57.810678 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head 0x55db81cc0a00)  r -2 v.len 0 2020-01-22 23:41:57.810698 7f6179f6a700 15 bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b 2020-01-22 23:41:57.810704 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _assign_nid 1193 2020-01-22 23:41:57.810706 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0) _touch 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b = 0 2020-01-22 23:41:57.810713 7f6179f6a700 15 bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# -> 
1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from 0x9000~1000 to offset 0x9000 2020-01-22 23:41:57.810721 7f6179f6a700 15 bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b 0x9000~1000 2020-01-22 23:41:57.810728 7f6179f6a700 20 bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty 2020-01-22 23:41:57.810732 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_zero extending size to 40960 2020-01-22 23:41:57.810734 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0) _do_zero 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b 0x9000~1000 = 0 2020-01-22 23:41:57.810740 7f6179f6a700 15 bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# -> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b 0x9000~1000 ->  0x9000~1000 2020-01-22 23:41:57.810750 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range  src 0x8000~8000: 0x8000~8000 Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid 0x0)) 2020-01-22 23:41:57.810761 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _assign_blobid 10250 2020-01-22 23:41:57.810764 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head 0x55db81cc0a00) make_blob_shared Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 sbid 0x0)) 2020-01-22 23:41:57.810773 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0).collection(3.24s1_head 0x55db81cc0a00) make_blob_shared now Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=1)))) 
2020-01-22 23:41:57.810787 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range new Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x0 0x0) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2)))) 2020-01-22 23:41:57.810795 7f6179f6a700 20 bluestore.blob(0x55db82051030) get_ref 0x9000~1000 Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x0 0x0) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2)))) 2020-01-22 23:41:57.810802 7f6179f6a700 20 bluestore.blob(0x55db82051030) get_ref init 0x10000, 10000 2020-01-22 23:41:57.810805 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_clone_range  dst 0x9000~1000: 0x9000~1000 Blob(0x55db82051030 blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x1000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2)))) 2020-01-22 23:41:57.810813 7f6179f6a700 20 bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty 2020-01-22 23:41:57.810817 7f6179f6a700 20 bluestore.extentmap(0x55db821e8990) dirty_range mark inline shard dirty 2020-01-22 23:41:57.810819 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0) _clone_range 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# -> 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head#b from 0x9000~1000 to offset 0x9000 = 0 2020-01-22 23:41:57.810829 7f6179f6a700 15 bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000 2020-01-22 23:41:57.810838 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000 - have 0x10000 (65536) bytes fadvise_flags 0x0 2020-01-22 23:41:57.810844 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) 
_choose_write_options prefer csum_order 12 target_blob_size 0x80000 compress=0 buffered=0 2020-01-22 23:41:57.810849 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small 0x9000~1000 2020-01-22 23:41:57.810853 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering Blob(0x55db820511f0 blob([0x530000~10000] csum+has_unused+shared crc32c/0x1000 unused=0x7f) use_tracker(0x10000 0x1000) SharedBlob(0x55db820510a0 loaded (sbid 0x2809 ref_map(0x530000~10000=1)))) bstart 0x0 2020-01-22 23:41:57.810861 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring immutable Blob(0x55db820511f0 blob([0x530000~10000] csum+has_unused+shared crc32c/0x1000 unused=0x7f) use_tracker(0x10000 0x1000) SharedBlob(0x55db820510a0 loaded (sbid 0x2809 ref_map(0x530000~10000=1)))) 2020-01-22 23:41:57.810868 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering Blob(0x55db820512d0 blob([0x520000~10000] csum+has_unused+shared crc32c/0x1000 unused=0x3f) use_tracker(0x10000 0x1000) SharedBlob(0x55db82051180 loaded (sbid 0x2808 ref_map(0x520000~10000=1)))) bstart 0x0 2020-01-22 23:41:57.810876 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2)))) bstart 0x0 2020-01-22 23:41:57.810883 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small ignoring immutable Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2)))) 2020-01-22 23:41:57.810888 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering Blob(0x55db82050f50 blob([0x510000~10000] csum+has_unused+shared crc32c/0x1000 
unused=0x1f) use_tracker(0x10000 0x1000) SharedBlob(0x55db82051260 loaded (sbid 0x2807 ref_map(0x510000~10000=1)))) bstart 0x0 2020-01-22 23:41:57.810894 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering Blob(0x55db820513b0 blob([0x500000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xf) use_tracker(0x10000 0x1000) SharedBlob(0x55db82051340 loaded (sbid 0x2806 ref_map(0x500000~10000=1)))) bstart 0x0 2020-01-22 23:41:57.810900 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering Blob(0x55db81e38850 blob([0x4f0000~10000] csum+has_unused+shared crc32c/0x1000 unused=0x7) use_tracker(0x10000 0x1000) SharedBlob(0x55db81db2690 loaded (sbid 0x2805 ref_map(0x4f0000~10000=1)))) bstart 0x0 2020-01-22 23:41:57.810951 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering Blob(0x55db81eadc70 blob([0x4e0000~10000] csum+has_unused+shared crc32c/0x1000 unused=0x3) use_tracker(0x10000 0x1000) SharedBlob(0x55db81db3500 loaded (sbid 0x2804 ref_map(0x4e0000~10000=1)))) bstart 0x0 2020-01-22 23:41:57.810958 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_write_small considering Blob(0x55db81eadb90 blob([0x4d0000~10000] csum+shared crc32c/0x1000) use_tracker(0x10000 0x2000) SharedBlob(0x55db81eadc00 loaded (sbid 0x2803 ref_map(0x4d0000~10000=1)))) bstart 0x0 2020-01-22 23:41:57.810967 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _pad_zeros pad 0x0 + 0x0 on front/back, now 0x9000~1000 2020-01-22 23:41:57.810974 7f6179f6a700 20 bluestore.blob(0x55db82051110) put_ref 0x9000~1000 Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff) use_tracker(0x10000 0x8000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2)))) 2020-01-22 23:41:57.810987 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write txc 0x55db821dd200 1 blobs 2020-01-22 23:41:57.810992 7f6179f6a700 10 
stupidalloc 0x0x55db81107c80 allocate_int want_size 0x10000 alloc_unit 0x10000 hint 0x0 2020-01-22 23:41:57.811003 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write prealloc [0x550000~10000] 2020-01-22 23:41:57.811006 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write forcing csum_order to block_size_order 12 2020-01-22 23:41:57.811009 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write initialize csum setting for new blob Blob(0x55db81eadb20 blob([]) use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0)) csum_type crc32c csum_order 12 csum_length 0x10000 2020-01-22 23:41:57.811019 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write blob Blob(0x55db81eadb20 blob([0x550000~10000] csum crc32c/0x1000) use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0)) 2020-01-22 23:41:57.811029 7f6179f6a700 20 bluestore.blob(0x55db81eadb20) get_ref 0x9000~1000 Blob(0x55db81eadb20 blob([0x550000~10000] csum+has_unused crc32c/0x1000 unused=0xfdff) use_tracker(0x0 0x0) SharedBlob(0x55db81db3650 sbid 0x0)) 2020-01-22 23:41:57.811037 7f6179f6a700 20 bluestore.blob(0x55db81eadb20) get_ref init 0x10000, 10000 2020-01-22 23:41:57.811042 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write  lex 0x9000~1000: 0x9000~1000 Blob(0x55db81eadb20 blob([0x550000~10000] csum+has_unused crc32c/0x1000 unused=0xfdff) use_tracker(0x10000 0x1000) SharedBlob(0x55db81db3650 sbid 0x0)) 2020-01-22 23:41:57.811050 7f6179f6a700 20 bluestore.BufferSpace(0x55db81db3668 in 0x55db812fa2a0) _discard 0x9000~1000 2020-01-22 23:41:57.811056 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _do_alloc_write deferring small 0x1000 write via deferred 2020-01-22 23:41:57.811063 7f6179f6a700 20 bluestore(/home/if/luminous/build/dev/osd0) _wctx_finish lex_old 0x9000~1000: 0x9000~1000 Blob(0x55db82051110 blob([0x540000~10000] csum+has_unused+shared crc32c/0x1000 unused=0xff) 
use_tracker(0x10000 0x7000) SharedBlob(0x55db82050fc0 loaded (sbid 0x280a ref_map(0x540000~10000=2)))) 2020-01-22 23:41:57.811074 7f6179f6a700 20 bluestore.extentmap(0x55db81b2b1d0) dirty_range mark inline shard dirty 2020-01-22 23:41:57.811077 7f6179f6a700 10 bluestore(/home/if/luminous/build/dev/osd0) _write 3.24s1_head 1#3:25900c85:::rbd_data.4.10716b8b4567.0000000000000000:head# 0x9000~1000 = 0
==============================================

One can see that the _do_write_small() function considers some blobs for reuse but is unable to reuse any of them (presumably because they are shared and/or the relevant unused bits are cleared) and ends up making a new allocation.
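The unused= values in the log can be read as a 16-bit mask over the blob, one bit per 1/16 of its length, where a set bit means that slice was never written. The sketch below is an interpretation of the log output, not BlueStore's actual code. Both function names are hypothetical, and can_reuse() encodes only the two conditions presumed above (shared blob, or target slices already written). The real checks are likely more involved.

```python
def unused_mask(blob_len, write_off, write_len, bits=16):
    """Mask with bit i set if slice i of the blob is untouched by the write."""
    slice_len = blob_len // bits
    mask = (1 << bits) - 1
    first = write_off // slice_len
    last = (write_off + write_len - 1) // slice_len
    for i in range(first, last + 1):
        mask &= ~(1 << i)  # clear the bit of every slice the write touches
    return mask


def can_reuse(shared, mask, blob_len, off, length, bits=16):
    """Toy predicate: reuse only a non-shared blob whose target slices are unused."""
    if shared:
        return False
    needed = ((1 << bits) - 1) & ~unused_mask(blob_len, off, length, bits)
    return mask & needed == needed


# New 64K blob after the 0x9000~1000 write: the log shows unused=0xfdff.
print(hex(unused_mask(0x10000, 0x9000, 0x1000)))
# A blob written from 0x8000 to its end would show unused=0xff.
print(hex(unused_mask(0x10000, 0x8000, 0x8000)))
# Under these assumptions, writing 0x9000~1000 into that blob is refused
# because slice 9 is already used, even if the blob were not shared.
print(can_reuse(False, 0xff, 0x10000, 0x9000, 0x1000))
```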

Actually, the above write pattern is the simplest (but quite artificial) scenario that demonstrates the issue. Generally speaking, the same behavior is observed when we overwrite data in an RBD image and the overwrite maps to a previously used object. If the written data partially overlaps an existing blob (including its unused part) and this blob cannot be reused (it is shared, which seems to be the case for EC overwrites, or the relevant unused bits are cleared, i.e. it has already been written at those positions), BlueStore allocates a new blob and preserves the previous ones (remember that this is a partial overwrite). To some degree this resembles the behavior of compressed blobs, where a stack of partially overlapping blobs might build up until garbage collection cleans it up.

So, e.g., a full RBD image prefill followed by random small overwrites will most probably result in some space overhead: up to 16x in the worst (certainly very rare) case.


Additional notes:

- This issue isn't present in master with the new bluestore_min_alloc_size defaults (=4K).

- In Nautilus (and Octopus with bluestore_min_alloc_size_hdd set back to 64K) this behavior is less visible due to the blob garbage collection we introduced - see https://github.com/ceph/ceph/pull/30144

  But an increase ratio of up to 3x is still observable.

- The issue isn't observed for replicated pools.

- Shared blobs created during an EC overwrite seem to lack a rollback to the non-shared state after op completion (and snapshot removal). Hence they most probably pollute onodes and the DB (remember their persistence mechanics) and negatively impact performance. This needs more investigation/verification, though.


The above analysis has two goals:

1) Show the potential origin of space overhead for pre-Nautilus clusters.

2) Show the hidden danger of using allocation sizes larger than 4K (i.e. the device block size?) for EC pools. However, our research shows that a 4K alloc size is less efficient for spinner-backed pools.

https://github.com/ceph/ceph/pull/31867 suggests a 'partial' rollback in this respect, at least for the default setup.
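The effect of the allocation unit on a single small write can be illustrated as follows. This is a sketch under the assumption that a write which cannot reuse an existing blob gets a new blob whose allocation is rounded up to the allocation unit, as the 'want_size 0x10000 alloc_unit 0x10000' line in the log suggests for a 0x1000 write. The helper name is hypothetical.

```python
def allocated_for_new_blob(write_len, alloc_unit):
    """Bytes allocated when write_len lands in a new blob (rounded up)."""
    return -(-write_len // alloc_unit) * alloc_unit


write_len = 0x1000  # the 4K shard write from the log snippet
for alloc_unit in (64 * 1024, 4 * 1024):
    allocated = allocated_for_new_blob(write_len, alloc_unit)
    print(f"alloc unit {alloc_unit // 1024:2d}K: "
          f"{allocated // 1024}K allocated for a {write_len // 1024}K write "
          f"({allocated // write_len}x)")
```

With a 64K unit the 4K write costs 64K of space, while with a 4K unit it costs exactly 4K, which is why the issue does not show up with the 4K default.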


Thanks,

Igor

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx

A while back, with Luminous, we tested different EC profiles for RBD: 2+1, 2+2, 3+2, 4+2, 5+2, 5+3, 6+2. We found that 5+3 stood out with significantly higher overhead. Stripe width and min alloc size were left at their defaults. Tests were 4k/4m rand/seq as well as file copy I/O, on both HDD and SSD.

As per your tests, it seems that overwrites at different overlapping offsets cause this overhead, so maybe the I/O tests we did just happened to cause fewer offset overlaps with all profiles but 5+3. Maybe, but I am not sure. It would be interesting if you could run the same test on, say, 4+2 and see whether you still get high alloc overhead as with 6+3.

/Maged




