Re: Root cause analysis for space overhead with erasure coded pools.

On 1/23/2020 2:20 AM, Igor Fedotov wrote:
Additional notes:


...
- Shared blobs created during EC overwrite seem to lack a rollback to the non-shared state after op completion (and snapshot removal). Hence they most probably pollute onodes and the DB (remember their persistence mechanics) and negatively impact performance. This needs more investigation/verification, though; a minimal sketch of the lifecycle in question follows below.
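
For illustration only, here is a minimal sketch of that lifecycle, assuming made-up names and types (SharedBlobRecord, Blob, clone_for_rollback and so on are stand-ins, not the actual BlueStore code). The asymmetry between the two steps is the missing rollback:

#include <cstdint>
#include <map>

struct SharedBlobRecord {
  std::map<uint64_t, uint32_t> ref_map;    // extent offset -> refcount
};

struct Blob {
  bool shared = false;
  uint64_t sbid = 0;                       // shared-blob id, key into the DB
};

std::map<uint64_t, SharedBlobRecord> g_db; // stand-in for the RocksDB records
uint64_t g_next_sbid = 1;

// Step 1: the clone taken for EC rollback marks the blob shared and
// persists a ref_map record keyed by a fresh sbid.
void clone_for_rollback(Blob& b, uint64_t extent_off) {
  if (!b.shared) {
    b.shared = true;
    b.sbid = g_next_sbid++;
    g_db[b.sbid].ref_map[extent_off] = 1;  // the head object's own ref
  }
  g_db[b.sbid].ref_map[extent_off]++;      // plus the rollback clone's ref
}

// Step 2: the rollback clone is removed once the op completes.
void drop_rollback_clone(Blob& b, uint64_t extent_off) {
  g_db[b.sbid].ref_map[extent_off]--;      // back to a single reference...
  // ...but nothing clears b.shared or erases g_db[b.sbid], which is
  // exactly the pollution described above.
}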

Additional update on the above. Here is an object dump snippet for an object from an EC 4+2 pool which has received 2 partial overwrites (3 writes in total), using the access pattern from my analysis.

2020-01-23T17:53:34.585+0300 7fc6026040c0 30 _dump_onode 0x5620616ca580 3#3:1b231130:::rbd_data.4.110088be0420.0000000000000000:head# nid 1652 size 0x10000 (65536)  expected_object_size 1048576 expected_write_size 1048576 in 0 shards, 0 spanning blobs

...

2020-01-23T17:53:34.585+0300 7fc6026040c0 20 bluestore(dev/osd0) fsck_check_objects_shallow    0x0~3000: 0x0~3000 Blob(0x562061641f80 blob([0x1ee0000~10000] csum+shared crc32c/0x1000) use_tracker(0x10000 0x3000) SharedBlob(0x562061640fc0 sbid 0x2804))
2020-01-23T17:53:34.585+0300 7fc6026040c0 20 bluestore(dev/osd0) fsck_check_objects_shallow    0x3000~1000: 0x3000~1000 Blob(0x56206235c000 blob([0x2400000~10000] csum+has_unused+shared crc32c/0x1000 unused=0x7) use_tracker(0x10000 0x1000) SharedBlob(0x56206235c070 sbid 0x2805))
2020-01-23T17:53:34.585+0300 7fc6026040c0 20 bluestore(dev/osd0) fsck_check_objects_shallow    0x4000~c000: 0x4000~c000 Blob(0x56206235c0e0 blob([0x2410000~10000] csum+has_unused crc32c/0x1000 unused=0xf) use_tracker(0x10000 0xc000) SharedBlob(0x56206235c150 sbid 0x0))
2020-01-23T17:53:34.585+0300 7fc6026040c0 30 bluestore(dev/osd0) _fsck_check_extents oid 3#3:1b231130:::rbd_data.4.110088be0420.0000000000000000:head# extents [0x2410000~10000]
...

2020-01-23T17:53:34.585+0300 7fc6026040c0  1 bluestore(dev/osd0) _fsck_on_open checking shared_blobs
2020-01-23T17:53:34.585+0300 7fc6026040c0 20 bluestore(dev/osd0) _fsck_on_open  SharedBlob(0x562061640fc0 sbid 0x2804) (sbid 0x2804 ref_map(0x1ee0000~10000=1))
2020-01-23T17:53:34.585+0300 7fc6026040c0 30 bluestore(dev/osd0) _fsck_check_extents oid 3#3:1b231130:::rbd_data.4.110088be0420.0000000000000000:head# extents [0x1ee0000~10000]
2020-01-23T17:53:34.585+0300 7fc6026040c0 20 bluestore(dev/osd0) _fsck_on_open  SharedBlob(0x56206235c070 sbid 0x2805) (sbid 0x2805 ref_map(0x2400000~10000=1))
2020-01-23T17:53:34.585+0300 7fc6026040c0 30 bluestore(dev/osd0) _fsck_check_extents oid 3#3:1b231130:::rbd_data.4.110088b
...


One can see 3 blobs; two of them (which relate to the initial write and the first overwrite) are shared. I.e. each partial overwrite might result in the appearance of a shared blob, which is quite expensive: each shared blob has a corresponding record in RocksDB (hence an additional lookup/update op on access), shared blobs are tracked in a common container at the collection level, their handling is more complicated, etc. A rough illustration of the per-access cost follows.
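
Here is a rough illustration of that cost, again with assumed, simplified interfaces rather than the real Ceph ones:

#include <cstdint>
#include <map>

struct SharedBlobRecord {
  std::map<uint64_t, uint32_t> ref_map;    // extent offset -> refcount
};

std::map<uint64_t, SharedBlobRecord> g_db; // stand-in for RocksDB

struct Blob {
  bool shared = false;
  uint64_t sbid = 0;
};

// A plain blob is fully described inside the onode. A shared blob
// additionally needs a point lookup of its sbid record on access (and a
// matching write whenever the ref_map changes), plus membership in the
// per-collection shared-blob container.
SharedBlobRecord load_shared_state(const Blob& b) {
  if (!b.shared)
    return {};                  // no extra work for a plain blob
  return g_db.at(b.sbid);       // the extra lookup on every access
}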

And it makes no sense at this point of the onode's life: the ref_map denotes just a single reference per blob.

Hence IMO this behavior is suboptimal and might need some improvement; one possible shape of it is sketched below.
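
As a hedged sketch only (simplified stand-in types and a hypothetical cleanup hook, not a patch against the actual BlueStore code), the improvement could look like this:

#include <cstdint>
#include <map>

struct SharedBlobRecord {
  std::map<uint64_t, uint32_t> ref_map;    // extent offset -> refcount
};

struct Blob {
  bool shared = false;
  uint64_t sbid = 0;
};

using SharedBlobStore = std::map<uint64_t, SharedBlobRecord>;

static bool only_single_refs(const SharedBlobRecord& rec) {
  for (const auto& [offset, refs] : rec.ref_map)
    if (refs != 1)
      return false;             // some extent really is still shared
  return true;
}

// Hypothetical cleanup hook, run e.g. on op completion or snapshot
// removal: if every extent carries exactly one reference, the sharing
// bookkeeping buys nothing, so revert the blob and delete its record.
void maybe_rollback_to_unshared(Blob& b, SharedBlobStore& db) {
  if (!b.shared)
    return;
  auto it = db.find(b.sbid);
  if (it == db.end() || !only_single_refs(it->second))
    return;
  db.erase(it);                 // drops the per-blob RocksDB record
  b.shared = false;
  b.sbid = 0;
}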

Thanks,

Igor

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



