Re: adding block.db to OSD

Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> · Mon, 11 May 2020 08:44:46 +0200

Hi Igor,

Where should I post the logs?

On 06.05.20 at 09:23, Stefan Priebe - Profihost AG wrote:
> Hi Igor,
> 
> On 05.05.20 at 16:10, Igor Fedotov wrote:
>> Hi Stefan,
>>
>> so (surprise!) some DB access counters show a significant difference, e.g.
>>
>>         "kv_flush_lat": {
>>             "avgcount": 1423,
>>             "sum": 0.000906419,
>>             "avgtime": 0.000000636
>>         },
>>         "kv_sync_lat": {
>>             "avgcount": 1423,
>>             "sum": 0.712888091,
>>             "avgtime": 0.000500975
>>         },
>> vs.
>>
>>      "kv_flush_lat": {
>>             "avgcount": 1146,
>>             "sum": 3.346228802,
>>             "avgtime": 0.002919920
>>         },
>>       "kv_sync_lat": {
>>             "avgcount": 1146,
>>             "sum": 3.754915016,
>>             "avgtime": 0.003276540
>>         },
>>
>> Also for bluefs:
>> "bytes_written_sst": 0,
>> vs.
>>  "bytes_written_sst": 59785361,
>>
>> Could you please rerun these benchmark and perf counter gathering steps a couple more times and check whether the difference persists?
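For reference, the avgtime field in these perf counter dumps appears to be sum divided by avgcount, and the figures quoted above bear that out. A minimal Python sketch that recomputes it and compares the two dumps (the helper name avg_latency and the labels dump_a and dump_b are only illustrative, not part of Ceph):

```python
def avg_latency(total_seconds: float, count: int) -> float:
    """Mean latency per operation: the counter's 'sum' divided by 'avgcount'."""
    return total_seconds / count if count else 0.0

# (sum, avgcount) pairs copied from the two perf counter dumps quoted above.
dump_a = {"kv_flush_lat": (0.000906419, 1423), "kv_sync_lat": (0.712888091, 1423)}
dump_b = {"kv_flush_lat": (3.346228802, 1146), "kv_sync_lat": (3.754915016, 1146)}

for name in dump_a:
    a = avg_latency(*dump_a[name])
    b = avg_latency(*dump_b[name])
    print(f"{name}: {a:.9f} s vs. {b:.9f} s (ratio {b / a:.1f}x)")
```

Comparing the per-operation averages rather than the raw sums keeps the comparison fair when the two runs completed a different number of operations (1423 vs. 1146 here).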
> 
> I reset all perf counters and ran the bench 10 times on each osd.
> 
> OSD 38:
> bench: wrote 12 MiB in blocks of 4 KiB in 1.22796 sec at 9.5 MiB/sec
> 2.44k IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 1.26407 sec at 9.3 MiB/sec
> 2.37k IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 1.24987 sec at 9.4 MiB/sec
> 2.40k IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 1.37125 sec at 8.5 MiB/sec
> 2.19k IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 1.25549 sec at 9.3 MiB/sec
> 2.39k IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 1.24358 sec at 9.4 MiB/sec
> 2.41k IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 1.24208 sec at 9.4 MiB/sec
> 2.42k IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 1.2433 sec at 9.4 MiB/sec
> 2.41k IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 1.26548 sec at 9.3 MiB/sec
> 2.37k IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 1.31509 sec at 8.9 MiB/sec
> 2.28k IOPS
> 
> kv_flush_lat.sum: 8.955978864
> kv_sync_lat.sum: 10.869536503
> bytes_written_sst: 0
> 
> 
> OSD 0:
> bench: wrote 12 MiB in blocks of 4 KiB in 5.71447 sec at 2.1 MiB/sec 524 IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 6.18679 sec at 1.9 MiB/sec 484 IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 6.69068 sec at 1.8 MiB/sec 448 IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 7.06413 sec at 1.7 MiB/sec 424 IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 7.50321 sec at 1.6 MiB/sec 399 IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 6.86882 sec at 1.7 MiB/sec 436 IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 7.11702 sec at 1.6 MiB/sec 421 IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 7.10497 sec at 1.6 MiB/sec 422 IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 6.69801 sec at 1.7 MiB/sec 447 IOPS
> bench: wrote 12 MiB in blocks of 4 KiB in 7.13588 sec at 1.6 MiB/sec 420 IOPS
>
> kv_flush_lat.sum: 0.003866224
> kv_sync_lat.sum: 2.667407139
> bytes_written_sst: 34904457
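To put a single number on the two series above, here is a small Python sketch that averages the ten IOPS readings per OSD. The values are transcribed by hand from the bench output (the "k" figures expanded to plain numbers), and the helper name summarize is purely illustrative:

```python
from statistics import mean

# IOPS from the ten bench runs listed above for each OSD.
osd38_iops = [2440, 2370, 2400, 2190, 2390, 2410, 2420, 2410, 2370, 2280]
osd0_iops = [524, 484, 448, 424, 399, 436, 421, 422, 447, 420]

def summarize(samples):
    """Return (mean, min, max) for a list of IOPS samples."""
    return mean(samples), min(samples), max(samples)

for label, samples in (("osd.38", osd38_iops), ("osd.0", osd0_iops)):
    avg, low, high = summarize(samples)
    print(f"{label}: mean {avg:.1f} IOPS (min {low}, max {high})")

print(f"osd.38 / osd.0: {mean(osd38_iops) / mean(osd0_iops):.2f}x")
```

The gap is stable across all ten runs: the slowest osd.38 run (2190 IOPS) is still several times faster than the fastest osd.0 run (524 IOPS).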
> 
>> If that holds in particular for the "kv_flush_lat" counter, please rerun with debug-bluefs set to 20 and collect OSD logs for both cases.
> 
> Yes, it's still true for kv_flush_lat; see above. Where should I upload
> those logs?
> 
> greets,
> Stefan
> 
>>
>> Thanks,
>> Igor
>>
>> On 5/5/2020 11:46 AM, Stefan Priebe - Profihost AG wrote:
>>> Hello Igor,
>>>
>>> On 30.04.20 at 15:52, Igor Fedotov wrote:
>>>> 1) reset perf counters for the specific OSD
>>>>
>>>> 2) run bench
>>>>
>>>> 3) dump perf counters.
>>> This is OSD 0:
>>>
>>> # ceph tell osd.0 bench -f plain 12288000 4096
>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.70482 sec at 1.7 MiB/sec 447 IOPS
>>>
>>> https://pastebin.com/raw/hbKcU07g
>>>
>>> This is OSD 38:
>>>
>>> # ceph tell osd.38 bench -f plain 12288000 4096
>>> bench: wrote 12 MiB in blocks of 4 KiB in 2.01763 sec at 5.8 MiB/sec
>>> 1.49k IOPS
>>>
>>> https://pastebin.com/raw/Tx2ckVm1
>>>
>>>> Collecting disks' (both main and db) activity with iostat would be nice
>>>> too. But please either increase benchmark duration or reduce iostat
>>>> probe period to 0.1 or 0.05 second
>>> This gives me:
>>>
>>> # ceph tell osd.38 bench -f plain 122880000 4096
>>> Error EINVAL: 'count' values greater than 12288000 for a block size of 4
>>> KiB, assuming 100 IOPS, for 30 seconds, can cause ill effects on osd.
>>> Please adjust 'osd_bench_small_size_max_iops' with a higher value if you
>>> wish to use a higher 'count'.
>>>
>>> Stefan
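The limit in that error message follows from the numbers it quotes: 100 IOPS x 4 KiB x 30 seconds. A small Python sketch of the arithmetic (the function names are illustrative, and that Ceph computes the limit exactly this way is an assumption based on the message text):

```python
def max_bench_count(max_iops: int, block_size: int, duration_sec: int) -> int:
    """Largest 'count' (total bytes) the bench accepts under the assumed formula."""
    return max_iops * block_size * duration_sec

def iops_needed(count: int, block_size: int, duration_sec: int) -> int:
    """Smallest max-IOPS setting that would admit 'count' (ceiling division)."""
    return -(-count // (block_size * duration_sec))

# Values taken from the error message: 100 IOPS, 4 KiB blocks, 30 seconds.
print(max_bench_count(100, 4096, 30))
# The rejected request asked for ten times that amount.
print(iops_needed(122880000, 4096, 30))
```

Under that assumption, the tenfold larger count would need 'osd_bench_small_size_max_iops' raised tenfold as well, which is what the message itself suggests.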
>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>> On 4/28/2020 8:42 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Igor,
>>>>>
>>>>> but the performance issue is still present even on the recreated OSD.
>>>>>
>>>>> # ceph tell osd.38 bench -f plain 12288000 4096
>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.63389 sec at 7.2 MiB/sec
>>>>> 1.84k IOPS
>>>>>
>>>>> vs.
>>>>>
>>>>> # ceph tell osd.10 bench -f plain 12288000 4096
>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 10.7454 sec at 1.1 MiB/sec 279 IOPS
>>>>>
>>>>> Both are backed by the same SAMSUNG SSD as block.db.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> On 28.04.20 at 19:12, Stefan Priebe - Profihost AG wrote:
>>>>>> Hi Igor,
>>>>>> On 27.04.20 at 15:03, Igor Fedotov wrote:
>>>>>>> Just left a comment at https://tracker.ceph.com/issues/44509
>>>>>>>
>>>>>>> Generally bdev-new-db performs no migration. RocksDB might
>>>>>>> eventually do that, but there is no guarantee that it moves
>>>>>>> everything.
>>>>>>>
>>>>>>> One should use bluefs-bdev-migrate to do actual migration.
>>>>>>>
>>>>>>> And I think that's the root cause for the above ticket.
>>>>>> perfect - this removed all spillover in seconds.
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Igor
>>>>>>>
>>>>>>> On 4/24/2020 2:37 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>> No, not a standalone WAL. I wanted to ask whether bdev-new-db migrates
>>>>>>>> the DB and WAL from HDD to SSD.
>>>>>>>>
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>>> On 24.04.2020 at 13:01, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>>
>>>>>>>>> Unless you have 3 different types of disks behind the OSD (e.g. HDD,
>>>>>>>>> SSD, NVMe), a standalone WAL makes no sense.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/24/2020 1:58 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> Is a WAL device missing? Do I need to run bluefs-bdev-new-db and
>>>>>>>>>> the same for the WAL?
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>>> On 24.04.2020 at 11:32, Stefan Priebe - Profihost AG
>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Igor,
>>>>>>>>>>>
>>>>>>>>>>> there must be a difference. I purged osd.0 and recreated it.
>>>>>>>>>>>
>>>>>>>>>>> Now it gives:
>>>>>>>>>>> ceph tell osd.0 bench
>>>>>>>>>>> {
>>>>>>>>>>>     "bytes_written": 1073741824,
>>>>>>>>>>>     "blocksize": 4194304,
>>>>>>>>>>>     "elapsed_sec": 8.1554735639999993,
>>>>>>>>>>>     "bytes_per_sec": 131659040.46819863,
>>>>>>>>>>>     "iops": 31.389961354303033
>>>>>>>>>>> }
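The derived fields in this JSON can be recomputed from the raw ones: bytes_per_sec is bytes_written divided by elapsed_sec, and iops is that rate divided by blocksize. A minimal Python sketch using the values above (the helper name bench_rates is illustrative only):

```python
def bench_rates(bytes_written: int, blocksize: int, elapsed_sec: float):
    """Return (bytes_per_sec, iops) derived from the raw bench fields."""
    bytes_per_sec = bytes_written / elapsed_sec
    iops = bytes_per_sec / blocksize
    return bytes_per_sec, iops

# Raw fields from the osd.0 result quoted above (1 GiB written in 4 MiB blocks).
bps, iops = bench_rates(1073741824, 4194304, 8.1554735639999993)
print(f"{bps / 2**20:.1f} MiB/s, {iops:.2f} IOPS")
```

Note that this default run uses 4 MiB blocks, so it measures sequential throughput; the later 4 KiB runs in this thread stress small writes instead and are not directly comparable.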
>>>>>>>>>>>
>>>>>>>>>>> What's wrong with adding a block.db device later?
>>>>>>>>>>>
>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>> On 23.04.20 at 20:34, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> if the OSDs are idle the difference is even worse:
>>>>>>>>>>>> # ceph tell osd.0 bench
>>>>>>>>>>>> {
>>>>>>>>>>>>      "bytes_written": 1073741824,
>>>>>>>>>>>>      "blocksize": 4194304,
>>>>>>>>>>>>      "elapsed_sec": 15.396707875000001,
>>>>>>>>>>>>      "bytes_per_sec": 69738403.346825853,
>>>>>>>>>>>>      "iops": 16.626931034761871
>>>>>>>>>>>> }
>>>>>>>>>>>> # ceph tell osd.38 bench
>>>>>>>>>>>> {
>>>>>>>>>>>>      "bytes_written": 1073741824,
>>>>>>>>>>>>      "blocksize": 4194304,
>>>>>>>>>>>>      "elapsed_sec": 6.8903985170000004,
>>>>>>>>>>>>      "bytes_per_sec": 155831599.77624846,
>>>>>>>>>>>>      "iops": 37.153148597776521
>>>>>>>>>>>> }
>>>>>>>>>>>> Stefan
>>>>>>>>>>>> On 23.04.20 at 14:39, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> On 23.04.20 at 14:06, Igor Fedotov wrote:
>>>>>>>>>>>>>> I don't recall any additional tuning that needs to be applied to
>>>>>>>>>>>>>> a new DB volume. And I assume the hardware is pretty much the same...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you still have any significant amount of data spilled over
>>>>>>>>>>>>>> for these updated OSDs? If not, I don't have any valid
>>>>>>>>>>>>>> explanation for the phenomenon.
>>>>>>>>>>>>> just the 64k from here:
>>>>>>>>>>>>> https://tracker.ceph.com/issues/44509
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You might want to try "ceph osd bench" to compare OSDs under
>>>>>>>>>>>>>> pretty much the same load. Is any difference observed?
>>>>>>>>>>>>> Servers are the same HW. OSD Bench is:
>>>>>>>>>>>>> # ceph tell osd.0 bench
>>>>>>>>>>>>> {
>>>>>>>>>>>>>       "bytes_written": 1073741824,
>>>>>>>>>>>>>       "blocksize": 4194304,
>>>>>>>>>>>>>       "elapsed_sec": 16.091414781000001,
>>>>>>>>>>>>>       "bytes_per_sec": 66727620.822242722,
>>>>>>>>>>>>>       "iops": 15.909104543266945
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> # ceph tell osd.36 bench
>>>>>>>>>>>>> {
>>>>>>>>>>>>>       "bytes_written": 1073741824,
>>>>>>>>>>>>>       "blocksize": 4194304,
>>>>>>>>>>>>>       "elapsed_sec": 10.023828538,
>>>>>>>>>>>>>       "bytes_per_sec": 107118933.6419194,
>>>>>>>>>>>>>       "iops": 25.539143953780986
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> OSD 0 is a Toshiba MG07SCA12TA SAS 12G
>>>>>>>>>>>>> OSD 36 is a Seagate ST12000NM0008-2H SATA 6G
>>>>>>>>>>>>>
>>>>>>>>>>>>> The SSDs are all the same, like the rest of the HW. Going by their
>>>>>>>>>>>>> specs, both drives should give the same performance. The only other
>>>>>>>>>>>>> difference is that OSD 36 was created directly with the block.db
>>>>>>>>>>>>> device (Nautilus 14.2.7) and OSD 0 (14.2.8) was not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 4/23/2020 8:35 AM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is there anything else needed besides running:
>>>>>>>>>>>>>>> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD}
>>>>>>>>>>>>>>> bluefs-bdev-new-db --dev-target /dev/vgroup/lvdb-1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I did so some weeks ago, and currently I'm seeing that all OSDs
>>>>>>>>>>>>>>> originally deployed with --block-db show 10-20% I/O wait, while
>>>>>>>>>>>>>>> all those that were converted using ceph-bluestore-tool show
>>>>>>>>>>>>>>> 80-100% I/O wait.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, is there some tuning available to use more of the SSD? The
>>>>>>>>>>>>>>> SSD (block-db) is only 0-2% utilized.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx