Re: adding block.db to OSD

On 11.05.20 at 13:25, Igor Fedotov wrote:
> Hi Stefan,
> 
> I don't have specific preferences, hence any public storage you prefer.
> 
> Just one note - I presume you collected the logs for the full set of 10
> runs, which is redundant. Could you please collect detailed logs (one
> per OSD) for single-shot runs instead?
> 
> Sorry for the unclear previous inquiry.

No problem - I'll recreate them and send you those logs in private.
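
For reference, the per-OSD collection will be a single shot each, roughly like
this (a sketch - bench parameters as before; debug syntax and log path per my
setup, osd.0 shown):

# ceph tell osd.0 config set debug_bluefs 20
# ceph daemon osd.0 perf reset all
# ceph tell osd.0 bench -f plain 12288000 4096
# ceph daemon osd.0 perf dump > osd.0-perf.json
# cp /var/log/ceph/ceph-osd.0.log osd.0-bluefs-debug.log
# ceph tell osd.0 config set debug_bluefs 1/5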

> Additionally I realized that it's the faster OSD.38 which has the higher
> flush/sync latency values, and that holds for both attempts.
> 
> This seems pretty odd, to be honest. Is that indeed correct - nothing got
> mixed up along the way?

Yes, this is indeed correct and holds for all attempts.

I hope the logs will give more information. Currently I suspect strange
behaviour of the Toshiba drives (they are not SMR).
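
One thing worth comparing on the two drive models is the write-cache setting,
since a differing volatile cache could explain such a latency gap (a sketch;
device names are placeholders - sdparm for the SAS Toshibas, hdparm for the
SATA Seagates):

# sdparm --get=WCE /dev/sdX
# hdparm -W /dev/sdY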

Stefan

> 
> Thanks,
> 
> Igor
> 
> 
> On 5/11/2020 9:44 AM, Stefan Priebe - Profihost AG wrote:
>> Hi Igor,
>>
>> where to post the logs?
>>
>> On 06.05.20 at 09:23, Stefan Priebe - Profihost AG wrote:
>>> Hi Igor,
>>>
>>> On 05.05.20 at 16:10, Igor Fedotov wrote:
>>>> Hi Stefan,
>>>>
>>>> so (surprise!) some DB access counters show a significant
>>>> difference, e.g.
>>>>
>>>>          "kv_flush_lat": {
>>>>              "avgcount": 1423,
>>>>              "sum": 0.000906419,
>>>>              "avgtime": 0.000000636
>>>>          },
>>>>          "kv_sync_lat": {
>>>>              "avgcount": 1423,
>>>>              "sum": 0.712888091,
>>>>              "avgtime": 0.000500975
>>>>          },
>>>> vs.
>>>>
>>>>       "kv_flush_lat": {
>>>>              "avgcount": 1146,
>>>>              "sum": 3.346228802,
>>>>              "avgtime": 0.002919920
>>>>          },
>>>>        "kv_sync_lat": {
>>>>              "avgcount": 1146,
>>>>              "sum": 3.754915016,
>>>>              "avgtime": 0.003276540
>>>>          },
>>>>
>>>> Also for bluefs:
>>>> "bytes_written_sst": 0,
>>>> vs.
>>>>   "bytes_written_sst": 59785361,
>>>>
>>>> Could you please rerun these benchmark/perf-counter gathering steps
>>>> a couple more times and check whether the difference persists?
>>> I reset all perf counters and ran the bench 10 times on each osd.
>>>
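>>> Roughly like this for each of the two OSDs (a sketch; the perf reset goes
>>> via the admin socket on my setup):
>>>
>>> # ceph daemon osd.38 perf reset all
>>> # for i in $(seq 10); do ceph tell osd.38 bench -f plain 12288000 4096; done
>>>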
>>> OSD 38:
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.22796 sec at 9.5 MiB/sec
>>> 2.44k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.26407 sec at 9.3 MiB/sec
>>> 2.37k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.24987 sec at 9.4 MiB/sec
>>> 2.40k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.37125 sec at 8.5 MiB/sec
>>> 2.19k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.25549 sec at 9.3 MiB/sec
>>> 2.39k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.24358 sec at 9.4 MiB/sec
>>> 2.41k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.24208 sec at 9.4 MiB/sec
>>> 2.42k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.2433 sec at 9.4 MiB/sec
>>> 2.41k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.26548 sec at 9.3 MiB/sec
>>> 2.37k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.31509 sec at 8.9 MiB/sec
>>> 2.28k IOPS
>>>
>>> kv_flush_lat.sum: 8.955978864
>>> kv_sync_lat.sum: 10.869536503
>>> bytes_written_sst: 0
>>>
>>>
>>> OSD 0:
>>> bench: wrote 12 MiB in blocks of 4 KiB in 5.71447 sec at 2.1 MiB/sec 524
>>> IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.18679 sec at 1.9 MiB/sec 484
>>> IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.69068 sec at 1.8 MiB/sec 448
>>> IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.06413 sec at 1.7 MiB/sec 424
>>> IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.50321 sec at 1.6 MiB/sec 399
>>> IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.86882 sec at 1.7 MiB/sec 436
>>> IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.11702 sec at 1.6 MiB/sec 421
>>> IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.10497 sec at 1.6 MiB/sec 422
>>> IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.69801 sec at 1.7 MiB/sec 447
>>> IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.13588 sec at 1.6 MiB/sec 420
>>> IOPS
>>> kv_flush_lat.sum: 0.003866224
>>> kv_sync_lat.sum: 2.667407139
>>> bytes_written_sst: 34904457
>>>
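>>> The sums above were pulled from the counter dumps along these lines
>>> (assuming jq, and that the counters sit under the bluestore and bluefs
>>> sections as they do in my 14.2.x perf dumps):
>>>
>>> # ceph daemon osd.38 perf dump | jq '.bluestore.kv_flush_lat.sum, .bluestore.kv_sync_lat.sum, .bluefs.bytes_written_sst'
>>>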
>>>> If that's true, particularly for the "kv_flush_lat" counter, please
>>>> rerun with debug-bluefs set to 20 and collect OSD logs for both cases.
>>> Yes, it's still true for kv_flush_lat - see above. Where should I upload
>>> those logs?
>>>
>>> Greets,
>>> Stefan
>>>
>>>> Thanks,
>>>> Igor
>>>>
>>>> On 5/5/2020 11:46 AM, Stefan Priebe - Profihost AG wrote:
>>>>> Hello Igor,
>>>>>
>>>>> On 30.04.20 at 15:52, Igor Fedotov wrote:
>>>>>> 1) reset perf counters for the specific OSD
>>>>>>
>>>>>> 2) run bench
>>>>>>
>>>>>> 3) dump perf counters.
>>>>> This is OSD 0:
>>>>>
>>>>> # ceph tell osd.0 bench -f plain 12288000 4096
>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.70482 sec at 1.7
>>>>> MiB/sec 447
>>>>> IOPS
>>>>>
>>>>> https://pastebin.com/raw/hbKcU07g
>>>>>
>>>>> This is OSD 38:
>>>>>
>>>>> # ceph tell osd.38 bench -f plain 12288000 4096
>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 2.01763 sec at 5.8 MiB/sec
>>>>> 1.49k IOPS
>>>>>
>>>>> https://pastebin.com/raw/Tx2ckVm1
>>>>>
>>>>>> Collecting disks' (both main and db) activity with iostat would be
>>>>>> nice
>>>>>> too. But please either increase benchmark duration or reduce iostat
>>>>>> probe period to 0.1 or 0.05 second
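>>>>>> E.g. something like this running next to the bench (a sketch - device
>>>>>> names are placeholders; sub-second probes only work if your iostat
>>>>>> build accepts fractional intervals, otherwise use 1-second probes and
>>>>>> a longer benchmark):
>>>>>>
>>>>>> # iostat -dxt sdb sdc 1 60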
>>>>> This gives me:
>>>>>
>>>>> # ceph tell osd.38 bench -f plain 122880000 4096
>>>>> Error EINVAL: 'count' values greater than 12288000 for a block size
>>>>> of 4
>>>>> KiB, assuming 100 IOPS, for 30 seconds, can cause ill effects on osd.
>>>>> Please adjust 'osd_bench_small_size_max_iops' with a higher value
>>>>> if you
>>>>> wish to use a higher 'count'.
>>>>>
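>>>>> I guess the limit could be raised for a longer run with something like
>>>>> the following (untested here), but I left it at the default:
>>>>>
>>>>> # ceph tell osd.38 config set osd_bench_small_size_max_iops 10000
>>>>>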
>>>>> Stefan
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>> On 4/28/2020 8:42 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hi Igor,
>>>>>>>
>>>>>>> but the performance issue is still present even on the recreated
>>>>>>> OSD.
>>>>>>>
>>>>>>> # ceph tell osd.38 bench -f plain 12288000 4096
>>>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.63389 sec at 7.2 MiB/sec
>>>>>>> 1.84k IOPS
>>>>>>>
>>>>>>> vs.
>>>>>>>
>>>>>>> # ceph tell osd.10 bench -f plain 12288000 4096
>>>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 10.7454 sec at 1.1
>>>>>>> MiB/sec 279
>>>>>>> IOPS
>>>>>>>
>>>>>>> both backed by the same SAMSUNG SSD as block.db.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>>> On 28.04.20 at 19:12, Stefan Priebe - Profihost AG wrote:
>>>>>>>> Hi Igor,
>>>>>>>> On 27.04.20 at 15:03, Igor Fedotov wrote:
>>>>>>>>> Just left a comment at https://tracker.ceph.com/issues/44509
>>>>>>>>>
>>>>>>>>> Generally, bdev-new-db performs no migration; RocksDB might eventually
>>>>>>>>> do that, but there is no guarantee it moves everything.
>>>>>>>>>
>>>>>>>>> One should use bluefs-bdev-migrate to do actual migration.
>>>>>>>>>
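>>>>>>>>> For your layout that would be roughly (a sketch - run with the OSD
>>>>>>>>> stopped, same --path convention as for bluefs-bdev-new-db):
>>>>>>>>>
>>>>>>>>> # ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD} \
>>>>>>>>>     bluefs-bdev-migrate --devs-source /var/lib/ceph/osd/ceph-${OSD}/block \
>>>>>>>>>     --dev-target /var/lib/ceph/osd/ceph-${OSD}/block.db
>>>>>>>>>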
>>>>>>>>> And I think that's the root cause for the above ticket.
>>>>>>>> perfect - this removed all spillover in seconds.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Igor
>>>>>>>>>
>>>>>>>>> On 4/24/2020 2:37 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> No, not a standalone WAL - I wanted to ask whether bdev-new-db
>>>>>>>>>> migrated the DB and WAL from HDD to SSD.
>>>>>>>>>>
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>>> On 24.04.2020 at 13:01, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Unless you have 3 different types of disks behind an OSD (e.g. HDD,
>>>>>>>>>>> SSD, NVMe), a standalone WAL makes no sense.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 4/24/2020 1:58 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>> Is the WAL device missing? Do I need to run bluefs-bdev-new-db and
>>>>>>>>>>>> bluefs-bdev-new-wal?
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>>> On 24.04.2020 at 11:32, Stefan Priebe - Profihost AG
>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Igor,
>>>>>>>>>>>>>
>>>>>>>>>>>>> there must be a difference. I purged osd.0 and recreated it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now it gives:
>>>>>>>>>>>>> ceph tell osd.0 bench
>>>>>>>>>>>>> {
>>>>>>>>>>>>>      "bytes_written": 1073741824,
>>>>>>>>>>>>>      "blocksize": 4194304,
>>>>>>>>>>>>>      "elapsed_sec": 8.1554735639999993,
>>>>>>>>>>>>>      "bytes_per_sec": 131659040.46819863,
>>>>>>>>>>>>>      "iops": 31.389961354303033
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> What's wrong with adding a block.db device later?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 23.04.20 at 20:34, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> If the OSDs are idle, the difference is even worse:
>>>>>>>>>>>>>> # ceph tell osd.0 bench
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>       "bytes_written": 1073741824,
>>>>>>>>>>>>>>       "blocksize": 4194304,
>>>>>>>>>>>>>>       "elapsed_sec": 15.396707875000001,
>>>>>>>>>>>>>>       "bytes_per_sec": 69738403.346825853,
>>>>>>>>>>>>>>       "iops": 16.626931034761871
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> # ceph tell osd.38 bench
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>       "bytes_written": 1073741824,
>>>>>>>>>>>>>>       "blocksize": 4194304,
>>>>>>>>>>>>>>       "elapsed_sec": 6.8903985170000004,
>>>>>>>>>>>>>>       "bytes_per_sec": 155831599.77624846,
>>>>>>>>>>>>>>       "iops": 37.153148597776521
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>> On 23.04.20 at 14:39, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> On 23.04.20 at 14:06, Igor Fedotov wrote:
>>>>>>>>>>>>>>>> I don't recall any additional tuning that needs to be applied to
>>>>>>>>>>>>>>>> the new DB volume. And I assume the hardware is pretty much the
>>>>>>>>>>>>>>>> same...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you still have any significant amount of data spilled over
>>>>>>>>>>>>>>>> for these updated OSDs? If not, I don't have any valid
>>>>>>>>>>>>>>>> explanation for the phenomenon.
>>>>>>>>>>>>>>> just the 64k from here:
>>>>>>>>>>>>>>> https://tracker.ceph.com/issues/44509
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You might want to try "ceph osd bench" to compare OSDs under
>>>>>>>>>>>>>>>> pretty much the same load. Any difference observed?
>>>>>>>>>>>>>>> Servers are the same HW. OSD Bench is:
>>>>>>>>>>>>>>> # ceph tell osd.0 bench
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>        "bytes_written": 1073741824,
>>>>>>>>>>>>>>>        "blocksize": 4194304,
>>>>>>>>>>>>>>>        "elapsed_sec": 16.091414781000001,
>>>>>>>>>>>>>>>        "bytes_per_sec": 66727620.822242722,
>>>>>>>>>>>>>>>        "iops": 15.909104543266945
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # ceph tell osd.36 bench
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>        "bytes_written": 1073741824,
>>>>>>>>>>>>>>>        "blocksize": 4194304,
>>>>>>>>>>>>>>>        "elapsed_sec": 10.023828538,
>>>>>>>>>>>>>>>        "bytes_per_sec": 107118933.6419194,
>>>>>>>>>>>>>>>        "iops": 25.539143953780986
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OSD 0 is a Toshiba MG07SCA12TA SAS 12G
>>>>>>>>>>>>>>> OSD 36 is a Seagate ST12000NM0008-2H SATA 6G
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The SSDs are all the same, like the rest of the HW, and both
>>>>>>>>>>>>>>>>> drives should give the same performance according to their specs.
>>>>>>>>>>>>>>>>> The only other difference is that OSD 36 was created directly
>>>>>>>>>>>>>>>>> with the block.db device (Nautilus 14.2.7) while OSD 0 (14.2.8)
>>>>>>>>>>>>>>>>> was not.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 4/23/2020 8:35 AM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> is there anything else needed beside running:
>>>>>>>>>>>>>>>>> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD}
>>>>>>>>>>>>>>>>> bluefs-bdev-new-db --dev-target /dev/vgroup/lvdb-1
>>>>>>>>>>>>>>>>>
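>>>>>>>>>>>>>>>>> (For a sanity check afterwards, the new layout can be inspected
>>>>>>>>>>>>>>>>> with e.g.:
>>>>>>>>>>>>>>>>> # ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-sizes
>>>>>>>>>>>>>>>>> which should list the bluefs devices and their usage.)
>>>>>>>>>>>>>>>>>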
>>>>>>>>>>>>>>>>> I did so some weeks ago, and currently I'm seeing that all OSDs
>>>>>>>>>>>>>>>>> originally deployed with --block-db show 10-20% I/O waits, while
>>>>>>>>>>>>>>>>> all those that were converted using ceph-bluestore-tool show
>>>>>>>>>>>>>>>>> 80-100% I/O waits.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, is there some tuning available to make more use of the
>>>>>>>>>>>>>>>>> SSD? The SSD (block.db) is only 0-2% saturated.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>>>> Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



