On 11.05.20 13:25, Igor Fedotov wrote:
> Hi Stefan,
>
> I don't have specific preferences, hence any public storage you prefer.
>
> Just one note - I presume you collected the logs for the full set of 10
> runs. Which is redundant, could you please collect detailed logs (one
> per OSD) for single-shot runs.
>
> Sorry for the unclear previous inquiry.

No problem - I'll recreate them and send you those logs in private.

> Additionally I realized that it's the faster OSD.38 which has higher
> flush/sync latency values. Which is valid for both attempts.
>
> This seems pretty odd to be honest. Is that correct indeed, wasn't
> anything misplaced along the road?

Yes, this is indeed correct and valid for all attempts. I hope the logs
will give more information. Currently I suspect strange behaviour of the
Toshiba drives (no SMR).

Stefan

> Thanks,
>
> Igor
>
>
> On 5/11/2020 9:44 AM, Stefan Priebe - Profihost AG wrote:
>> Hi Igor,
>>
>> where to post the logs?
>>
>> On 06.05.20 09:23, Stefan Priebe - Profihost AG wrote:
>>> Hi Igor,
>>>
>>> On 05.05.20 16:10, Igor Fedotov wrote:
>>>> Hi Stefan,
>>>>
>>>> so (surprise!) some DB access counters show a significant
>>>> difference, e.g.
>>>>
>>>> "kv_flush_lat": {
>>>>     "avgcount": 1423,
>>>>     "sum": 0.000906419,
>>>>     "avgtime": 0.000000636
>>>> },
>>>> "kv_sync_lat": {
>>>>     "avgcount": 1423,
>>>>     "sum": 0.712888091,
>>>>     "avgtime": 0.000500975
>>>> },
>>>>
>>>> vs.
>>>>
>>>> "kv_flush_lat": {
>>>>     "avgcount": 1146,
>>>>     "sum": 3.346228802,
>>>>     "avgtime": 0.002919920
>>>> },
>>>> "kv_sync_lat": {
>>>>     "avgcount": 1146,
>>>>     "sum": 3.754915016,
>>>>     "avgtime": 0.003276540
>>>> },
>>>>
>>>> Also for bluefs:
>>>> "bytes_written_sst": 0,
>>>> vs.
>>>> "bytes_written_sst": 59785361,
>>>>
>>>> Could you please rerun these benchmark/perf counter gathering steps
>>>> a couple more times and check if the difference is persistent.
>>>
>>> I reset all perf counters and ran the bench 10 times on each OSD.
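
For reference, one such single-shot reset/bench/dump cycle, with debug-bluefs
turned up for the duration of the run, could look roughly like this. This is
only a sketch: it assumes the commands run on each OSD's host with the admin
socket reachable, that jq is available for pulling out the counters, and that
the counter sections are named as on recent Nautilus builds. The ten-run
numbers follow below.

  for id in 0 38; do
    ceph daemon osd.$id config set debug_bluefs 20/20   # verbose BlueFS logging for this run
    ceph daemon osd.$id perf reset all                  # start from clean counters
    ceph tell osd.$id bench -f plain 12288000 4096      # single shot, same parameters as used in this thread
    ceph daemon osd.$id perf dump > perf.osd.$id.json   # counters for exactly this run
    ceph daemon osd.$id config set debug_bluefs 1/5     # back to the default level
  done

  # kv_flush_lat / kv_sync_lat sit in the "bluestore" section, bytes_written_sst in "bluefs":
  jq '.bluestore.kv_flush_lat, .bluestore.kv_sync_lat, .bluefs.bytes_written_sst' perf.osd.38.json
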
>>>
>>> OSD 38:
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.22796 sec at 9.5 MiB/sec
>>> 2.44k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.26407 sec at 9.3 MiB/sec
>>> 2.37k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.24987 sec at 9.4 MiB/sec
>>> 2.40k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.37125 sec at 8.5 MiB/sec
>>> 2.19k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.25549 sec at 9.3 MiB/sec
>>> 2.39k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.24358 sec at 9.4 MiB/sec
>>> 2.41k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.24208 sec at 9.4 MiB/sec
>>> 2.42k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.2433 sec at 9.4 MiB/sec
>>> 2.41k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.26548 sec at 9.3 MiB/sec
>>> 2.37k IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.31509 sec at 8.9 MiB/sec
>>> 2.28k IOPS
>>>
>>> kv_flush_lat.sum: 8.955978864
>>> kv_sync_lat.sum: 10.869536503
>>> bytes_written_sst: 0
>>>
>>> OSD 0:
>>> bench: wrote 12 MiB in blocks of 4 KiB in 5.71447 sec at 2.1 MiB/sec
>>> 524 IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.18679 sec at 1.9 MiB/sec
>>> 484 IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.69068 sec at 1.8 MiB/sec
>>> 448 IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.06413 sec at 1.7 MiB/sec
>>> 424 IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.50321 sec at 1.6 MiB/sec
>>> 399 IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.86882 sec at 1.7 MiB/sec
>>> 436 IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.11702 sec at 1.6 MiB/sec
>>> 421 IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.10497 sec at 1.6 MiB/sec
>>> 422 IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.69801 sec at 1.7 MiB/sec
>>> 447 IOPS
>>> bench: wrote 12 MiB in blocks of 4 KiB in 7.13588 sec at 1.6 MiB/sec
>>> 420 IOPS
>>>
>>> kv_flush_lat.sum: 0.003866224
>>> kv_sync_lat.sum: 2.667407139
>>> bytes_written_sst: 34904457
>>>
>>>> If that's particularly true for the "kv_flush_lat" counter - please
>>>> rerun with debug-bluefs set to 20 and collect OSD logs for both cases.
>>>
>>> Yes, it's still true for kv_flush_lat - see above. Where to upload / put
>>> those logs?
>>>
>>> Greets,
>>> Stefan
>>>
>>>> Thanks,
>>>> Igor
>>>>
>>>> On 5/5/2020 11:46 AM, Stefan Priebe - Profihost AG wrote:
>>>>> Hello Igor,
>>>>>
>>>>> On 30.04.20 15:52, Igor Fedotov wrote:
>>>>>> 1) reset perf counters for the specific OSD
>>>>>>
>>>>>> 2) run bench
>>>>>>
>>>>>> 3) dump perf counters.
>>>>>
>>>>> This is OSD 0:
>>>>>
>>>>> # ceph tell osd.0 bench -f plain 12288000 4096
>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 6.70482 sec at 1.7 MiB/sec
>>>>> 447 IOPS
>>>>>
>>>>> https://pastebin.com/raw/hbKcU07g
>>>>>
>>>>> This is OSD 38:
>>>>>
>>>>> # ceph tell osd.38 bench -f plain 12288000 4096
>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 2.01763 sec at 5.8 MiB/sec
>>>>> 1.49k IOPS
>>>>>
>>>>> https://pastebin.com/raw/Tx2ckVm1
>>>>>
>>>>>> Collecting disks' (both main and db) activity with iostat would be
>>>>>> nice too. But please either increase the benchmark duration or reduce
>>>>>> the iostat probe period to 0.1 or 0.05 second.
>>>>>
>>>>> This gives me:
>>>>>
>>>>> # ceph tell osd.38 bench -f plain 122880000 4096
>>>>> Error EINVAL: 'count' values greater than 12288000 for a block size
>>>>> of 4 KiB, assuming 100 IOPS, for 30 seconds, can cause ill effects
>>>>> on osd. Please adjust 'osd_bench_small_size_max_iops' with a higher
>>>>> value if you wish to use a higher 'count'.
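
If a longer bench is wanted so iostat has something to sample, the cap named
in that error can be raised at runtime. A sketch, not a recommendation: the
injected value only lasts until the OSD restarts, /dev/sdX and /dev/sdY are
placeholders for the main and DB devices, and if the installed iostat does not
accept sub-second intervals, the longer run is the alternative Igor mentions.

  ceph tell osd.38 injectargs '--osd_bench_small_size_max_iops=2000'   # cap becomes 2000 IOPS * 4 KiB * 30 s (per the error above)
  ceph tell osd.38 bench -f plain 122880000 4096                       # ten times as many 4 KiB writes
  iostat -x 1 /dev/sdX /dev/sdY                                        # run alongside the bench on the OSD host
  ceph tell osd.38 injectargs '--osd_bench_small_size_max_iops=100'    # back to the previous value
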
>>>>>
>>>>> Stefan
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>> On 4/28/2020 8:42 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hi Igor,
>>>>>>>
>>>>>>> but the performance issue is still present even on the recreated
>>>>>>> OSD.
>>>>>>>
>>>>>>> # ceph tell osd.38 bench -f plain 12288000 4096
>>>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 1.63389 sec at 7.2 MiB/sec
>>>>>>> 1.84k IOPS
>>>>>>>
>>>>>>> vs.
>>>>>>>
>>>>>>> # ceph tell osd.10 bench -f plain 12288000 4096
>>>>>>> bench: wrote 12 MiB in blocks of 4 KiB in 10.7454 sec at 1.1 MiB/sec
>>>>>>> 279 IOPS
>>>>>>>
>>>>>>> Both are backed by the same SAMSUNG SSD as block.db.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 28.04.20 19:12, Stefan Priebe - Profihost AG wrote:
>>>>>>>> Hi Igor,
>>>>>>>>
>>>>>>>> On 27.04.20 15:03, Igor Fedotov wrote:
>>>>>>>>> Just left a comment at https://tracker.ceph.com/issues/44509
>>>>>>>>>
>>>>>>>>> Generally bdev-new-db performs no migration; RocksDB might
>>>>>>>>> eventually do that, but there is no guarantee it moves everything.
>>>>>>>>>
>>>>>>>>> One should use bluefs-bdev-migrate to do the actual migration.
>>>>>>>>>
>>>>>>>>> And I think that's the root cause for the above ticket.
>>>>>>>>
>>>>>>>> Perfect - this removed all spillover in seconds.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Igor
>>>>>>>>>
>>>>>>>>> On 4/24/2020 2:37 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> No, not a standalone WAL. I wanted to ask whether bdev-new-db
>>>>>>>>>> migrated DB and WAL from HDD to SSD.
>>>>>>>>>>
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>>> On 24.04.2020 13:01, Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Unless you have 3 different types of disks behind an OSD (e.g.
>>>>>>>>>>> HDD, SSD, NVMe), a standalone WAL makes no sense.
>>>>>>>>>>>
>>>>>>>>>>> On 4/24/2020 1:58 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>> Is the WAL device missing? Do I need to run bluefs-bdev-new-db
>>>>>>>>>>>> and WAL?
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>>> On 24.04.2020 11:32, Stefan Priebe - Profihost AG
>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Igor,
>>>>>>>>>>>>>
>>>>>>>>>>>>> there must be a difference. I purged osd.0 and recreated it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now it gives:
>>>>>>>>>>>>> ceph tell osd.0 bench
>>>>>>>>>>>>> {
>>>>>>>>>>>>>     "bytes_written": 1073741824,
>>>>>>>>>>>>>     "blocksize": 4194304,
>>>>>>>>>>>>>     "elapsed_sec": 8.1554735639999993,
>>>>>>>>>>>>>     "bytes_per_sec": 131659040.46819863,
>>>>>>>>>>>>>     "iops": 31.389961354303033
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> What's wrong with adding a block.db device later?
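
Nothing is inherently wrong with adding a block.db device later, but as Igor
explains above, bluefs-bdev-new-db alone leaves the existing RocksDB data on
the slow device; bluefs-bdev-migrate moves it. A sketch of the whole sequence,
assuming systemd-managed OSDs and the same paths as in the original command
quoted further down the thread:

  systemctl stop ceph-osd@${OSD}

  # attach the new (fast) DB device
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD} \
      bluefs-bdev-new-db --dev-target /dev/vgroup/lvdb-1

  # move the BlueFS/RocksDB files still on the main device over to the new DB device
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD} \
      bluefs-bdev-migrate --devs-source /var/lib/ceph/osd/ceph-${OSD}/block \
      --dev-target /var/lib/ceph/osd/ceph-${OSD}/block.db

  systemctl start ceph-osd@${OSD}
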
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 23.04.20 20:34, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> if the OSDs are idle the difference is even worse:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # ceph tell osd.0 bench
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>     "bytes_written": 1073741824,
>>>>>>>>>>>>>>     "blocksize": 4194304,
>>>>>>>>>>>>>>     "elapsed_sec": 15.396707875000001,
>>>>>>>>>>>>>>     "bytes_per_sec": 69738403.346825853,
>>>>>>>>>>>>>>     "iops": 16.626931034761871
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # ceph tell osd.38 bench
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>     "bytes_written": 1073741824,
>>>>>>>>>>>>>>     "blocksize": 4194304,
>>>>>>>>>>>>>>     "elapsed_sec": 6.8903985170000004,
>>>>>>>>>>>>>>     "bytes_per_sec": 155831599.77624846,
>>>>>>>>>>>>>>     "iops": 37.153148597776521
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 23.04.20 14:39, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 23.04.20 14:06, Igor Fedotov wrote:
>>>>>>>>>>>>>>>> I don't recall any additional tuning to be applied to the
>>>>>>>>>>>>>>>> new DB volume. And I assume the hardware is pretty much the
>>>>>>>>>>>>>>>> same...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you still have any significant amount of data spilled
>>>>>>>>>>>>>>>> over for these updated OSDs? If not, I don't have any valid
>>>>>>>>>>>>>>>> explanation for the phenomena.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just the 64k from here:
>>>>>>>>>>>>>>> https://tracker.ceph.com/issues/44509
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You might want to try "ceph osd bench" to compare OSDs
>>>>>>>>>>>>>>>> under pretty much the same load. Any difference observed?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Servers are the same HW. OSD bench is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # ceph tell osd.0 bench
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>     "bytes_written": 1073741824,
>>>>>>>>>>>>>>>     "blocksize": 4194304,
>>>>>>>>>>>>>>>     "elapsed_sec": 16.091414781000001,
>>>>>>>>>>>>>>>     "bytes_per_sec": 66727620.822242722,
>>>>>>>>>>>>>>>     "iops": 15.909104543266945
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # ceph tell osd.36 bench
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>     "bytes_written": 1073741824,
>>>>>>>>>>>>>>>     "blocksize": 4194304,
>>>>>>>>>>>>>>>     "elapsed_sec": 10.023828538,
>>>>>>>>>>>>>>>     "bytes_per_sec": 107118933.6419194,
>>>>>>>>>>>>>>>     "iops": 25.539143953780986
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OSD 0 is a Toshiba MG07SCA12TA SAS 12G
>>>>>>>>>>>>>>> OSD 36 is a Seagate ST12000NM0008-2H SATA 6G
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> SSDs are all the same, like the rest of the HW. Both drives
>>>>>>>>>>>>>>> should give the same performance according to their specs.
>>>>>>>>>>>>>>> The only other difference is that OSD 36 was directly created
>>>>>>>>>>>>>>> with the block.db device (Nautilus 14.2.7) while OSD 0
>>>>>>>>>>>>>>> (14.2.8) was not.
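
One way to double-check that a converted OSD really reports a dedicated,
non-rotational DB device is the OSD metadata. A sketch; the field names below
are what recent Nautilus builds expose and should be treated as an assumption:

  for id in 0 36 38; do
    echo "osd.$id:"
    ceph osd metadata $id | grep -E '"bluefs_dedicated_db"|"bluefs_db_rotational"|"bluefs_db_dev_node"'
  done

A converted OSD and a freshly deployed one should look identical here; if they
do, the remaining difference is more likely the HDD itself (Toshiba vs.
Seagate, as discussed above) than the DB setup.
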
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 4/23/2020 8:35 AM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> is there anything else needed besides running:
>>>>>>>>>>>>>>>>> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD}
>>>>>>>>>>>>>>>>> bluefs-bdev-new-db --dev-target /dev/vgroup/lvdb-1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I did so some weeks ago and currently I'm seeing that all
>>>>>>>>>>>>>>>>> OSDs originally deployed with --block-db show 10-20% I/O
>>>>>>>>>>>>>>>>> waits, while all those converted using ceph-bluestore-tool
>>>>>>>>>>>>>>>>> show 80-100% I/O waits.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, is there some tuning available to use more of the
>>>>>>>>>>>>>>>>> SSD? The SSD (block-db) is only saturated at 0-2%.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>>>>> Stefan
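
To see whether the converted OSDs are the ones still spilling over, and how
much of the DB SSD BlueFS actually uses, the cluster health and the per-OSD
bluefs counters can be checked. Again only a sketch, assuming Nautilus-era
counter names and jq on the OSD host:

  ceph health detail | grep -i spillover    # BLUEFS_SPILLOVER lists affected OSDs, if any

  # on the host that carries the OSD in question:
  ceph daemon osd.0 perf dump bluefs | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'

slow_used_bytes should drop to (or near) zero once bluefs-bdev-migrate has
been run, as described earlier in the thread.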