Hi Frank,

thanks for the input. I'm still a bit sceptical, to be honest, that this is all, since (a) our bench values are pretty stable over time (both the Nautilus and the Octopus numbers), with a variance of maybe 20% that I would put down to normal cluster load, and (b) the HDD pool also halved its performance, the I/O waitstates halved as well, and the raw OSD I/O utilisation dropped by 50% since the update.

From old tests (actually done with fio) I still see that in our old setup (10GE only) we could achieve 310k IOPS on NVMe-only test storage, and our current SSDs do around 35k per disk, so I guess we should be able to reach higher values than we do right now, given enough clients.

I need to know if there is a proper explanation for the waitstates vs. performance drop… ;-)

Cheers, Kai

> On 7 Dec 2021, at 12:57, Frank Schilder <frans@xxxxxx> wrote:
>
> Hi,
>
> I also executed some benchmarks on RBD and found that the ceph built-in benchmark commands were both way too optimistic and highly unreliable. Successive executions of the same command (say, rbd bench over a 30-minute interval) would give results with a factor of 2-3 between averages. I moved to fio, which gave consistent, realistic and reproducible results.
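>
> For reference, a minimal fio job file along these lines is what I mean by "fio with the rbd engine" (client, pool and image names are placeholders; the test image has to exist beforehand, e.g. created with "rbd create"):
>
> ; -- rbd.fio -- librbd-based benchmark, no kernel mapping needed
> [global]
> ioengine=rbd
> clientname=admin
> pool=SSD
> rbdname=fio-test
> invalidate=0
> rw=randwrite
> bs=4k
> time_based
> runtime=1800
>
> [rbd-iodepth256]
> iodepth=256
>
> Run it with "fio rbd.fio"; unlike rbd bench, successive runs should agree closely.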
> I vaguely remember that there was a fix for this behaviour in the OSD code quite some time ago. I believe there was a caching issue that was in the way of realistic results. It might be due to this fix that you now see realistic IOPS instead of wishful-thinking IOPS. It is possible that your earlier test results were a factor of 2 too optimistic and your current results are the right ones. The factor of 2 you see is what I saw between rbd bench and fio with the rbd engine (fio was a factor of 2 lower).
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: ceph@xxxxxxxxxx
> Sent: 07 December 2021 11:55:38
> To: Dan van der Ster
> Cc: Ceph Users
> Subject: Re: 50% IOPS performance drop after upgrade from Nautilus 14.2.22 to Octopus 15.2.15
>
> Hi Dan, Josh,
>
> thanks for the input. We tried bluefs_buffered_io with both true and false; no real difference to be seen (hard to say in a production cluster, maybe a few percent).
>
> We have now disabled the write cache on our SSDs and see a "felt" performance increase, up to 17k IOPS with 4k blocks, but still far from the original values. Anyway, thanks! ;-)
>
> As for the 1024 bytes: I think at some point we also wanted to test network latencies, but I agree, 4k is the sensible minimum.
>
> Below are the links to the I/O graphs of the boxes I referenced in the first mail; still new to good old mailing lists ;-)
>
> The IOPS graph is the most significant: the I/O waitstates of the system dropped by nearly 50%, which I think is the reason for our drop in IOPS overall. Just no clue why… (the update was on the 10th of Nov). I guess I want these waitstates back :-o
>
> https://kai.freshx.de/img/ceph-io.png
>
> The latencies of some disks (OSD.12 is a small SSD (1.5 TB), OSD.15 a big one (7 TB)); here I guess the big one gets four times more IOPS because of its weight in Ceph:
>
> https://kai.freshx.de/img/ceph-lat.png
>
> Cheers, Kai
>
>> On 6 Dec 2021, at 18:12, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> It's a bit weird that you benchmark 1024 bytes -- or is that your realistic use case? This is smaller than the min alloc unit even for SSDs, so every write needs a read/modify/write cycle, which slows things down substantially.
>>
>> Anyway, since you didn't mention it, have you disabled the write cache on your drives? See
>> https://docs.ceph.com/en/latest/start/hardware-recommendations/#write-caches
>> for the latest related docs.
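>>
>> On SAS/SATA drives that is usually a one-liner per device, for example (device names are placeholders; the docs above also cover making the setting persistent across reboots):
>>
>> sdparm --set WCE=0 --save /dev/sdX    # SAS: clear the Write Cache Enable bit
>> smartctl -s wcache,off /dev/sdX       # alternative, via smartctl
>> hdparm -W 0 /dev/sdX                  # SATA drives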
>>
>> -- Dan
>>
>> On Mon, Dec 6, 2021 at 5:28 PM <ceph@xxxxxxxxxx> wrote:
>>>
>>> Dear list,
>>>
>>> Until we upgraded our cluster 3 weeks ago, we had a small, high-performing production Ceph cluster running Nautilus 14.2.22 on Proxmox 6.4 (kernel 5.4-143 at the time). Then we started the upgrade to Octopus 15.2.15. Since we did an online upgrade, we stopped the automatic conversion with
>>>
>>> ceph config set osd bluestore_fsck_quick_fix_on_mount false
>>>
>>> and then performed the OMAP conversion step by step after the complete upgrade, restarting one OSD after the other.
>>>
>>> Our setup is:
>>> 5 x storage nodes, each: 16 x 2.3 GHz, 64 GB RAM, 1 x SSD OSD 1.6 TB, 1 x 7.68 TB (both WD enterprise, SAS-12), 3 x HDD OSD (10 TB, SAS-12, with Optane cache)
>>> 4 x compute nodes
>>> 40 GE storage network (Mellanox switch + Mellanox CX354 40GE dual-port cards, Linux OSS drivers)
>>> 10 GE cluster/mgmt network
>>>
>>> Our performance before the upgrade, Ceph 14.2.22 (about 36k IOPS on the SSD pool):
>>>
>>> ### SSD pool on 40GE switches
>>> # rados bench -p SSD 30 -t 256 -b 1024 write
>>> hints = 1
>>> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for up to 30 seconds or 0 objects
>>> ...
>>> Total time run:         30.004
>>> Total writes made:      1094320
>>> Write size:             1024
>>> Object size:            1024
>>> Bandwidth (MB/sec):     35.6177
>>> Stddev Bandwidth:       4.71909
>>> Max bandwidth (MB/sec): 40.7314
>>> Min bandwidth (MB/sec): 21.3037
>>> Average IOPS:           36472
>>> Stddev IOPS:            4832.35
>>> Max IOPS:               41709
>>> Min IOPS:               21815
>>> Average Latency(s):     0.00701759
>>> Stddev Latency(s):      0.00854068
>>> Max latency(s):         0.445397
>>> Min latency(s):         0.000909089
>>> Cleaning up (deleting benchmark objects)
>>>
>>> Our performance after the upgrade, Ceph 15.2.15 (drops to at most ~17k IOPS on the SSD pool):
>>>
>>> # rados bench -p SSD 30 -t 256 -b 1024 write
>>> hints = 1
>>> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for up to 30 seconds or 0 objects
>>> ...
>>> Total time run:         30.0146
>>> Total writes made:      468513
>>> Write size:             1024
>>> Object size:            1024
>>> Bandwidth (MB/sec):     15.2437
>>> Stddev Bandwidth:       0.78677
>>> Max bandwidth (MB/sec): 16.835
>>> Min bandwidth (MB/sec): 13.3184
>>> Average IOPS:           15609
>>> Stddev IOPS:            805.652
>>> Max IOPS:               17239
>>> Min IOPS:               13638
>>> Average Latency(s):     0.016396
>>> Stddev Latency(s):      0.00777054
>>> Max latency(s):         0.140793
>>> Min latency(s):         0.00106735
>>> Cleaning up (deleting benchmark objects)
>>>
>>> Note: OSD.17 is out on purpose.
>>>
>>> # ceph osd tree
>>> ID   CLASS  WEIGHT     TYPE NAME           STATUS  REWEIGHT  PRI-AFF
>>>  -1         208.94525  root default
>>>  -3          41.43977      host xx-ceph01
>>>   0    hdd    9.17380          osd.0           up   1.00000  1.00000
>>>   5    hdd    9.17380          osd.5           up   1.00000  1.00000
>>>  23    hdd   14.65039          osd.23          up   1.00000  1.00000
>>>   7    ssd    1.45549          osd.7           up   1.00000  1.00000
>>>  15    ssd    6.98630          osd.15          up   1.00000  1.00000
>>>  -5          41.43977      host xx-ceph02
>>>   1    hdd    9.17380          osd.1           up   1.00000  1.00000
>>>   4    hdd    9.17380          osd.4           up   1.00000  1.00000
>>>  24    hdd   14.65039          osd.24          up   1.00000  1.00000
>>>   9    ssd    1.45549          osd.9           up   1.00000  1.00000
>>>  20    ssd    6.98630          osd.20          up   1.00000  1.00000
>>>  -7          41.43977      host xx-ceph03
>>>   2    hdd    9.17380          osd.2           up   1.00000  1.00000
>>>   3    hdd    9.17380          osd.3           up   1.00000  1.00000
>>>  25    hdd   14.65039          osd.25          up   1.00000  1.00000
>>>   8    ssd    1.45549          osd.8           up   1.00000  1.00000
>>>  21    ssd    6.98630          osd.21          up   1.00000  1.00000
>>> -17          41.43977      host xx-ceph04
>>>  10    hdd    9.17380          osd.10          up   1.00000  1.00000
>>>  11    hdd    9.17380          osd.11          up   1.00000  1.00000
>>>  26    hdd   14.65039          osd.26          up   1.00000  1.00000
>>>   6    ssd    1.45549          osd.6           up   1.00000  1.00000
>>>  22    ssd    6.98630          osd.22          up   1.00000  1.00000
>>> -21          43.18616      host xx-ceph05
>>>  13    hdd    9.17380          osd.13          up   1.00000  1.00000
>>>  14    hdd    9.17380          osd.14          up   1.00000  1.00000
>>>  27    hdd   14.65039          osd.27          up   1.00000  1.00000
>>>  12    ssd    1.45540          osd.12          up   1.00000  1.00000
>>>  16    ssd    1.74660          osd.16          up   1.00000  1.00000
>>>  17    ssd    3.49309          osd.17          up         0  1.00000
>>>  18    ssd    1.74660          osd.18          up   1.00000  1.00000
>>>  19    ssd    1.74649          osd.19          up   1.00000  1.00000
>>>
>>> # ceph osd df
>>> ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>>>  0    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   28 MiB  5.0 GiB  6.6 TiB  27.56  0.96   88  up
>>>  5    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   57 MiB  5.1 GiB  6.6 TiB  27.89  0.98   89  up
>>> 23    hdd  14.65039   1.00000   15 TiB  3.9 TiB  3.8 TiB   40 MiB  7.2 GiB   11 TiB  26.69  0.93  137  up
>>>  7    ssd   1.45549   1.00000  1.5 TiB  634 GiB  633 GiB   33 MiB  1.8 GiB  856 GiB  42.57  1.49   64  up
>>> 15    ssd   6.98630   1.00000  7.0 TiB  2.6 TiB  2.6 TiB  118 MiB  5.9 GiB  4.4 TiB  37.70  1.32  272  up
>>>  1    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   31 MiB  4.7 GiB  6.8 TiB  26.04  0.91   83  up
>>>  4    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   28 MiB  5.2 GiB  6.6 TiB  28.51  1.00   91  up
>>> 24    hdd  14.65039   1.00000   15 TiB  4.0 TiB  3.9 TiB   38 MiB  7.2 GiB   11 TiB  27.06  0.95  139  up
>>>  9    ssd   1.45549   1.00000  1.5 TiB  583 GiB  582 GiB   30 MiB  1.6 GiB  907 GiB  39.13  1.37   59  up
>>> 20    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   81 MiB  7.4 GiB  4.5 TiB  35.45  1.24  260  up
>>>  2    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   26 MiB  4.8 GiB  6.8 TiB  26.01  0.91   83  up
>>>  3    hdd   9.17380   1.00000  9.2 TiB  2.7 TiB  2.6 TiB   29 MiB  5.4 GiB  6.5 TiB  29.38  1.03   94  up
>>> 25    hdd  14.65039   1.00000   15 TiB  4.2 TiB  4.1 TiB   41 MiB  7.7 GiB   10 TiB  28.79  1.01  149  up
>>>  8    ssd   1.45549   1.00000  1.5 TiB  637 GiB  635 GiB   34 MiB  1.7 GiB  854 GiB  42.71  1.49   65  up
>>> 21    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   96 MiB  7.5 GiB  4.5 TiB  35.49  1.24  260  up
>>> 10    hdd   9.17380   1.00000  9.2 TiB  2.2 TiB  2.1 TiB   26 MiB  4.5 GiB  7.0 TiB  24.21  0.85   77  up
>>> 11    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   30 MiB  5.0 GiB  6.7 TiB  27.24  0.95   87  up
>>> 26    hdd  14.65039   1.00000   15 TiB  3.6 TiB  3.5 TiB   37 MiB  6.6 GiB   11 TiB  24.64  0.86  127  up
>>>  6    ssd   1.45549   1.00000  1.5 TiB  572 GiB  570 GiB   29 MiB  1.5 GiB  918 GiB  38.38  1.34   57  up
>>> 22    ssd   6.98630   1.00000  7.0 TiB  2.3 TiB  2.3 TiB   77 MiB  7.0 GiB  4.7 TiB  33.23  1.16  243  up
>>> 13    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   25 MiB  4.8 GiB  6.8 TiB  26.07  0.91   84  up
>>> 14    hdd   9.17380   1.00000  9.2 TiB  2.3 TiB  2.2 TiB   54 MiB  4.6 GiB  6.9 TiB  25.13  0.88   80  up
>>> 27    hdd  14.65039   1.00000   15 TiB  3.7 TiB  3.6 TiB   54 MiB  6.9 GiB   11 TiB  25.55  0.89  131  up
>>> 12    ssd   1.45540   1.00000  1.5 TiB  619 GiB  617 GiB  163 MiB  2.3 GiB  871 GiB  41.53  1.45   63  up
>>> 16    ssd   1.74660   1.00000  1.7 TiB  671 GiB  669 GiB   23 MiB  2.2 GiB  1.1 TiB  37.51  1.31   69  up
>>> 17    ssd   3.49309         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0  up
>>> 18    ssd   1.74660   1.00000  1.7 TiB  512 GiB  509 GiB   18 MiB  2.3 GiB  1.2 TiB  28.62  1.00   52  up
>>> 19    ssd   1.74649   1.00000  1.7 TiB  709 GiB  707 GiB   64 MiB  2.0 GiB  1.1 TiB  39.64  1.39   72  up
>>>                         TOTAL  205 TiB   59 TiB   57 TiB  1.3 GiB  128 GiB  147 TiB  28.60
>>> MIN/MAX VAR: 0.85/1.49  STDDEV: 6.81
>>>
>>> What we have done so far (no success); representative commands are sketched after this list:
>>>
>>> - reformatted two of the SSD OSDs (one was still from Luminous, non-LVM)
>>> - set bluestore_allocator from hybrid back to bitmap
>>> - set osd_memory_target to 6442450944 for some of the SSD OSDs
>>> - cpupower idle-set -D 11
>>> - set bluefs_buffered_io to true
>>> - disabled the default firewalls between Ceph nodes (for testing only)
>>> - disabled AppArmor
>>> - added memory (runs now on 128 GB per node)
>>> - upgraded the OS, runs now on kernel 5.13.19-1
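>>>
>>> (For reference, the allocator/memory/cache settings were applied roughly like this; which OSDs got the memory override varied, osd.15 is just an example:)
>>>
>>> ceph config set osd bluestore_allocator bitmap       # takes effect after an OSD restart
>>> ceph config set osd bluefs_buffered_io true
>>> ceph config set osd.15 osd_memory_target 6442450944  # per-OSD override, 6 GiB
>>> cpupower idle-set -D 11                              # disable deep C-states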
>>>
>>> What we observe:
>>>
>>> - The HDD pool shows similar behaviour.
>>> - Load is higher since the update, seemingly more CPU consumption (see graph 1); the migration was on 10 Nov, around 10 pm.
>>> - Latency on the "big" 7 TB SSDs (i.e. OSD.15) is significantly higher than on the small 1.6 TB SSDs (OSD.12), see graph 2; presumably due to the higher weight, though.
>>> - Load of OSD.15 is four times higher than load of OSD.12; presumably also due to the higher weight.
>>> - Startup of OSD.15 (the 7 TB SSD) is significantly slower (~10 sec) compared to the 1.6 TB SSDs.
>>> - Increasing the block size in the benchmark to 4k, 8k or even 16k increases the throughput but keeps the IOPS more or less stable; even at 32k the drop is minimal, to ~14k IOPS on average.
>>>
>>> We have already checked the Proxmox list without finding any remedies, and we are a bit helpless. Any suggestions, and/or does someone else have similar experiences?
>>>
>>> We are a bit hesitant to upgrade to Pacific, given the current situation.
>>>
>>> Thanks,
>>>
>>> Kai
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx