Eugen,

Thanks again for your suggestions! The cluster is balanced; the OSDs on this
host and the other OSDs in the cluster are almost evenly utilized:

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
...
11  hdd    9.38680  1.00000   9.4 TiB  1.2 TiB  883 GiB  6.6 MiB  2.7 GiB  8.2 TiB  12.29  1.16   90  up
12  hdd    9.38680  0             0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  up   -- this one is intentionally out
13  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  838 GiB  8.2 MiB  2.7 GiB  8.3 TiB  11.82  1.12   92  up
14  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  838 GiB  7.6 MiB  2.4 GiB  8.3 TiB  11.82  1.12   86  up
15  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  830 GiB  6.2 MiB  2.7 GiB  8.3 TiB  11.74  1.11   80  up
16  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  809 GiB   11 MiB  2.7 GiB  8.3 TiB  11.52  1.09   89  up
17  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  876 GiB  3.2 MiB  2.8 GiB  8.2 TiB  12.22  1.16   86  up
18  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  826 GiB  3.0 MiB  2.4 GiB  8.3 TiB  11.70  1.11   83  up
19  hdd    9.38680  1.00000   9.4 TiB  1.2 TiB  916 GiB  5.7 MiB  2.7 GiB  8.2 TiB  12.64  1.20   99  up

I tried primary-affinity=0 for this OSD; it didn't have a noticeable effect.
The drive utilization is actually lower than that of the other drives (the
affected drive, sdd, is marked with >>>):

04/27/2023 01:51:41 PM
Device   r/s    rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s    wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  d/s   dMB/s  drqm/s  %drqm  d_await  dareq-sz  aqu-sz  %util
sda      23.73  1.08   9.89    29.42  1.73     46.45     52.71  0.79   5.63    9.66   0.39     15.37     0.00  0.00   0.00    0.00   0.00     0.00      0.05    6.59
sdb      16.60  0.72   6.69    28.74  2.47     44.66     39.13  0.59   4.83    10.98  0.47     15.49     0.00  0.00   0.00    0.00   0.00     0.00      0.02    3.07
sdc      20.33  0.99   9.33    31.46  1.44     50.08     50.48  0.78   5.27    9.45   0.53     15.76     0.00  0.00   0.00    0.00   0.00     0.00      0.06    2.46
>>> sdd  20.40  1.01   9.65    32.11  0.19     50.89     52.07  0.83   5.84    10.08  0.80     16.40     0.00  0.00   0.00    0.00   0.00     0.00      0.04    2.34
sde      20.84  0.98   9.12    30.43  0.78     48.34     49.57  0.75   4.86    8.93   0.04     15.56     0.00  0.00   0.00    0.00   0.00     0.00      0.02    0.79
sdf      21.53  1.03   9.58    30.78  1.59     49.01     48.30  0.79   5.10    9.54   1.06     16.70     0.00  0.00   0.00    0.00   0.00     0.00      0.02    5.85
sdg      22.41  1.06   9.85    30.54  0.93     48.32     48.60  0.81   5.58    10.29  0.14     16.99     0.00  0.00   0.00    0.00   0.00     0.00      0.03    1.42
sdh      20.09  0.97   9.20    31.41  1.83     49.55     50.06  0.77   5.20    9.42   0.18     15.66     0.00  0.00   0.00    0.00   0.00     0.00      0.05    0.02
sdi      24.95  1.14   10.42   29.45  1.29     46.81     54.25  0.88   6.10    10.10  0.21     16.55     0.00  0.00   0.00    0.00   0.00     0.00      0.03    5.21

There's a considerable difference between this and other OSDs in terms of
write speed:

# ceph tell osd.13 bench -f plain
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 11.20604339,
    "bytes_per_sec": 95818103.377877429,
    "iops": 22.844816059560163
}

# ceph tell osd.14 bench -f plain
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 26.160931698999999,
    "bytes_per_sec": 41043714.969870269,
    "iops": 9.7855842041659997
}

In general, OSD performance isn't great even though the drives themselves are
plenty fast and can easily do about 200 MB/s of sequential reads and writes;
the OSD showing high latency benches at only about half the speed of the
other OSDs. I added `osd_scrub_sleep=0.1` for now in case scrubbing is the
culprit and will observe whether that changes anything; so far no effect.
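In case it's useful, this is roughly how I applied and checked the setting via
the central config (just a sketch of what I did, not a recommendation; osd.14
below is an example id, not necessarily the affected OSD):

# ceph config set osd osd_scrub_sleep 0.1
# ceph config show osd.14 osd_scrub_sleep

The first command sets the sleep between scrub chunks for all OSDs; the second
shows what an individual OSD actually picked up.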
/Z

On Thu, 27 Apr 2023 at 15:49, Eugen Block <eblock@xxxxxx> wrote:

> I don't see anything obvious in the pg output, they are relatively
> small and don't hold many objects. If deep-scrubs would impact
> performance that much you would see that in the iostat output as well.
> Have you watched it for a while, maybe with -xmt options to see the
> %util column as well? Does that OSD show a higher utilization than
> other OSDs? Is the cluster evenly balanced (ceph osd df)? And also try
> the primary-affinity = 0 part, this would set most of the primary PGs
> on that OSD to non-primary and others would take over. If the new
> primary OSDs show increased latencies as well there might be something
> else going on.
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Thanks, Eugen. I very much appreciate your time and replies.
> >
> > It's a hybrid OSD with DB/WAL on NVME (Micron_7300_MTFDHBE1T6TDG) and
> > block storage on HDD (Toshiba MG06SCA10TE). There are 6 uniform hosts
> > with 2 x DB/WAL NVMEs and 9 x HDDs each, each NVME hosts DB/WAL for
> > 4-5 OSDs. The cluster was installed with Ceph 16.2.0, i.e. not upgraded
> > from a previous Ceph version. The general host utilization is minimal:
> >
> > ---
> >               total        used        free      shared  buff/cache   available
> > Mem:      394859228   162089492     2278392        4468   230491344   230135560
> > Swap:       8388604      410624     7977980
> > ---
> >
> > The host has 2 x Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz CPUs, 48 cores
> > and 96 threads total. The load averages are < 1.5 most of the time.
> > Iostat doesn't show anything dodgy:
> >
> > ---
> > Device     tps     kB_read/s  kB_wrtn/s  kB_dscd/s      kB_read       kB_wrtn      kB_dscd
> > dm-0       40.61       14.05    1426.88     274.81    899585953   91343089664  17592304524
> > dm-1       95.80     1085.29     804.00       0.00  69476048192   51469036032            0
> > dm-10     261.82        1.64    1046.98       0.00    104964624   67023666460            0
> > dm-11      87.62     1036.39     801.12       0.00  66345393128   51284224352            0
> > dm-12     265.95        1.65    1063.50       0.00    105717636   68081300084            0
> > dm-13      90.39     1064.38     820.32       0.00  68137422692   52513309008            0
> > dm-14     260.81        1.65    1042.94       0.00    105460360   66764843944            0
> > dm-15      88.73      976.58     778.68       0.00  62516667260   49847871016            0
> > dm-16     266.54        1.62    1065.84       0.00    103731332   68230531868            0
> > dm-17     100.70     1148.40     892.47       0.00  73516251072   57132462352            0
> > dm-18     279.91        1.77    1119.29       0.00    113498508   71652321256            0
> > dm-19      46.05      158.57     283.19       0.00  10150971936   18128700644            0
> > dm-2      277.15        1.75    1108.26       0.00    112204480   70946082624            0
> > dm-20      49.98      161.48     248.13       0.00  10337605436   15884020104            0
> > dm-3       69.60      722.59     596.21       0.00  46257546968   38166860968            0
> > dm-4      210.51        1.02     841.90       0.00     65369612   53894908104            0
> > dm-5       89.88     1000.15     789.46       0.00  64025323664   50537848140            0
> > dm-6      273.40        1.65    1093.31       0.00    105643468   69989257428            0
> > dm-7       87.50     1019.36     847.10       0.00  65255481416   54228140196            0
> > dm-8      254.77        1.70    1018.76       0.00    109124164   65217134588            0
> > dm-9       88.66      989.21     766.84       0.00  63325285524   49089975468            0
> > loop0       0.01        1.54       0.00       0.00     98623259             0            0
> > loop1       0.01        1.62       0.00       0.00    103719536             0            0
> > loop10      0.01        1.04       0.00       0.00     66341543             0            0
> > loop11      0.00        0.00       0.00       0.00           36             0            0
> > loop2       0.01        1.61       0.00       0.00    102824919             0            0
> > loop3       0.01        1.57       0.00       0.00    100808077             0            0
> > loop4       0.01        1.56       0.00       0.00    100081689             0            0
> > loop5       0.01        1.53       0.00       0.00     97741555             0            0
> > loop6       0.01        1.47       0.00       0.00     93867958             0            0
> > loop7       0.01        1.16       0.00       0.00     74491285             0            0
> > loop8       0.01        1.05       0.00       0.00     67308404             0            0
> > loop9       0.01        0.72       0.00       0.00     45939669             0            0
> > md0        44.30       33.75    1413.88     397.42   2160234553   90511235396  25441160328
> > nvme0n1   518.12       24.41    5339.35      73.24   1562435128  341803564504   4688433152
> > nvme1n1   391.03       22.11    4063.55      68.36   1415308200  260132142151   4375871488
> > nvme2n1    33.99      175.52     288.87     195.30  11236255296   18492074823  12502441984
> > nvme3n1    36.74      177.43     253.04     195.30  11358616904   16198706451  12502441984
> > nvme4n1    36.34      130.81    1417.08     275.71   8374240889   90715974981  17649735268
> > nvme5n1    35.97      101.47    1417.08     274.81   6495703006   90715974997  17592304524
> > sda        76.43     1102.34     810.08       0.00  70567310268   51858036484            0
> > sdb        55.74      741.38     606.07       0.00  47460332504   38798003512            0
> > sdc        70.79     1017.90     795.50       0.00  65161638916   50924982612            0
> > sdd        72.46     1037.63     853.95       0.00  66424612464   54666221528            0
> > sde        70.40     1007.27     771.38       0.00  64481414320   49380660785            0
> > sdf        69.81     1054.65     806.46       0.00  67514236484   51626397033            0
> > sdg        70.99     1082.70     825.42       0.00  69310316508   52840162953            0
> > sdh        70.13      995.05     783.62       0.00  63698881700   50164098561            0
> > sdi        79.18     1167.40     897.64       0.00  74732240724   57463602289            0
> > ---
> >
> > nvme0n1 and nvme1n1 are the DB/WAL drives. The block storage for the OSD
> > that consistently shows elevated latency is sdd. I had previously removed
> > this OSD and its HDD from service and ran all kinds of tests on it, no
> > issues there.
> >
> > The cluster is healthy, not much I/O going on (1 OSD is intentionally out):
> >
> > ---
> >   cluster:
> >     id:     3f50555a-ae2a-11eb-a2fc-ffde44714d86
> >     health: HEALTH_OK
> >
> >   services:
> >     mon: 5 daemons, quorum ceph01,ceph02,ceph03,ceph04,ceph05 (age 12d)
> >     mgr: ceph01.vankui(active, since 12d), standbys: ceph02.cjhexa
> >     osd: 66 osds: 66 up (since 12d), 65 in (since 2w)
> >
> >   data:
> >     pools:   9 pools, 1760 pgs
> >     objects: 5.08M objects, 16 TiB
> >     usage:   60 TiB used, 507 TiB / 567 TiB avail
> >     pgs:     1756 active+clean
> >              4    active+clean+scrubbing+deep
> >
> >   io:
> >     client: 109 MiB/s rd, 114 MiB/s wr, 1.99k op/s rd, 1.78k op/s wr
> > ---
> >
> > The output of `ceph pg ls-by-osd` is: https://pastebin.com/iW4Hx7xV
> >
> > I have tried restarting this OSD, but not compacting it, as after a
> > restart the OSD log suggests there is nothing for compaction to do. I also
> > see that scrubbing runs on the cluster pretty much all the time (scrub
> > settings and intervals are default for Pacific) and wonder if that may be
> > killing the performance.
> >
> > /Z
> >
> > On Thu, 27 Apr 2023 at 11:25, Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Those numbers look really high to me, more than 2 seconds for a write
> >> is awful. Is this an HDD-only cluster/pool? But even then it would be
> >> too high, I just compared with our HDD-backed cluster (although
> >> rocksDB is SSD-backed) which also mainly serves RBD to openstack. What
> >> is the general utilization of that host? Is it an upgraded cluster
> >> which could suffer from the performance degradation which was
> >> discussed in a recent thread? But I'd expect that more OSDs would be
> >> affected by that. How many PGs and objects are on that OSD (ceph pg
> >> ls-by-osd <ID>)? Have you tried to restart and/or compact the OSD and
> >> see if anything improves?
> >> You could set its primary-affinity to 0, or in the worst case rebuild
> >> that OSD. And there are no SMART errors or anything in dmesg reported
> >> about this disk?
> >>
> >> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >>
> >> > Thanks, Eugen!
> >> >
> >> > It's a bunch of entries like this: https://pastebin.com/TGPu6PAT - I'm
> >> > not really sure what to make of them. I checked adjacent OSDs and they
> >> > have similar ops, but aren't showing excessive latency.
> >> >
> >> > /Z
> >> >
> >> > On Thu, 27 Apr 2023 at 10:42, Eugen Block <eblock@xxxxxx> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I would monitor the historic_ops_by_duration for a while and see if
> >> >> any specific operation takes unusually long.
> >> >>
> >> >> # this is within the container
> >> >> [ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration | head
> >> >> {
> >> >>     "size": 20,
> >> >>     "duration": 600,
> >> >>     "ops": [
> >> >>         {
> >> >>             "description": "osd_repop(client.9384193.0:2056545 12.6 e2233/2221 12:6192870f:::obj_delete_at_hint.0000000053:head v 2233'696390, mlcod=2233'696388)",
> >> >>             "initiated_at": "2023-04-27T07:37:35.046036+0000",
> >> >>             "age": 54.805016199999997,
> >> >>             "duration": 0.58198468699999995,
> >> >> ...
> >> >>
> >> >> The output contains the PG (so you know which pool is involved) and
> >> >> the duration of the operation, not sure if that helps though.
> >> >>
> >> >> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >> >>
> >> >> > As suggested by someone, I tried `dump_historic_slow_ops`. There aren't
> >> >> > many, and they're somewhat difficult to interpret:
> >> >> >
> >> >> > "description": "osd_op(client.250533532.0:56821 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3518464~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299120+0000",
> >> >> > "description": "osd_op(client.250533532.0:56822 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3559424~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299132+0000",
> >> >> > "description": "osd_op(client.250533532.0:56823 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3682304~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299138+0000",
> >> >> > "description": "osd_op(client.250533532.0:56824 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3772416~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299148+0000",
> >> >> > "description": "osd_op(client.250533532.0:56825 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3796992~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299188+0000",
> >> >> > "description": "osd_op(client.250533532.0:56826 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3862528~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299198+0000",
> >> >> > "description": "osd_op(client.250533532.0:56827 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3899392~12288] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299207+0000",
> >> >> > "description": "osd_op(client.250533532.0:56828 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3944448~16384] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299250+0000",
"description": "osd_op(client.250533532.0:56829 13.16f > >> >> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head > >> [stat,write > >> >> > 4018176~4096] snapc 0=[] ondisk+write+known_if_redirected > e118835)", > >> >> > "initiated_at": "2023-04-26T07:00:58.299270+0000", > >> >> > > >> >> > There's a lot more information there ofc. I also tried to > >> >> > `dump_ops_in_flight` and there aren't many, usually 0-10 ops at a > >> time, > >> >> but > >> >> > the OSD latency remains high even when the ops count is low or > zero. > >> Any > >> >> > ideas? > >> >> > > >> >> > I would very much appreciate it if some could please point me to > the > >> >> > documentation on interpreting the output of ops dump. > >> >> > > >> >> > /Z > >> >> > > >> >> > > >> >> > On Wed, 26 Apr 2023 at 20:22, Zakhar Kirpichenko <zakhar@xxxxxxxxx > > > >> >> wrote: > >> >> > > >> >> >> Hi, > >> >> >> > >> >> >> I have a Ceph 16.2.12 cluster with uniform hardware, same drive > >> >> >> make/model, etc. A particular OSD is showing higher latency than > >> usual > >> >> in > >> >> >> `ceph osd perf`, usually mid to high tens of milliseconds while > other > >> >> OSDs > >> >> >> show low single digits, although its drive's I/O stats don't look > >> >> different > >> >> >> from those of other drives. The workload is mainly random 4K reads > >> and > >> >> >> writes, the cluster is being used as Openstack VM storage. > >> >> >> > >> >> >> Is there a way to trace, which particular PG, pool and disk image > or > >> >> >> object cause this OSD's excessive latency? Is there a way to tell > >> Ceph > >> >> to > >> >> >> > >> >> >> I would appreciate any advice or pointers. > >> >> >> > >> >> >> Best regards, > >> >> >> Zakhar > >> >> >> > >> >> > _______________________________________________ > >> >> > ceph-users mailing list -- ceph-users@xxxxxxx > >> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> >> > >> >> > >> >> _______________________________________________ > >> >> ceph-users mailing list -- ceph-users@xxxxxxx > >> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> >> > >> > _______________________________________________ > >> > ceph-users mailing list -- ceph-users@xxxxxxx > >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> > >> > >> _______________________________________________ > >> ceph-users mailing list -- ceph-users@xxxxxxx > >> To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> > > _______________________________________________ > > ceph-users mailing list -- ceph-users@xxxxxxx > > To unsubscribe send an email to ceph-users-leave@xxxxxxx > > > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx > To unsubscribe send an email to ceph-users-leave@xxxxxxx > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx