Thanks, Eugen. I very much appreciate your time and replies.

These are hybrid OSDs with DB/WAL on NVMe (Micron_7300_MTFDHBE1T6TDG) and block storage on HDD (Toshiba MG06SCA10TE). There are 6 uniform hosts with 2 x DB/WAL NVMes and 9 x HDDs each; each NVMe hosts the DB/WAL for 4-5 OSDs. The cluster was installed with Ceph 16.2.0, i.e. not upgraded from a previous Ceph version.

The general host utilization is minimal:

---
               total        used        free      shared  buff/cache   available
Mem:       394859228   162089492     2278392        4468   230491344   230135560
Swap:        8388604      410624     7977980
---

The host has 2 x Intel(R) Xeon(R) Gold 5220R CPUs @ 2.20GHz, 48 cores and 96 threads in total. Load averages are < 1.5 most of the time. Iostat doesn't show anything dodgy:

---
Device        tps   kB_read/s   kB_wrtn/s   kB_dscd/s       kB_read       kB_wrtn      kB_dscd
dm-0        40.61       14.05     1426.88      274.81     899585953   91343089664  17592304524
dm-1        95.80     1085.29      804.00        0.00   69476048192   51469036032            0
dm-10      261.82        1.64     1046.98        0.00     104964624   67023666460            0
dm-11       87.62     1036.39      801.12        0.00   66345393128   51284224352            0
dm-12      265.95        1.65     1063.50        0.00     105717636   68081300084            0
dm-13       90.39     1064.38      820.32        0.00   68137422692   52513309008            0
dm-14      260.81        1.65     1042.94        0.00     105460360   66764843944            0
dm-15       88.73      976.58      778.68        0.00   62516667260   49847871016            0
dm-16      266.54        1.62     1065.84        0.00     103731332   68230531868            0
dm-17      100.70     1148.40      892.47        0.00   73516251072   57132462352            0
dm-18      279.91        1.77     1119.29        0.00     113498508   71652321256            0
dm-19       46.05      158.57      283.19        0.00   10150971936   18128700644            0
dm-2       277.15        1.75     1108.26        0.00     112204480   70946082624            0
dm-20       49.98      161.48      248.13        0.00   10337605436   15884020104            0
dm-3        69.60      722.59      596.21        0.00   46257546968   38166860968            0
dm-4       210.51        1.02      841.90        0.00      65369612   53894908104            0
dm-5        89.88     1000.15      789.46        0.00   64025323664   50537848140            0
dm-6       273.40        1.65     1093.31        0.00     105643468   69989257428            0
dm-7        87.50     1019.36      847.10        0.00   65255481416   54228140196            0
dm-8       254.77        1.70     1018.76        0.00     109124164   65217134588            0
dm-9        88.66      989.21      766.84        0.00   63325285524   49089975468            0
loop0        0.01        1.54        0.00        0.00      98623259             0            0
loop1        0.01        1.62        0.00        0.00     103719536             0            0
loop10       0.01        1.04        0.00        0.00      66341543             0            0
loop11       0.00        0.00        0.00        0.00            36             0            0
loop2        0.01        1.61        0.00        0.00     102824919             0            0
loop3        0.01        1.57        0.00        0.00     100808077             0            0
loop4        0.01        1.56        0.00        0.00     100081689             0            0
loop5        0.01        1.53        0.00        0.00      97741555             0            0
loop6        0.01        1.47        0.00        0.00      93867958             0            0
loop7        0.01        1.16        0.00        0.00      74491285             0            0
loop8        0.01        1.05        0.00        0.00      67308404             0            0
loop9        0.01        0.72        0.00        0.00      45939669             0            0
md0         44.30       33.75     1413.88      397.42    2160234553   90511235396  25441160328
nvme0n1    518.12       24.41     5339.35       73.24    1562435128  341803564504   4688433152
nvme1n1    391.03       22.11     4063.55       68.36    1415308200  260132142151   4375871488
nvme2n1     33.99      175.52      288.87      195.30   11236255296   18492074823  12502441984
nvme3n1     36.74      177.43      253.04      195.30   11358616904   16198706451  12502441984
nvme4n1     36.34      130.81     1417.08      275.71    8374240889   90715974981  17649735268
nvme5n1     35.97      101.47     1417.08      274.81    6495703006   90715974997  17592304524
sda         76.43     1102.34      810.08        0.00   70567310268   51858036484            0
sdb         55.74      741.38      606.07        0.00   47460332504   38798003512            0
sdc         70.79     1017.90      795.50        0.00   65161638916   50924982612            0
sdd         72.46     1037.63      853.95        0.00   66424612464   54666221528            0
sde         70.40     1007.27      771.38        0.00   64481414320   49380660785            0
sdf         69.81     1054.65      806.46        0.00   67514236484   51626397033            0
sdg         70.99     1082.70      825.42        0.00   69310316508   52840162953            0
sdh         70.13      995.05      783.62        0.00   63698881700   50164098561            0
sdi         79.18     1167.40      897.64        0.00   74732240724   57463602289            0
---

nvme0n1 and nvme1n1 are the DB/WAL drives.
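For completeness, this is roughly how I've been looking at where the latency sits; osd.12 and the device names below are just placeholders for the example, not the actual slow OSD:

---
# extended per-device stats (await, %util), which the summary above doesn't include
iostat -dxm 5 /dev/sdd /dev/nvme0n1 /dev/nvme1n1

# confirm which devices back the suspect OSD (data disk vs. DB/WAL NVMe)
ceph osd metadata 12 | grep -i devices

# BlueStore latency counters for that OSD; run on its host, inside the container on a cephadm setup
ceph daemon osd.12 perf dump | jq '.bluestore | with_entries(select(.key | test("lat")))'
---

The idea is to compare the block-layer await against the BlueStore-level latencies, to tell whether the slowness comes from the HDD itself, the DB/WAL NVMe, or something above the block layer.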
The block storage device for the OSD that consistently shows elevated latency is sdd. I had previously removed this OSD and its HDD from service and ran all kinds of tests on it; no issues there. The cluster is healthy and there isn't much I/O going on (1 OSD is intentionally out):

---
  cluster:
    id:     3f50555a-ae2a-11eb-a2fc-ffde44714d86
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph01,ceph02,ceph03,ceph04,ceph05 (age 12d)
    mgr: ceph01.vankui(active, since 12d), standbys: ceph02.cjhexa
    osd: 66 osds: 66 up (since 12d), 65 in (since 2w)

  data:
    pools:   9 pools, 1760 pgs
    objects: 5.08M objects, 16 TiB
    usage:   60 TiB used, 507 TiB / 567 TiB avail
    pgs:     1756 active+clean
             4    active+clean+scrubbing+deep

  io:
    client:   109 MiB/s rd, 114 MiB/s wr, 1.99k op/s rd, 1.78k op/s wr
---

The output of `ceph pg ls-by-osd` is here: https://pastebin.com/iW4Hx7xV

I have tried restarting this OSD, but not compacting it, because after a restart the OSD log suggests there is nothing for compaction to do. I also see that scrubbing runs on the cluster pretty much all the time (scrub settings and intervals are the Pacific defaults), and I wonder whether that may be killing the performance.

/Z

On Thu, 27 Apr 2023 at 11:25, Eugen Block <eblock@xxxxxx> wrote:

> Those numbers look really high to me, more than 2 seconds for a write
> is awful. Is this a HDD-only cluster/pool? But even then it would be
> too high, I just compared with our HDD-backed cluster (although
> rocksDB is SSD-backed) which also mainly serves RBD to openstack. What
> is the general utilization of that host? Is it an upgraded cluster
> which could suffer from the performance degradation which was
> discussed in a recent thread? But I'd expect that more OSDs would be
> affected by that. How many PGs and objects are on that OSD (ceph pg
> ls-by-osd <ID>)? Have you tried to restart and/or compact the OSD and
> see if anything improves?
> You could set its primary-affinity to 0, or in the worst case rebuild
> that OSD. And there are no smart errors or anything in dmesg reported
> about this disk?
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Thanks, Eugen!
> >
> > It's a bunch of entries like this https://pastebin.com/TGPu6PAT - I'm not
> > really sure what to make of them. I checked adjacent OSDs and they have
> > similar ops, but aren't showing excessive latency.
> >
> > /Z
> >
> > On Thu, 27 Apr 2023 at 10:42, Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Hi,
> >>
> >> I would monitor the historic_ops_by_duration for a while and see if
> >> any specific operation takes unusually long.
> >>
> >> # this is within the container
> >> [ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration | head
> >> {
> >>     "size": 20,
> >>     "duration": 600,
> >>     "ops": [
> >>         {
> >>             "description": "osd_repop(client.9384193.0:2056545 12.6 e2233/2221 12:6192870f:::obj_delete_at_hint.0000000053:head v 2233'696390, mlcod=2233'696388)",
> >>             "initiated_at": "2023-04-27T07:37:35.046036+0000",
> >>             "age": 54.805016199999997,
> >>             "duration": 0.58198468699999995,
> >> ...
> >>
> >> The output contains the PG (so you know which pool is involved) and
> >> the duration of the operation, not sure if that helps though.
> >>
> >> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >>
> >> > As suggested by someone, I tried `dump_historic_slow_ops`.
> >> > There aren't many, and they're somewhat difficult to interpret:
> >> >
> >> > "description": "osd_op(client.250533532.0:56821 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3518464~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299120+0000",
> >> > "description": "osd_op(client.250533532.0:56822 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3559424~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299132+0000",
> >> > "description": "osd_op(client.250533532.0:56823 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3682304~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299138+0000",
> >> > "description": "osd_op(client.250533532.0:56824 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3772416~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299148+0000",
> >> > "description": "osd_op(client.250533532.0:56825 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3796992~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299188+0000",
> >> > "description": "osd_op(client.250533532.0:56826 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3862528~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299198+0000",
> >> > "description": "osd_op(client.250533532.0:56827 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3899392~12288] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299207+0000",
> >> > "description": "osd_op(client.250533532.0:56828 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3944448~16384] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299250+0000",
> >> > "description": "osd_op(client.250533532.0:56829 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 4018176~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299270+0000",
> >> >
> >> > There's a lot more information there ofc. I also tried `dump_ops_in_flight`
> >> > and there aren't many, usually 0-10 ops at a time, but the OSD latency
> >> > remains high even when the ops count is low or zero. Any ideas?
> >> >
> >> > I would very much appreciate it if someone could please point me to the
> >> > documentation on interpreting the output of an ops dump.
> >> >
> >> > /Z
> >> >
> >> > On Wed, 26 Apr 2023 at 20:22, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I have a Ceph 16.2.12 cluster with uniform hardware, same drive
> >> >> make/model, etc. A particular OSD is showing higher latency than usual in
> >> >> `ceph osd perf`, usually mid to high tens of milliseconds while other OSDs
> >> >> show low single digits, although its drive's I/O stats don't look different
> >> >> from those of other drives.
> >> >> The workload is mainly random 4K reads and writes; the cluster is being
> >> >> used as Openstack VM storage.
> >> >>
> >> >> Is there a way to trace which particular PG, pool and disk image or
> >> >> object cause this OSD's excessive latency? Is there a way to tell Ceph to
> >> >>
> >> >> I would appreciate any advice or pointers.
> >> >>
> >> >> Best regards,
> >> >> Zakhar
> >> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx