Thanks, Eugen. I very much appreciate your time and replies.

These are hybrid OSDs with DB/WAL on NVMe (Micron_7300_MTFDHBE1T6TDG) and block storage on HDD (Toshiba MG06SCA10TE). There are 6 uniform hosts with 2 x DB/WAL NVMes and 9 x HDDs each; each NVMe hosts the DB/WAL for 4-5 OSDs. The cluster was installed with Ceph 16.2.0, i.e. not upgraded from a previous Ceph version.

The general host utilization is minimal:

---
               total        used        free      shared  buff/cache   available
Mem:       394859228   162089492     2278392        4468   230491344   230135560
Swap:        8388604      410624     7977980
---

The host has 2 x Intel(R) Xeon(R) Gold 5220R CPUs @ 2.20GHz, 48 cores and 96 threads in total. Load averages are < 1.5 most of the time. Iostat doesn't show anything dodgy:

---
Device        tps   kB_read/s   kB_wrtn/s   kB_dscd/s       kB_read       kB_wrtn      kB_dscd
dm-0        40.61       14.05     1426.88      274.81     899585953   91343089664  17592304524
dm-1        95.80     1085.29      804.00        0.00   69476048192   51469036032            0
dm-10      261.82        1.64     1046.98        0.00     104964624   67023666460            0
dm-11       87.62     1036.39      801.12        0.00   66345393128   51284224352            0
dm-12      265.95        1.65     1063.50        0.00     105717636   68081300084            0
dm-13       90.39     1064.38      820.32        0.00   68137422692   52513309008            0
dm-14      260.81        1.65     1042.94        0.00     105460360   66764843944            0
dm-15       88.73      976.58      778.68        0.00   62516667260   49847871016            0
dm-16      266.54        1.62     1065.84        0.00     103731332   68230531868            0
dm-17      100.70     1148.40      892.47        0.00   73516251072   57132462352            0
dm-18      279.91        1.77     1119.29        0.00     113498508   71652321256            0
dm-19       46.05      158.57      283.19        0.00   10150971936   18128700644            0
dm-2       277.15        1.75     1108.26        0.00     112204480   70946082624            0
dm-20       49.98      161.48      248.13        0.00   10337605436   15884020104            0
dm-3        69.60      722.59      596.21        0.00   46257546968   38166860968            0
dm-4       210.51        1.02      841.90        0.00      65369612   53894908104            0
dm-5        89.88     1000.15      789.46        0.00   64025323664   50537848140            0
dm-6       273.40        1.65     1093.31        0.00     105643468   69989257428            0
dm-7        87.50     1019.36      847.10        0.00   65255481416   54228140196            0
dm-8       254.77        1.70     1018.76        0.00     109124164   65217134588            0
dm-9        88.66      989.21      766.84        0.00   63325285524   49089975468            0
loop0        0.01        1.54        0.00        0.00      98623259             0            0
loop1        0.01        1.62        0.00        0.00     103719536             0            0
loop10       0.01        1.04        0.00        0.00      66341543             0            0
loop11       0.00        0.00        0.00        0.00            36             0            0
loop2        0.01        1.61        0.00        0.00     102824919             0            0
loop3        0.01        1.57        0.00        0.00     100808077             0            0
loop4        0.01        1.56        0.00        0.00     100081689             0            0
loop5        0.01        1.53        0.00        0.00      97741555             0            0
loop6        0.01        1.47        0.00        0.00      93867958             0            0
loop7        0.01        1.16        0.00        0.00      74491285             0            0
loop8        0.01        1.05        0.00        0.00      67308404             0            0
loop9        0.01        0.72        0.00        0.00      45939669             0            0
md0         44.30       33.75     1413.88      397.42    2160234553   90511235396  25441160328
nvme0n1    518.12       24.41     5339.35       73.24    1562435128  341803564504   4688433152
nvme1n1    391.03       22.11     4063.55       68.36    1415308200  260132142151   4375871488
nvme2n1     33.99      175.52      288.87      195.30   11236255296   18492074823  12502441984
nvme3n1     36.74      177.43      253.04      195.30   11358616904   16198706451  12502441984
nvme4n1     36.34      130.81     1417.08      275.71    8374240889   90715974981  17649735268
nvme5n1     35.97      101.47     1417.08      274.81    6495703006   90715974997  17592304524
sda         76.43     1102.34      810.08        0.00   70567310268   51858036484            0
sdb         55.74      741.38      606.07        0.00   47460332504   38798003512            0
sdc         70.79     1017.90      795.50        0.00   65161638916   50924982612            0
sdd         72.46     1037.63      853.95        0.00   66424612464   54666221528            0
sde         70.40     1007.27      771.38        0.00   64481414320   49380660785            0
sdf         69.81     1054.65      806.46        0.00   67514236484   51626397033            0
sdg         70.99     1082.70      825.42        0.00   69310316508   52840162953            0
sdh         70.13      995.05      783.62        0.00   63698881700   50164098561            0
sdi         79.18     1167.40      897.64        0.00   74732240724   57463602289            0
---

nvme0n1 and nvme1n1 are the DB/WAL drives.
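For completeness, this is roughly how I've been looking at where the latency sits; osd.12 and the device names below are just placeholders for the example, not the actual slow OSD:

---
# extended per-device stats (await, %util), which the summary above doesn't include
iostat -dxm 5 /dev/sdd /dev/nvme0n1 /dev/nvme1n1

# confirm which devices back the suspect OSD (data disk vs. DB/WAL NVMe)
ceph osd metadata 12 | grep -i devices

# BlueStore latency counters for that OSD; run on its host, inside the container on a cephadm setup
ceph daemon osd.12 perf dump | jq '.bluestore | with_entries(select(.key | test("lat")))'
---

The idea is to compare the block-layer await against the BlueStore-level latencies, to tell whether the slowness comes from the HDD itself, the DB/WAL NVMe, or something above the block layer.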
The block storage device for the OSD that consistently shows elevated latency is sdd. I had previously removed this OSD and its HDD from service and ran all kinds of tests on it; no issues there. The cluster is healthy and there isn't much I/O going on (1 OSD is intentionally out):

---
  cluster:
    id:     3f50555a-ae2a-11eb-a2fc-ffde44714d86
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph01,ceph02,ceph03,ceph04,ceph05 (age 12d)
    mgr: ceph01.vankui(active, since 12d), standbys: ceph02.cjhexa
    osd: 66 osds: 66 up (since 12d), 65 in (since 2w)

  data:
    pools:   9 pools, 1760 pgs
    objects: 5.08M objects, 16 TiB
    usage:   60 TiB used, 507 TiB / 567 TiB avail
    pgs:     1756 active+clean
             4    active+clean+scrubbing+deep

  io:
    client:   109 MiB/s rd, 114 MiB/s wr, 1.99k op/s rd, 1.78k op/s wr
---

The output of `ceph pg ls-by-osd` is here: https://pastebin.com/iW4Hx7xV

I have tried restarting this OSD, but not compacting it, because after a restart the OSD log suggests there is nothing for compaction to do. I also see that scrubbing runs on the cluster pretty much all the time (scrub settings and intervals are the Pacific defaults), and I wonder whether that may be killing the performance.

/Z

On Thu, 27 Apr 2023 at 11:25, Eugen Block <eblock@xxxxxx> wrote:

> Those numbers look really high to me, more than 2 seconds for a write
> is awful. Is this a HDD-only cluster/pool? But even then it would be
> too high, I just compared with our HDD-backed cluster (although
> rocksDB is SSD-backed) which also mainly serves RBD to openstack. What
> is the general utilization of that host? Is it an upgraded cluster
> which could suffer from the performance degradation which was
> discussed in a recent thread? But I'd expect that more OSDs would be
> affected by that. How many PGs and objects are on that OSD (ceph pg
> ls-by-osd <ID>)? Have you tried to restart and/or compact the OSD and
> see if anything improves?
> You could set its primary-affinity to 0, or in the worst case rebuild
> that OSD. And there are no smart errors or anything in dmesg reported
> about this disk?
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Thanks, Eugen!
> >
> > It's a bunch of entries like this https://pastebin.com/TGPu6PAT - I'm not
> > really sure what to make of them. I checked adjacent OSDs and they have
> > similar ops, but aren't showing excessive latency.
> >
> > /Z
> >
> > On Thu, 27 Apr 2023 at 10:42, Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Hi,
> >>
> >> I would monitor the historic_ops_by_duration for a while and see if
> >> any specific operation takes unusually long.
> >>
> >> # this is within the container
> >> [ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration | head
> >> {
> >>     "size": 20,
> >>     "duration": 600,
> >>     "ops": [
> >>         {
> >>             "description": "osd_repop(client.9384193.0:2056545 12.6 e2233/2221 12:6192870f:::obj_delete_at_hint.0000000053:head v 2233'696390, mlcod=2233'696388)",
> >>             "initiated_at": "2023-04-27T07:37:35.046036+0000",
> >>             "age": 54.805016199999997,
> >>             "duration": 0.58198468699999995,
> >> ...
> >>
> >> The output contains the PG (so you know which pool is involved) and
> >> the duration of the operation, not sure if that helps though.
> >>
> >> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >>
> >> > As suggested by someone, I tried `dump_historic_slow_ops`.
> >> > There aren't many, and they're somewhat difficult to interpret:
> >> >
> >> > "description": "osd_op(client.250533532.0:56821 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3518464~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299120+0000",
> >> > "description": "osd_op(client.250533532.0:56822 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3559424~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299132+0000",
> >> > "description": "osd_op(client.250533532.0:56823 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3682304~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299138+0000",
> >> > "description": "osd_op(client.250533532.0:56824 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3772416~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299148+0000",
> >> > "description": "osd_op(client.250533532.0:56825 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3796992~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299188+0000",
> >> > "description": "osd_op(client.250533532.0:56826 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3862528~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299198+0000",
> >> > "description": "osd_op(client.250533532.0:56827 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3899392~12288] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299207+0000",
> >> > "description": "osd_op(client.250533532.0:56828 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3944448~16384] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299250+0000",
> >> > "description": "osd_op(client.250533532.0:56829 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 4018176~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> > "initiated_at": "2023-04-26T07:00:58.299270+0000",
> >> >
> >> > There's a lot more information there ofc. I also tried `dump_ops_in_flight`
> >> > and there aren't many, usually 0-10 ops at a time, but the OSD latency
> >> > remains high even when the ops count is low or zero. Any ideas?
> >> >
> >> > I would very much appreciate it if someone could please point me to the
> >> > documentation on interpreting the output of an ops dump.
> >> >
> >> > /Z
> >> >
> >> > On Wed, 26 Apr 2023 at 20:22, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I have a Ceph 16.2.12 cluster with uniform hardware, same drive
> >> >> make/model, etc. A particular OSD is showing higher latency than usual in
> >> >> `ceph osd perf`, usually mid to high tens of milliseconds while other OSDs
> >> >> show low single digits, although its drive's I/O stats don't look different
> >> >> from those of other drives.
> >> >> The workload is mainly random 4K reads and writes; the cluster is being
> >> >> used as Openstack VM storage.
> >> >>
> >> >> Is there a way to trace which particular PG, pool and disk image or
> >> >> object cause this OSD's excessive latency? Is there a way to tell Ceph to
> >> >>
> >> >> I would appreciate any advice or pointers.
> >> >>
> >> >> Best regards,
> >> >> Zakhar
> >> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx