Thanks, Eugen! It's a bunch of entries like this:
https://pastebin.com/TGPu6PAT - I'm not really sure what to make of them.
I checked adjacent OSDs and they have similar ops, but aren't showing
excessive latency.

/Z

On Thu, 27 Apr 2023 at 10:42, Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> I would monitor the historic_ops_by_duration for a while and see if
> any specific operation takes unusually long.
>
> # this is within the container
> [ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration | head
> {
>     "size": 20,
>     "duration": 600,
>     "ops": [
>         {
>             "description": "osd_repop(client.9384193.0:2056545 12.6
> e2233/2221 12:6192870f:::obj_delete_at_hint.0000000053:head v
> 2233'696390, mlcod=2233'696388)",
>             "initiated_at": "2023-04-27T07:37:35.046036+0000",
>             "age": 54.805016199999997,
>             "duration": 0.58198468699999995,
> ...
>
> The output contains the PG (so you know which pool is involved) and
> the duration of the operation, not sure if that helps though.
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > As suggested by someone, I tried `dump_historic_slow_ops`. There aren't
> > many, and they're somewhat difficult to interpret:
> >
> > "description": "osd_op(client.250533532.0:56821 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write
> > 3518464~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299120+0000",
> > "description": "osd_op(client.250533532.0:56822 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write
> > 3559424~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299132+0000",
> > "description": "osd_op(client.250533532.0:56823 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write
> > 3682304~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299138+0000",
> > "description": "osd_op(client.250533532.0:56824 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write
> > 3772416~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299148+0000",
> > "description": "osd_op(client.250533532.0:56825 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write
> > 3796992~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299188+0000",
> > "description": "osd_op(client.250533532.0:56826 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write
> > 3862528~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299198+0000",
> > "description": "osd_op(client.250533532.0:56827 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write
> > 3899392~12288] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299207+0000",
> > "description": "osd_op(client.250533532.0:56828 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write
> > 3944448~16384] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299250+0000",
> > "description": "osd_op(client.250533532.0:56829 13.16f
> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write
> > 4018176~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> > "initiated_at": "2023-04-26T07:00:58.299270+0000",
> >
> > There's a lot more information there ofc. I also tried
> > `dump_ops_in_flight` and there aren't many, usually 0-10 ops at a time,
> > but the OSD latency remains high even when the ops count is low or zero.
> > Any ideas?
> >
> > I would very much appreciate it if someone could please point me to the
> > documentation on interpreting the output of ops dump.
> >
> > /Z
> >
> >
> > On Wed, 26 Apr 2023 at 20:22, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >
> >> Hi,
> >>
> >> I have a Ceph 16.2.12 cluster with uniform hardware, same drive
> >> make/model, etc. A particular OSD is showing higher latency than usual
> >> in `ceph osd perf`, usually mid to high tens of milliseconds while other
> >> OSDs show low single digits, although its drive's I/O stats don't look
> >> different from those of other drives. The workload is mainly random 4K
> >> reads and writes; the cluster is being used as OpenStack VM storage.
> >>
> >> Is there a way to trace which particular PG, pool and disk image or
> >> object causes this OSD's excessive latency? Is there a way to tell Ceph to
> >>
> >> I would appreciate any advice or pointers.
> >>
> >> Best regards,
> >> Zakhar
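
Following up on Eugen's suggestion above: a rough way to see whether one PG
(and therefore one pool) dominates the slow operations is to aggregate the
historic ops by the PG id embedded in each op description. This is only a
sketch, assuming jq is available inside the container and with osd.N standing
in for the slow OSD:

# count historic ops per PG id (the second field of each op description)
ceph daemon osd.N dump_historic_ops_by_duration \
    | jq -r '.ops[].description | split(" ")[1]' \
    | sort | uniq -c | sort -rn | head

# the number before the dot in a PG id (e.g. "13" in 13.16f) is the pool id;
# list pool ids and names to translate it
ceph osd lspools

If one PG id clearly dominates the output, that narrows the problem to a
single pool and placement group; if the slow ops are spread evenly across PGs,
the bottleneck is more likely the device or the OSD process itself.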
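
On tracing a slow op back to a disk image: for RBD pools, the hex string in
the object name (rbd_data.eed629ecc1f946.*) is the image's block_name_prefix,
so it can be matched against `rbd info` output. Another sketch, with <pool>
used as a placeholder for the pool that owns PG 13.16f:

# find the RBD image whose objects are named rbd_data.eed629ecc1f946.*
for img in $(rbd ls <pool>); do
    rbd info <pool>/"$img" | grep -q 'rbd_data.eed629ecc1f946' && echo "$img"
done

# optionally confirm which PG and OSDs one of those objects maps to
ceph osd map <pool> rbd_data.eed629ecc1f946.000000000000001c

Identifying the image doesn't explain the latency by itself, but it does show
which client or volume is generating the ops that land on the slow OSD.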