Eugen,

Thanks again for your suggestions! The cluster is balanced; the OSDs on this
host and the other OSDs in the cluster are almost evenly utilized:

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
...
11  hdd    9.38680  1.00000   9.4 TiB  1.2 TiB  883 GiB  6.6 MiB  2.7 GiB  8.2 TiB  12.29  1.16   90  up
12  hdd    9.38680  0             0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  up   -- this one is intentionally out
13  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  838 GiB  8.2 MiB  2.7 GiB  8.3 TiB  11.82  1.12   92  up
14  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  838 GiB  7.6 MiB  2.4 GiB  8.3 TiB  11.82  1.12   86  up
15  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  830 GiB  6.2 MiB  2.7 GiB  8.3 TiB  11.74  1.11   80  up
16  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  809 GiB   11 MiB  2.7 GiB  8.3 TiB  11.52  1.09   89  up
17  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  876 GiB  3.2 MiB  2.8 GiB  8.2 TiB  12.22  1.16   86  up
18  hdd    9.38680  1.00000   9.4 TiB  1.1 TiB  826 GiB  3.0 MiB  2.4 GiB  8.3 TiB  11.70  1.11   83  up
19  hdd    9.38680  1.00000   9.4 TiB  1.2 TiB  916 GiB  5.7 MiB  2.7 GiB  8.2 TiB  12.64  1.20   99  up

I tried primary-affinity=0 for this OSD; it didn't have a noticeable effect.
The drive utilization is actually lower than that of the other drives (the
affected drive, sdd, is marked with >>>):

04/27/2023 01:51:41 PM
Device   r/s    rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s    wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  d/s   dMB/s  drqm/s  %drqm  d_await  dareq-sz  aqu-sz  %util
sda      23.73  1.08   9.89    29.42  1.73     46.45     52.71  0.79   5.63    9.66   0.39     15.37     0.00  0.00   0.00    0.00   0.00     0.00      0.05    6.59
sdb      16.60  0.72   6.69    28.74  2.47     44.66     39.13  0.59   4.83    10.98  0.47     15.49     0.00  0.00   0.00    0.00   0.00     0.00      0.02    3.07
sdc      20.33  0.99   9.33    31.46  1.44     50.08     50.48  0.78   5.27    9.45   0.53     15.76     0.00  0.00   0.00    0.00   0.00     0.00      0.06    2.46
>>> sdd  20.40  1.01   9.65    32.11  0.19     50.89     52.07  0.83   5.84    10.08  0.80     16.40     0.00  0.00   0.00    0.00   0.00     0.00      0.04    2.34
sde      20.84  0.98   9.12    30.43  0.78     48.34     49.57  0.75   4.86    8.93   0.04     15.56     0.00  0.00   0.00    0.00   0.00     0.00      0.02    0.79
sdf      21.53  1.03   9.58    30.78  1.59     49.01     48.30  0.79   5.10    9.54   1.06     16.70     0.00  0.00   0.00    0.00   0.00     0.00      0.02    5.85
sdg      22.41  1.06   9.85    30.54  0.93     48.32     48.60  0.81   5.58    10.29  0.14     16.99     0.00  0.00   0.00    0.00   0.00     0.00      0.03    1.42
sdh      20.09  0.97   9.20    31.41  1.83     49.55     50.06  0.77   5.20    9.42   0.18     15.66     0.00  0.00   0.00    0.00   0.00     0.00      0.05    0.02
sdi      24.95  1.14   10.42   29.45  1.29     46.81     54.25  0.88   6.10    10.10  0.21     16.55     0.00  0.00   0.00    0.00   0.00     0.00      0.03    5.21

There's a considerable difference between this and other OSDs in terms of
write speed:

# ceph tell osd.13 bench -f plain
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 11.20604339,
    "bytes_per_sec": 95818103.377877429,
    "iops": 22.844816059560163
}

# ceph tell osd.14 bench -f plain
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 26.160931698999999,
    "bytes_per_sec": 41043714.969870269,
    "iops": 9.7855842041659997
}

In general, OSD performance isn't great even though the drives themselves are
plenty fast and can easily do about 200 MB/s of sequential reads and writes;
the OSD showing high latency benches at only about half the speed of the
other OSDs. I added `osd_scrub_sleep=0.1` for now in case scrubbing is the
culprit and will observe whether that changes anything; so far no effect.
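In case it's useful, this is roughly how I applied and checked the setting via
the central config (just a sketch of what I did, not a recommendation; osd.14
below is an example id, not necessarily the affected OSD):

# ceph config set osd osd_scrub_sleep 0.1
# ceph config show osd.14 osd_scrub_sleep

The first command sets the sleep between scrub chunks for all OSDs; the second
shows what an individual OSD actually picked up.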
/Z

On Thu, 27 Apr 2023 at 15:49, Eugen Block <eblock@xxxxxx> wrote:

> I don't see anything obvious in the pg output, they are relatively
> small and don't hold many objects. If deep-scrubs would impact
> performance that much you would see that in the iostat output as well.
> Have you watched it for a while, maybe with -xmt options to see the
> %util column as well? Does that OSD show a higher utilization than
> other OSDs? Is the cluster evenly balanced (ceph osd df)? And also try
> the primary-affinity = 0 part, this would set most of the primary PGs
> on that OSD to non-primary and others would take over. If the new
> primary OSDs show increased latencies as well there might be something
> else going on.
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Thanks, Eugen. I very much appreciate your time and replies.
> >
> > It's a hybrid OSD with DB/WAL on NVME (Micron_7300_MTFDHBE1T6TDG) and
> > block storage on HDD (Toshiba MG06SCA10TE). There are 6 uniform hosts
> > with 2 x DB/WAL NVMEs and 9 x HDDs each, each NVME hosts DB/WAL for
> > 4-5 OSDs. The cluster was installed with Ceph 16.2.0, i.e. not upgraded
> > from a previous Ceph version. The general host utilization is minimal:
> >
> > ---
> >               total        used        free      shared  buff/cache   available
> > Mem:      394859228   162089492     2278392        4468   230491344   230135560
> > Swap:       8388604      410624     7977980
> > ---
> >
> > The host has 2 x Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz CPUs, 48 cores
> > and 96 threads total. The load averages are < 1.5 most of the time.
> > Iostat doesn't show anything dodgy:
> >
> > ---
> > Device     tps     kB_read/s  kB_wrtn/s  kB_dscd/s      kB_read       kB_wrtn      kB_dscd
> > dm-0       40.61       14.05    1426.88     274.81    899585953   91343089664  17592304524
> > dm-1       95.80     1085.29     804.00       0.00  69476048192   51469036032            0
> > dm-10     261.82        1.64    1046.98       0.00    104964624   67023666460            0
> > dm-11      87.62     1036.39     801.12       0.00  66345393128   51284224352            0
> > dm-12     265.95        1.65    1063.50       0.00    105717636   68081300084            0
> > dm-13      90.39     1064.38     820.32       0.00  68137422692   52513309008            0
> > dm-14     260.81        1.65    1042.94       0.00    105460360   66764843944            0
> > dm-15      88.73      976.58     778.68       0.00  62516667260   49847871016            0
> > dm-16     266.54        1.62    1065.84       0.00    103731332   68230531868            0
> > dm-17     100.70     1148.40     892.47       0.00  73516251072   57132462352            0
> > dm-18     279.91        1.77    1119.29       0.00    113498508   71652321256            0
> > dm-19      46.05      158.57     283.19       0.00  10150971936   18128700644            0
> > dm-2      277.15        1.75    1108.26       0.00    112204480   70946082624            0
> > dm-20      49.98      161.48     248.13       0.00  10337605436   15884020104            0
> > dm-3       69.60      722.59     596.21       0.00  46257546968   38166860968            0
> > dm-4      210.51        1.02     841.90       0.00     65369612   53894908104            0
> > dm-5       89.88     1000.15     789.46       0.00  64025323664   50537848140            0
> > dm-6      273.40        1.65    1093.31       0.00    105643468   69989257428            0
> > dm-7       87.50     1019.36     847.10       0.00  65255481416   54228140196            0
> > dm-8      254.77        1.70    1018.76       0.00    109124164   65217134588            0
> > dm-9       88.66      989.21     766.84       0.00  63325285524   49089975468            0
> > loop0       0.01        1.54       0.00       0.00     98623259             0            0
> > loop1       0.01        1.62       0.00       0.00    103719536             0            0
> > loop10      0.01        1.04       0.00       0.00     66341543             0            0
> > loop11      0.00        0.00       0.00       0.00           36             0            0
> > loop2       0.01        1.61       0.00       0.00    102824919             0            0
> > loop3       0.01        1.57       0.00       0.00    100808077             0            0
> > loop4       0.01        1.56       0.00       0.00    100081689             0            0
> > loop5       0.01        1.53       0.00       0.00     97741555             0            0
> > loop6       0.01        1.47       0.00       0.00     93867958             0            0
> > loop7       0.01        1.16       0.00       0.00     74491285             0            0
> > loop8       0.01        1.05       0.00       0.00     67308404             0            0
> > loop9       0.01        0.72       0.00       0.00     45939669             0            0
> > md0        44.30       33.75    1413.88     397.42   2160234553   90511235396  25441160328
> > nvme0n1   518.12       24.41    5339.35      73.24   1562435128  341803564504   4688433152
> > nvme1n1   391.03       22.11    4063.55      68.36   1415308200  260132142151   4375871488
> > nvme2n1    33.99      175.52     288.87     195.30  11236255296   18492074823  12502441984
> > nvme3n1    36.74      177.43     253.04     195.30  11358616904   16198706451  12502441984
> > nvme4n1    36.34      130.81    1417.08     275.71   8374240889   90715974981  17649735268
> > nvme5n1    35.97      101.47    1417.08     274.81   6495703006   90715974997  17592304524
> > sda        76.43     1102.34     810.08       0.00  70567310268   51858036484            0
> > sdb        55.74      741.38     606.07       0.00  47460332504   38798003512            0
> > sdc        70.79     1017.90     795.50       0.00  65161638916   50924982612            0
> > sdd        72.46     1037.63     853.95       0.00  66424612464   54666221528            0
> > sde        70.40     1007.27     771.38       0.00  64481414320   49380660785            0
> > sdf        69.81     1054.65     806.46       0.00  67514236484   51626397033            0
> > sdg        70.99     1082.70     825.42       0.00  69310316508   52840162953            0
> > sdh        70.13      995.05     783.62       0.00  63698881700   50164098561            0
> > sdi        79.18     1167.40     897.64       0.00  74732240724   57463602289            0
> > ---
> >
> > nvme0n1 and nvme1n1 are the DB/WAL drives. The block storage for the OSD
> > that consistently shows elevated latency is sdd. I had previously removed
> > this OSD and its HDD from service and ran all kinds of tests on it, no
> > issues there.
> >
> > The cluster is healthy, not much I/O going on (1 OSD is intentionally out):
> >
> > ---
> >   cluster:
> >     id:     3f50555a-ae2a-11eb-a2fc-ffde44714d86
> >     health: HEALTH_OK
> >
> >   services:
> >     mon: 5 daemons, quorum ceph01,ceph02,ceph03,ceph04,ceph05 (age 12d)
> >     mgr: ceph01.vankui(active, since 12d), standbys: ceph02.cjhexa
> >     osd: 66 osds: 66 up (since 12d), 65 in (since 2w)
> >
> >   data:
> >     pools:   9 pools, 1760 pgs
> >     objects: 5.08M objects, 16 TiB
> >     usage:   60 TiB used, 507 TiB / 567 TiB avail
> >     pgs:     1756 active+clean
> >              4    active+clean+scrubbing+deep
> >
> >   io:
> >     client: 109 MiB/s rd, 114 MiB/s wr, 1.99k op/s rd, 1.78k op/s wr
> > ---
> >
> > The output of `ceph pg ls-by-osd` is: https://pastebin.com/iW4Hx7xV
> >
> > I have tried restarting this OSD, but not compacting it, as after a
> > restart the OSD log suggests there is nothing for compaction to do. I also
> > see that scrubbing runs on the cluster pretty much all the time (scrub
> > settings and intervals are default for Pacific) and wonder if that may be
> > killing the performance.
> >
> > /Z
> >
> > On Thu, 27 Apr 2023 at 11:25, Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Those numbers look really high to me, more than 2 seconds for a write
> >> is awful. Is this an HDD-only cluster/pool? But even then it would be
> >> too high, I just compared with our HDD-backed cluster (although
> >> rocksDB is SSD-backed) which also mainly serves RBD to openstack. What
> >> is the general utilization of that host? Is it an upgraded cluster
> >> which could suffer from the performance degradation which was
> >> discussed in a recent thread? But I'd expect that more OSDs would be
> >> affected by that. How many PGs and objects are on that OSD (ceph pg
> >> ls-by-osd <ID>)? Have you tried to restart and/or compact the OSD and
> >> see if anything improves?
> >> You could set its primary-affinity to 0, or in the worst case rebuild
> >> that OSD. And there are no SMART errors or anything in dmesg reported
> >> about this disk?
> >>
> >> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >>
> >> > Thanks, Eugen!
> >> >
> >> > It's a bunch of entries like this: https://pastebin.com/TGPu6PAT - I'm
> >> > not really sure what to make of them. I checked adjacent OSDs and they
> >> > have similar ops, but aren't showing excessive latency.
> >> >
> >> > /Z
> >> >
> >> > On Thu, 27 Apr 2023 at 10:42, Eugen Block <eblock@xxxxxx> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I would monitor the historic_ops_by_duration for a while and see if
> >> >> any specific operation takes unusually long.
> >> >>
> >> >> # this is within the container
> >> >> [ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration | head
> >> >> {
> >> >>     "size": 20,
> >> >>     "duration": 600,
> >> >>     "ops": [
> >> >>         {
> >> >>             "description": "osd_repop(client.9384193.0:2056545 12.6 e2233/2221 12:6192870f:::obj_delete_at_hint.0000000053:head v 2233'696390, mlcod=2233'696388)",
> >> >>             "initiated_at": "2023-04-27T07:37:35.046036+0000",
> >> >>             "age": 54.805016199999997,
> >> >>             "duration": 0.58198468699999995,
> >> >> ...
> >> >>
> >> >> The output contains the PG (so you know which pool is involved) and
> >> >> the duration of the operation, not sure if that helps though.
> >> >>
> >> >> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >> >>
> >> >> > As suggested by someone, I tried `dump_historic_slow_ops`. There aren't
> >> >> > many, and they're somewhat difficult to interpret:
> >> >> >
> >> >> > "description": "osd_op(client.250533532.0:56821 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3518464~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299120+0000",
> >> >> > "description": "osd_op(client.250533532.0:56822 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3559424~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299132+0000",
> >> >> > "description": "osd_op(client.250533532.0:56823 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3682304~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299138+0000",
> >> >> > "description": "osd_op(client.250533532.0:56824 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3772416~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299148+0000",
> >> >> > "description": "osd_op(client.250533532.0:56825 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3796992~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299188+0000",
> >> >> > "description": "osd_op(client.250533532.0:56826 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3862528~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299198+0000",
> >> >> > "description": "osd_op(client.250533532.0:56827 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3899392~12288] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299207+0000",
> >> >> > "description": "osd_op(client.250533532.0:56828 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head [stat,write 3944448~16384] snapc 0=[] ondisk+write+known_if_redirected e118835)",
> >> >> > "initiated_at": "2023-04-26T07:00:58.299250+0000",
"description": "osd_op(client.250533532.0:56829 13.16f > >> >> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head > >> [stat,write > >> >> > 4018176~4096] snapc 0=[] ondisk+write+known_if_redirected > e118835)", > >> >> > "initiated_at": "2023-04-26T07:00:58.299270+0000", > >> >> > > >> >> > There's a lot more information there ofc. I also tried to > >> >> > `dump_ops_in_flight` and there aren't many, usually 0-10 ops at a > >> time, > >> >> but > >> >> > the OSD latency remains high even when the ops count is low or > zero. > >> Any > >> >> > ideas? > >> >> > > >> >> > I would very much appreciate it if some could please point me to > the > >> >> > documentation on interpreting the output of ops dump. > >> >> > > >> >> > /Z > >> >> > > >> >> > > >> >> > On Wed, 26 Apr 2023 at 20:22, Zakhar Kirpichenko <zakhar@xxxxxxxxx > > > >> >> wrote: > >> >> > > >> >> >> Hi, > >> >> >> > >> >> >> I have a Ceph 16.2.12 cluster with uniform hardware, same drive > >> >> >> make/model, etc. A particular OSD is showing higher latency than > >> usual > >> >> in > >> >> >> `ceph osd perf`, usually mid to high tens of milliseconds while > other > >> >> OSDs > >> >> >> show low single digits, although its drive's I/O stats don't look > >> >> different > >> >> >> from those of other drives. The workload is mainly random 4K reads > >> and > >> >> >> writes, the cluster is being used as Openstack VM storage. > >> >> >> > >> >> >> Is there a way to trace, which particular PG, pool and disk image > or > >> >> >> object cause this OSD's excessive latency? Is there a way to tell > >> Ceph > >> >> to > >> >> >> > >> >> >> I would appreciate any advice or pointers. > >> >> >> > >> >> >> Best regards, > >> >> >> Zakhar > >> >> >> > >> >> > _______________________________________________ > >> >> > ceph-users mailing list -- ceph-users@xxxxxxx > >> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> >> > >> >> > >> >> _______________________________________________ > >> >> ceph-users mailing list -- ceph-users@xxxxxxx > >> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> >> > >> > _______________________________________________ > >> > ceph-users mailing list -- ceph-users@xxxxxxx > >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> > >> > >> _______________________________________________ > >> ceph-users mailing list -- ceph-users@xxxxxxx > >> To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> > > _______________________________________________ > > ceph-users mailing list -- ceph-users@xxxxxxx > > To unsubscribe send an email to ceph-users-leave@xxxxxxx > > > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx > To unsubscribe send an email to ceph-users-leave@xxxxxxx > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx