Thanks, Eugen. I very much appreciate your time and replies.
It's a hybrid OSD with DB/WAL on NVMe (Micron_7300_MTFDHBE1T6TDG) and block
storage on HDD (Toshiba MG06SCA10TE). There are 6 uniform hosts with 2 x
DB/WAL NVMe drives and 9 x HDDs each; each NVMe hosts the DB/WAL for 4-5
OSDs. The cluster was installed with Ceph 16.2.0, i.e. not upgraded from a
previous Ceph version. The general host utilization is minimal:
---
              total        used        free      shared  buff/cache   available
Mem:      394859228   162089492     2278392        4468   230491344   230135560
Swap:       8388604      410624     7977980
---
The host has 2 x Intel(R) Xeon(R) Gold 5220R CPUs @ 2.20GHz, 48 cores and 96
threads total. Load averages are below 1.5 most of the time. iostat doesn't
show anything dodgy:
---
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 40.61 14.05 1426.88 274.81 899585953 91343089664 17592304524
dm-1 95.80 1085.29 804.00 0.00 69476048192 51469036032 0
dm-10 261.82 1.64 1046.98 0.00 104964624 67023666460 0
dm-11 87.62 1036.39 801.12 0.00 66345393128 51284224352 0
dm-12 265.95 1.65 1063.50 0.00 105717636 68081300084 0
dm-13 90.39 1064.38 820.32 0.00 68137422692 52513309008 0
dm-14 260.81 1.65 1042.94 0.00 105460360 66764843944 0
dm-15 88.73 976.58 778.68 0.00 62516667260 49847871016 0
dm-16 266.54 1.62 1065.84 0.00 103731332 68230531868 0
dm-17 100.70 1148.40 892.47 0.00 73516251072 57132462352 0
dm-18 279.91 1.77 1119.29 0.00 113498508 71652321256 0
dm-19 46.05 158.57 283.19 0.00 10150971936 18128700644 0
dm-2 277.15 1.75 1108.26 0.00 112204480 70946082624 0
dm-20 49.98 161.48 248.13 0.00 10337605436 15884020104 0
dm-3 69.60 722.59 596.21 0.00 46257546968 38166860968 0
dm-4 210.51 1.02 841.90 0.00 65369612 53894908104 0
dm-5 89.88 1000.15 789.46 0.00 64025323664 50537848140 0
dm-6 273.40 1.65 1093.31 0.00 105643468 69989257428 0
dm-7 87.50 1019.36 847.10 0.00 65255481416 54228140196 0
dm-8 254.77 1.70 1018.76 0.00 109124164 65217134588 0
dm-9 88.66 989.21 766.84 0.00 63325285524 49089975468 0
loop0 0.01 1.54 0.00 0.00 98623259 0 0
loop1 0.01 1.62 0.00 0.00 103719536 0 0
loop10 0.01 1.04 0.00 0.00 66341543 0 0
loop11 0.00 0.00 0.00 0.00 36 0 0
loop2 0.01 1.61 0.00 0.00 102824919 0 0
loop3 0.01 1.57 0.00 0.00 100808077 0 0
loop4 0.01 1.56 0.00 0.00 100081689 0 0
loop5 0.01 1.53 0.00 0.00 97741555 0 0
loop6 0.01 1.47 0.00 0.00 93867958 0 0
loop7 0.01 1.16 0.00 0.00 74491285 0 0
loop8 0.01 1.05 0.00 0.00 67308404 0 0
loop9 0.01 0.72 0.00 0.00 45939669 0 0
md0 44.30 33.75 1413.88 397.42 2160234553 90511235396 25441160328
nvme0n1 518.12 24.41 5339.35 73.24 1562435128 341803564504 4688433152
nvme1n1 391.03 22.11 4063.55 68.36 1415308200 260132142151 4375871488
nvme2n1 33.99 175.52 288.87 195.30 11236255296 18492074823 12502441984
nvme3n1 36.74 177.43 253.04 195.30 11358616904 16198706451 12502441984
nvme4n1 36.34 130.81 1417.08 275.71 8374240889 90715974981 17649735268
nvme5n1 35.97 101.47 1417.08 274.81 6495703006 90715974997 17592304524
sda 76.43 1102.34 810.08 0.00 70567310268 51858036484 0
sdb 55.74 741.38 606.07 0.00 47460332504 38798003512 0
sdc 70.79 1017.90 795.50 0.00 65161638916 50924982612 0
sdd 72.46 1037.63 853.95 0.00 66424612464 54666221528 0
sde 70.40 1007.27 771.38 0.00 64481414320 49380660785 0
sdf 69.81 1054.65 806.46 0.00 67514236484 51626397033 0
sdg 70.99 1082.70 825.42 0.00 69310316508 52840162953 0
sdh 70.13 995.05 783.62 0.00 63698881700 50164098561 0
sdi 79.18 1167.40 897.64 0.00 74732240724 57463602289 0
---
nvme0n1 and nvme1n1 are the DB/WAL drives. The block device of the OSD that
consistently shows elevated latency is sdd. I had previously removed this OSD
and its HDD from service and ran all kinds of tests on the drive; no issues
were found.
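For reference, these are the kinds of checks I mean; osd.12 below is just a
placeholder for the affected OSD's actual ID:
---
# confirm which data and DB/WAL devices back the suspect OSD
ceph osd metadata 12 | grep -E '"devices"|"bluefs_db_devices"'
# device health metrics tracked by Ceph, if device monitoring is enabled
ceph device ls-by-daemon osd.12
# SMART attributes of the suspect HDD
smartctl -a /dev/sdd
# per-OSD commit/apply latency as reported by Ceph
ceph osd perf
---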
The cluster is healthy, not much I/O going on (1 OSD is intentionally out):
---
  cluster:
    id:     3f50555a-ae2a-11eb-a2fc-ffde44714d86
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph01,ceph02,ceph03,ceph04,ceph05 (age 12d)
    mgr: ceph01.vankui(active, since 12d), standbys: ceph02.cjhexa
    osd: 66 osds: 66 up (since 12d), 65 in (since 2w)

  data:
    pools:   9 pools, 1760 pgs
    objects: 5.08M objects, 16 TiB
    usage:   60 TiB used, 507 TiB / 567 TiB avail
    pgs:     1756 active+clean
             4    active+clean+scrubbing+deep

  io:
    client: 109 MiB/s rd, 114 MiB/s wr, 1.99k op/s rd, 1.78k op/s wr
---
The output of `ceph pg ls-by-osd` is: https://pastebin.com/iW4Hx7xV
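In case it helps, this is a quick way to summarize that output per pool,
again with osd.12 as a placeholder ID and using jq (assuming the JSON output
carries a pg_stats array, as it does on my Pacific clusters):
---
# count PGs per pool on the suspect OSD; the pool ID is the part of the
# PG ID before the dot
ceph pg ls-by-osd 12 -f json | jq -r '.pg_stats[].pgid' | cut -d. -f1 | sort | uniq -c
---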
I have tried restarting this OSD, but have not compacted it, because after a
restart the OSD log suggests there is nothing to compact. I also see that
scrubbing runs on the cluster pretty much all the time (scrub settings and
intervals are at the Pacific defaults), and I wonder whether that may be
killing the performance.
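If scrubbing is the culprit, I suppose I could verify the effective settings,
see which PGs on that OSD are scrubbing, and try an online compaction anyway;
a rough sketch, again with osd.12 as a placeholder ID:
---
# effective scrub-related settings on the suspect OSD
ceph config show osd.12 | grep -E 'osd_(scrub|deep_scrub)'
# PGs mapped to that OSD that are currently scrubbing
ceph pg ls-by-osd 12 | grep scrub
# trigger an online RocksDB compaction regardless of what the startup log says
ceph tell osd.12 compact
---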
/Z
On Thu, 27 Apr 2023 at 11:25, Eugen Block <eblock@xxxxxx> wrote:
Those numbers look really high to me, more than 2 seconds for a write
is awful. Is this a HDD-only cluster/pool? But even then it would be
too high, I just compared with our HDD-backed cluster (although
rocksDB is SSD-backed) which also mainly serves RBD to openstack. What
is the general utilization of that host? Is it an upgraded cluster
which could suffer from the performance degradation that was
discussed in a recent thread? But I'd expect that more OSDs would be
affected by that. How many PGs and objects are on that OSD (ceph pg
ls-by-osd <ID>)? Have you tried to restart and/or compact the OSD and
see if anything improves?
You could set its primary-affinity to 0, or in the worst case rebuild
that OSD. And there are no SMART errors or anything in dmesg reported
about this disk?
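Just for reference, adjusting the primary affinity would be something
like this (12 standing in for the OSD id):

# make the OSD unlikely to be chosen as primary, so client ops are
# coordinated by other replicas where possible
ceph osd primary-affinity 12 0
# restore the default once you're done testing
ceph osd primary-affinity 12 1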
Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> Thanks, Eugen!
>
> It's a bunch of entries like this https://pastebin.com/TGPu6PAT - I'm not
> really sure what to make of them. I checked adjacent OSDs and they have
> similar ops, but aren't showing excessive latency.
>
> /Z
>
> On Thu, 27 Apr 2023 at 10:42, Eugen Block <eblock@xxxxxx> wrote:
>
>> Hi,
>>
>> I would monitor the historic_ops_by_duration for a while and see if
>> any specific operation takes unusually long.
>>
>> # this is within the container
>> [ceph: root@storage01 /]# ceph daemon osd.0 dump_historic_ops_by_duration | head
>> {
>> "size": 20,
>> "duration": 600,
>> "ops": [
>> {
>> "description": "osd_repop(client.9384193.0:2056545 12.6
>> e2233/2221 12:6192870f:::obj_delete_at_hint.0000000053:head v
>> 2233'696390, mlcod=2233'696388)",
>> "initiated_at": "2023-04-27T07:37:35.046036+0000",
>> "age": 54.805016199999997,
>> "duration": 0.58198468699999995,
>> ...
>>
>> The output contains the PG (so you know which pool is involved) and
>> the duration of the operation, not sure if that helps though.
>>
>> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>>
>> > As suggested by someone, I tried `dump_historic_slow_ops`. There aren't
>> > many, and they're somewhat difficult to interpret:
>> >
>> > "description": "osd_op(client.250533532.0:56821 13.16f
>> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head
[stat,write
>> > 3518464~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
>> > "initiated_at": "2023-04-26T07:00:58.299120+0000",
>> > "description": "osd_op(client.250533532.0:56822 13.16f
>> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head
[stat,write
>> > 3559424~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
>> > "initiated_at": "2023-04-26T07:00:58.299132+0000",
>> > "description": "osd_op(client.250533532.0:56823 13.16f
>> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head
[stat,write
>> > 3682304~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
>> > "initiated_at": "2023-04-26T07:00:58.299138+0000",
>> > "description": "osd_op(client.250533532.0:56824 13.16f
>> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head
[stat,write
>> > 3772416~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
>> > "initiated_at": "2023-04-26T07:00:58.299148+0000",
>> > "description": "osd_op(client.250533532.0:56825 13.16f
>> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head
[stat,write
>> > 3796992~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
>> > "initiated_at": "2023-04-26T07:00:58.299188+0000",
>> > "description": "osd_op(client.250533532.0:56826 13.16f
>> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head
[stat,write
>> > 3862528~8192] snapc 0=[] ondisk+write+known_if_redirected e118835)",
>> > "initiated_at": "2023-04-26T07:00:58.299198+0000",
>> > "description": "osd_op(client.250533532.0:56827 13.16f
>> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head
[stat,write
>> > 3899392~12288] snapc 0=[] ondisk+write+known_if_redirected e118835)",
>> > "initiated_at": "2023-04-26T07:00:58.299207+0000",
>> > "description": "osd_op(client.250533532.0:56828 13.16f
>> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head
[stat,write
>> > 3944448~16384] snapc 0=[] ondisk+write+known_if_redirected e118835)",
>> > "initiated_at": "2023-04-26T07:00:58.299250+0000",
>> > "description": "osd_op(client.250533532.0:56829 13.16f
>> > 13:f6c9079e:::rbd_data.eed629ecc1f946.000000000000001c:head
[stat,write
>> > 4018176~4096] snapc 0=[] ondisk+write+known_if_redirected e118835)",
>> > "initiated_at": "2023-04-26T07:00:58.299270+0000",
>> >
>> > There's a lot more information there ofc. I also tried to
>> > `dump_ops_in_flight` and there aren't many, usually 0-10 ops at a time, but
>> > the OSD latency remains high even when the ops count is low or zero. Any
>> > ideas?
>> >
>> > I would very much appreciate it if someone could please point me to the
>> > documentation on interpreting the output of ops dump.
>> >
>> > /Z
>> >
>> >
>> > On Wed, 26 Apr 2023 at 20:22, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
>> >
>> >> Hi,
>> >>
>> >> I have a Ceph 16.2.12 cluster with uniform hardware, same drive
>> >> make/model, etc. A particular OSD is showing higher latency than usual in
>> >> `ceph osd perf`, usually mid to high tens of milliseconds while other OSDs
>> >> show low single digits, although its drive's I/O stats don't look different
>> >> from those of other drives. The workload is mainly random 4K reads and
>> >> writes; the cluster is being used as OpenStack VM storage.
>> >>
>> >> Is there a way to trace which particular PG, pool and disk image or
>> >> object cause this OSD's excessive latency? Is there a way to tell Ceph to
>> >>
>> >> I would appreciate any advice or pointers.
>> >>
>> >> Best regards,
>> >> Zakhar
>> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx