Hi,
it's possible that the log messages are a consequence of the slow
requests. Having only 4 OSDs is something of a corner case: Ceph is
designed as a scalable solution, so the real benefits start with a
large number of OSDs, where many client requests can be parallelized
across many OSDs. And with only 4 OSDs, the one with the most IO is
probably the primary OSD. You have more RGWs than OSDs; are all of
your RGWs serving client IO? Could you provide more details about the
load the clients are producing? I assume the OSDs are also virtual
disks?
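To check how the load is distributed, you could look at the per-OSD
utilization and latency and at which pools the client IO actually
goes to, for example (osd.2 below is just the OSD from your slow ops
warnings):

$ ceph osd df tree              # per-OSD usage and PG count
$ ceph osd perf                 # per-OSD commit/apply latency
$ ceph osd pool stats           # client IO rates per pool
$ ceph pg ls-by-primary osd.2   # PGs that have osd.2 as their primary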
In any case, I would recommend scaling out if you can.
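In the meantime, to see what the slow ops on osd.2 are actually
doing, you could dump them from the OSD's admin socket (run this on
the node hosting osd.2, from inside the OSD container since you run
containerized):

$ ceph daemon osd.2 dump_ops_in_flight   # ops currently in flight / blocked
$ ceph daemon osd.2 dump_historic_ops    # recently finished ops with per-event timestamps

The event timestamps usually show where the time is spent, e.g.
queued for the PG, waiting for sub ops on the replica OSDs, or
committing to disk.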
Regards,
Eugen
Quoting Alex Hussein-Kershaw <alexhus@xxxxxxxxxxxxx>:
Hi ceph-users,
We're having an issue on our test cluster: S3 requests are slow,
usually taking a few seconds but up to 30s on occasion.
This is a multisite cluster of 4 VMs running Ceph in containers. We
have 4 OSDs, 3 MDS, 3 MONs and 3 RGWs. We're running ceph version
14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable).
The cluster is backed by SSDs.
One of our 4 OSDs is doing much more read IO than the others,
bouncing between 0 and 300 MB/s. The other 3 OSDs appear mostly
idle from an IO perspective. The OSD with the high IO is
spamming its logs with:
Oct 28 14:13:25 albans_sc0 container_name/ceph-osd-0[1002]:
2021-10-28 14:13:25.731 7fa06229f700 0 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/
DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned
-2
Oct 28 14:13:50 albans_sc0 container_name/ceph-osd-0[1002]:
2021-10-28 14:13:50.095 7fa062aa0700 0 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/
DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned
-2
Occasionally I'll have the following health warnings pop up and then
clear after a few seconds:
2021-10-28 14:28:46.686909 mon.albans_sc0 [WRN] Health check failed:
0 slow ops, oldest one blocked for 32 sec, osd.2 has slow ops
(SLOW_OPS)
2021-10-28 14:28:47.880352 mon.albans_sc0 [WRN] Health check failed:
1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2021-10-28 14:28:52.669720 mon.albans_sc0 [WRN] Health check update:
4 slow ops, oldest one blocked for 36 sec, daemons [osd.2,osd.3]
have slow ops. (SLOW_OPS)
I also noticed that the MDS cache appears to be tiny: 315 MB just
now, compared to our other, similar test system which uses 4 GB.
Both systems have the same MDS cache config, although the other one
is running Octopus:
$ ceph daemon mds.albans_sc1 cache status
{
    "pool": {
        "items": 7045301,
        "bytes": 315787403
    }
}
Otherwise the cluster reports:
  cluster:
    id:     29924e01-c131-4457-b252-e7a48200b925
    health: HEALTH_WARN
            52 large omap objects

  services:
    mon: 3 daemons, quorum albans_sc0,albans_sc1,albans_sc2 (age 44h)
    mgr: albans_sc2(active, since 44h), standbys: albans_sc0, albans_sc1
    mds: cephfs:1 {0=albans_sc2=up:active} 2 up:standby
    osd: 4 osds: 4 up (since 3h), 4 in (since 3w)
    rgw: 6 daemons active (albans_sc0.pubsub, albans_sc0.rgw0,
         albans_sc1.pubsub, albans_sc1.rgw0, albans_sc2.pubsub,
         albans_sc2.rgw0)

  data:
    pools:   14 pools, 140 pgs
    objects: 4.90M objects, 129 GiB
    usage:   724 GiB used, 676 GiB / 1.4 TiB avail
    pgs:     139 active+clean
             1   active+clean+scrubbing+deep

  io:
    client: 226 KiB/s rd, 9.2 KiB/s wr, 31 op/s rd, 18 op/s wr
My client is reporting the slow S3 requests, and in the RGW logs I
can also see things like:
Oct 28 14:12:03 albans_sc2
container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28
14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj completing
Oct 28 14:12:03 albans_sc2
container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28
14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj op status=0
Oct 28 14:12:03 albans_sc2
container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28
14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj http
status=200
Oct 28 14:12:03 albans_sc2
container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28
14:12:03.234 7fcd16962700 1 ====== req done req=0x55bca6b625f0 op
status=0 http_status=200 latency=22.823s ======
I've turned the RGW logs up all the way, but I'm failing to identify
what is causing such a long delay.
What is the cause of the "ERROR: rgw_obj_remove(): cls_cxx_remove
returned -2" message? How can I investigate further into these slow
S3 requests?
Any advice/guidance on how to debug this further is much appreciated.
Thanks,
Alex
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx