Hi ceph-users,

We're having an issue on our test cluster: S3 requests are slow, up to 30s on occasion but usually taking a few seconds. This is a multisite cluster of 4 VMs running Ceph in containers, with 4 OSDs, 3 MDSs, 3 MONs and 3 RGWs. We're running ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable). The cluster is backed by SSDs.

One of our 4 OSDs is doing much more read I/O than the others, bouncing between 0 and 300 MB/s, while the other 3 OSDs appear mostly idle from an IO perspective. The busy OSD is spamming its log with:

Oct 28 14:13:25 albans_sc0 container_name/ceph-osd-0[1002]: 2021-10-28 14:13:25.731 7fa06229f700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned -2
Oct 28 14:13:50 albans_sc0 container_name/ceph-osd-0[1002]: 2021-10-28 14:13:50.095 7fa062aa0700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned -2

Occasionally the following health warnings pop up and then clear again after a few seconds:

2021-10-28 14:28:46.686909 mon.albans_sc0 [WRN] Health check failed: 0 slow ops, oldest one blocked for 32 sec, osd.2 has slow ops (SLOW_OPS)
2021-10-28 14:28:47.880352 mon.albans_sc0 [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2021-10-28 14:28:52.669720 mon.albans_sc0 [WRN] Health check update: 4 slow ops, oldest one blocked for 36 sec, daemons [osd.2,osd.3] have slow ops. (SLOW_OPS)
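Is something like the following the right way to dig into what that OSD is actually doing? (A sketch on my part: osd.2 is just the daemon named in the warnings above, and the "ceph daemon" calls need to run wherever that OSD's admin socket lives, which for us means inside the OSD container.)

$ ceph osd perf                          # per-OSD commit/apply latency, to confirm the outlier
$ ceph daemon osd.2 dump_ops_in_flight   # ops currently blocked on the suspect OSD
$ ceph daemon osd.2 dump_historic_ops    # recently completed slow ops, with per-stage timings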
I also noticed that the MDS cache appears to be tiny: 315 MB just now, compared to our other, similar test system which uses 4 GB. They have the same MDS cache config, although the other system is running Octopus:

$ ceph daemon mds.albans_sc1 cache status
{
    "pool": {
        "items": 7045301,
        "bytes": 315787403
    }
}

Otherwise the cluster reports:

  cluster:
    id:     29924e01-c131-4457-b252-e7a48200b925
    health: HEALTH_WARN
            52 large omap objects

  services:
    mon: 3 daemons, quorum albans_sc0,albans_sc1,albans_sc2 (age 44h)
    mgr: albans_sc2(active, since 44h), standbys: albans_sc0, albans_sc1
    mds: cephfs:1 {0=albans_sc2=up:active} 2 up:standby
    osd: 4 osds: 4 up (since 3h), 4 in (since 3w)
    rgw: 6 daemons active (albans_sc0.pubsub, albans_sc0.rgw0, albans_sc1.pubsub, albans_sc1.rgw0, albans_sc2.pubsub, albans_sc2.rgw0)

  data:
    pools:   14 pools, 140 pgs
    objects: 4.90M objects, 129 GiB
    usage:   724 GiB used, 676 GiB / 1.4 TiB avail
    pgs:     139 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   226 KiB/s rd, 9.2 KiB/s wr, 31 op/s rd, 18 op/s wr

My client is reporting the slow S3 requests, and in the RGW logs I can also see things like:

Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj completing
Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj op status=0
Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj http status=200
Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700 1 ====== req done req=0x55bca6b625f0 op status=0 http_status=200 latency=22.823s ======

I've turned the RGW logs all the way up, but I'm failing to identify what is causing such a long delay.

What is the cause of the "ERROR: rgw_obj_remove(): cls_cxx_remove returned -2" message? How can I investigate these slow S3 requests further?

Any advice/guidance on how to debug this further is much appreciated.

Thanks,
Alex
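P.S. Unless there's a better starting point, my next step was going to be checking the multisite sync state and the bucket index sharding, on the guess that the 52 large omap objects are bucket index shards. Roughly the following (the bucket name is a placeholder, and the grep assumes the default cluster log location on a mon host):

$ radosgw-admin sync status                         # is multisite sync healthy or backlogged?
$ radosgw-admin bucket limit check                  # objects per index shard (fill status) for each bucket
$ radosgw-admin bucket stats --bucket=<bucket>      # per-bucket object counts and usage
$ grep 'Large omap object' /var/log/ceph/ceph.log   # which objects triggered the large omap warning

Does that sound like a sensible direction?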