Slow S3 Requests

Hi ceph-users,

We're having an issue on our test cluster: S3 requests are slow, usually taking a few seconds but occasionally up to 30s.
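
For reference, I've been timing requests from the client side with something along these lines (just a sketch; the bucket and object names are placeholders, and s3cmd is pointed at one of the RGWs):

$ time s3cmd get --force s3://test-bucket/test-object /dev/null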

This is a multisite cluster of 4 VMs running Ceph in containers. We have 4 OSDs, 3 MDSs, 3 MONs and 3 RGWs. We're running ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable). The cluster is backed by SSDs.

One of our 4 OSDs is doing far more read I/O than the others, bouncing between 0 and 300 MB/s, while the other 3 OSDs appear mostly idle from an IO perspective. The busy OSD is spamming its logs with:

Oct 28 14:13:25 albans_sc0 container_name/ceph-osd-0[1002]: 2021-10-28 14:13:25.731 7fa06229f700  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned -2
Oct 28 14:13:50 albans_sc0 container_name/ceph-osd-0[1002]: 2021-10-28 14:13:50.095 7fa062aa0700  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned -2
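
In case it helps, this is roughly what I've been running against that OSD's admin socket to see what it is busy with (osd.0 here is just the busy OSD in my case; run where the admin socket is reachable, i.e. inside/alongside the OSD container):

$ ceph daemon osd.0 dump_ops_in_flight
$ ceph daemon osd.0 dump_historic_ops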

Occasionally the following health warnings pop up and then clear after a few seconds:

2021-10-28 14:28:46.686909 mon.albans_sc0 [WRN] Health check failed: 0 slow ops, oldest one blocked for 32 sec, osd.2 has slow ops (SLOW_OPS)
2021-10-28 14:28:47.880352 mon.albans_sc0 [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2021-10-28 14:28:52.669720 mon.albans_sc0 [WRN] Health check update: 4 slow ops, oldest one blocked for 36 sec, daemons [osd.2,osd.3] have slow ops. (SLOW_OPS)
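
When those warnings fire I've been trying to catch the offending ops with something like the following (a sketch; I'm assuming dump_historic_slow_ops and objecter_requests are the right admin socket commands on Nautilus):

$ ceph health detail
$ ceph daemon osd.2 dump_historic_slow_ops
$ ceph daemon mds.albans_sc2 objecter_requests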

I also noticed that the MDS cache appears to be tiny: 315 MB just now, compared to about 4 GB on our other, similar test system. Both systems have the same MDS cache config, although the other one is running Octopus:

$ ceph daemon mds.albans_sc1 cache status
{
    "pool": {
        "items": 7045301,
        "bytes": 315787403
    }
}
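
For what it's worth, this is how I've been comparing the cache settings on the two systems (assuming mds_cache_memory_limit is the relevant option here):

$ ceph daemon mds.albans_sc1 config get mds_cache_memory_limit
$ ceph config get mds.albans_sc1 mds_cache_memory_limit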

Otherwise the cluster reports:

  cluster:
    id:     29924e01-c131-4457-b252-e7a48200b925
    health: HEALTH_WARN
            52 large omap objects

  services:
    mon: 3 daemons, quorum albans_sc0,albans_sc1,albans_sc2 (age 44h)
    mgr: albans_sc2(active, since 44h), standbys: albans_sc0, albans_sc1
    mds: cephfs:1 {0=albans_sc2=up:active} 2 up:standby
    osd: 4 osds: 4 up (since 3h), 4 in (since 3w)
    rgw: 6 daemons active (albans_sc0.pubsub, albans_sc0.rgw0, albans_sc1.pubsub, albans_sc1.rgw0, albans_sc2.pubsub, albans_sc2.rgw0)

  data:
    pools:   14 pools, 140 pgs
    objects: 4.90M objects, 129 GiB
    usage:   724 GiB used, 676 GiB / 1.4 TiB avail
    pgs:     139 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   226 KiB/s rd, 9.2 KiB/s wr, 31 op/s rd, 18 op/s wr
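
On the "52 large omap objects" warning, I've been trying to pin down which objects those are with something like this (a sketch; the cluster log path is a guess for our containerised setup, and I'm assuming the RGW bucket index is the likely culprit, hence the radosgw-admin check):

$ grep 'Large omap object found' /var/log/ceph/ceph.log
$ radosgw-admin bucket limit check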

My client is reporting the slow S3 requests, and in the RGW logs I can also see things like:

Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700  2 req 132980 22.823s s3:get_obj completing
Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700  2 req 132980 22.823s s3:get_obj op status=0
Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700  2 req 132980 22.823s s3:get_obj http status=200
Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700  1 ====== req done req=0x55bca6b625f0 op status=0 http_status=200 latency=22.823s ======
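
To get a sense of how often this happens, I've been pulling the completed-request latencies out of the RGW log with a quick pipeline like this (sketch only; rgw.log is a placeholder for wherever the container's RGW log ends up):

$ grep 'req done' rgw.log | sed -e 's/.*latency=//' -e 's/s ======$//' | sort -n | tail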

I've turned the RGW logs up all the way, but I'm failing to identify what is causing such long delays.
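
("All the way" here means roughly the following, set via the config database and repeated for each RGW instance; I'm assuming the entity names match what ceph status shows for our containerised daemons:)

$ ceph config set client.rgw.albans_sc2.rgw0 debug_rgw 20
$ ceph config set client.rgw.albans_sc2.rgw0 debug_ms 1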

What is the cause of the "ERROR: rgw_obj_remove(): cls_cxx_remove returned -2" message? And how can I investigate these slow S3 requests further?

Any advice/guidance on how to debug this further is much appreciated.

Thanks,
Alex
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


