Hi,
it's possible that the log messages are a consequence of the slow
requests. Having only 4 OSDs is something of a corner case: Ceph is
designed as a scalable solution, so the real benefits start with a
large number of OSDs, where many client requests can be parallelized
across many OSDs. And with only 4 OSDs, the one with the most IO is
probably the primary OSD. You have more RGWs than OSDs; are all of
your RGWs serving client IO? Could you provide more details about the
load the clients are producing? I assume the OSDs are also virtual
disks?
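To check how the load is distributed, you could look at the per-OSD
utilization and latency and at which pools the client IO actually
goes to, for example (osd.2 below is just the OSD from your slow ops
warnings):

$ ceph osd df tree              # per-OSD usage and PG count
$ ceph osd perf                 # per-OSD commit/apply latency
$ ceph osd pool stats           # client IO rates per pool
$ ceph pg ls-by-primary osd.2   # PGs that have osd.2 as their primary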
In any case, I would recommend scaling out if you can.
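In the meantime, to see what the slow ops on osd.2 are actually
doing, you could dump them from the OSD's admin socket (run this on
the node hosting osd.2, from inside the OSD container since you run
containerized):

$ ceph daemon osd.2 dump_ops_in_flight   # ops currently in flight / blocked
$ ceph daemon osd.2 dump_historic_ops    # recently finished ops with per-event timestamps

The event timestamps usually show where the time is spent, e.g.
queued for the PG, waiting for sub ops on the replica OSDs, or
committing to disk.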
Regards,
Eugen
Quoting Alex Hussein-Kershaw <alexhus@xxxxxxxxxxxxx>:
Hi ceph-users,
We're having an issue on our test cluster: S3 requests are slow,
usually taking a few seconds but up to 30s on occasion.
This is a multisite cluster of 4 VMs running Ceph in containers. We
have 4 OSDs, 3 MDS, 3 MONs and 3 RGWs. We're running ceph version
14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable).
The cluster is backed by SSDs.
One of our 4 OSDs is doing much more read IO than the others,
bouncing between 0 and 300 MB/s. The other 3 OSDs appear mostly
idle from an IO perspective. The OSD with the high IO is
spamming its logs with:
Oct 28 14:13:25 albans_sc0 container_name/ceph-osd-0[1002]:
2021-10-28 14:13:25.731 7fa06229f700 0 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/
DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned
-2
Oct 28 14:13:50 albans_sc0 container_name/ceph-osd-0[1002]:
2021-10-28 14:13:50.095 7fa062aa0700 0 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/
DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned
-2
Occasionally I'll have the following health warnings pop up and then
clear after a few seconds:
2021-10-28 14:28:46.686909 mon.albans_sc0 [WRN] Health check failed:
0 slow ops, oldest one blocked for 32 sec, osd.2 has slow ops
(SLOW_OPS)
2021-10-28 14:28:47.880352 mon.albans_sc0 [WRN] Health check failed:
1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2021-10-28 14:28:52.669720 mon.albans_sc0 [WRN] Health check update:
4 slow ops, oldest one blocked for 36 sec, daemons [osd.2,osd.3]
have slow ops. (SLOW_OPS)
I also noticed that the MDS cache appears to be tiny: 315 MB just
now, compared to our other, similar test system which uses 4 GB.
Both systems have the same MDS cache config, although the other one
is running Octopus:
$ ceph daemon mds.albans_sc1 cache status
{
    "pool": {
        "items": 7045301,
        "bytes": 315787403
    }
}
Otherwise the cluster reports:
  cluster:
    id:     29924e01-c131-4457-b252-e7a48200b925
    health: HEALTH_WARN
            52 large omap objects

  services:
    mon: 3 daemons, quorum albans_sc0,albans_sc1,albans_sc2 (age 44h)
    mgr: albans_sc2(active, since 44h), standbys: albans_sc0, albans_sc1
    mds: cephfs:1 {0=albans_sc2=up:active} 2 up:standby
    osd: 4 osds: 4 up (since 3h), 4 in (since 3w)
    rgw: 6 daemons active (albans_sc0.pubsub, albans_sc0.rgw0,
         albans_sc1.pubsub, albans_sc1.rgw0, albans_sc2.pubsub,
         albans_sc2.rgw0)

  data:
    pools:   14 pools, 140 pgs
    objects: 4.90M objects, 129 GiB
    usage:   724 GiB used, 676 GiB / 1.4 TiB avail
    pgs:     139 active+clean
             1   active+clean+scrubbing+deep

  io:
    client: 226 KiB/s rd, 9.2 KiB/s wr, 31 op/s rd, 18 op/s wr
My client is reporting the slow S3 requests, and in the RGW logs I
can also see things like:
Oct 28 14:12:03 albans_sc2
container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28
14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj completing
Oct 28 14:12:03 albans_sc2
container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28
14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj op status=0
Oct 28 14:12:03 albans_sc2
container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28
14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj http
status=200
Oct 28 14:12:03 albans_sc2
container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28
14:12:03.234 7fcd16962700 1 ====== req done req=0x55bca6b625f0 op
status=0 http_status=200 latency=22.823s ======
I've turned the RGW logs up all the way, but I'm failing to identify
what is causing such a long delay.
What is the cause of the "ERROR: rgw_obj_remove(): cls_cxx_remove
returned -2" message? How can I investigate further into these slow
S3 requests?
Any advice/guidance on how to debug this further is much appreciated.
Thanks,
Alex
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx