Re: Slow S3 Requests

Hi,

It's possible that the log messages are a consequence of the slow requests. Having only 4 OSDs is a bit of a corner case: Ceph is designed as a scalable solution, so the real benefits come with a larger number of OSDs, where many client requests can be parallelized across many OSDs. With only 4 OSDs, the one doing the most IO is probably the primary OSD for the data being hit. You also have more RGWs than OSDs; are all of your RGWs serving client IO? Could you provide more details about the load the clients are producing? I assume the OSDs are also virtual disks?
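
If it helps to confirm that, something along these lines (run from any node with a client keyring) should show which PGs the busy OSD is primary for and how the pools are laid out; this is just a sketch, adjust the OSD id to whichever one is busy:

$ ceph osd df tree                  # PG count and utilization per OSD
$ ceph pg ls-by-primary osd.2       # PGs for which the busy OSD is the primary
$ ceph osd pool ls detail           # pg_num per pool, to see how the RGW pools are spread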

I would recommend scaling out if you can.

Regards,
Eugen


Quoting Alex Hussein-Kershaw <alexhus@xxxxxxxxxxxxx>:

Hi ceph-users,

We're having an issue on our test cluster: S3 requests are slow, usually taking a few seconds but occasionally up to 30s.

This is a multisite cluster of 4 VMs running Ceph in containers. We have 4 OSDs, 3 MDS, 3 MONs and 3 RGWs. We're running ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable). The cluster is backed by SSDs.

One of our 4 OSDs is doing much more read IO than the others, bouncing between 0 and 300 MB/s, while the other 3 OSDs appear mostly idle from an IO perspective. The busy OSD is spamming its logs with:

Oct 28 14:13:25 albans_sc0 container_name/ceph-osd-0[1002]: 2021-10-28 14:13:25.731 7fa06229f700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned -2
Oct 28 14:13:50 albans_sc0 container_name/ceph-osd-0[1002]: 2021-10-28 14:13:50.095 7fa062aa0700 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/cls/rgw/cls_rgw.cc:2090: ERROR: rgw_obj_remove(): cls_cxx_remove returned -2
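
For reference, the -2 in those cls messages is a plain errno value; one quick, non-Ceph-specific way to decode it:

$ python3 -c 'import errno, os; print(errno.errorcode[2], "-", os.strerror(2))'
ENOENT - No such file or directory

i.e. the object the cls method tried to remove was not found.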

Occasionally I'll have the following health warnings pop up and then clear after a few seconds:

2021-10-28 14:28:46.686909 mon.albans_sc0 [WRN] Health check failed: 0 slow ops, oldest one blocked for 32 sec, osd.2 has slow ops (SLOW_OPS)
2021-10-28 14:28:47.880352 mon.albans_sc0 [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2021-10-28 14:28:52.669720 mon.albans_sc0 [WRN] Health check update: 4 slow ops, oldest one blocked for 36 sec, daemons [osd.2,osd.3] have slow ops. (SLOW_OPS)
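
To see where those blocked ops on osd.2/osd.3 are spending their time, the OSD admin socket can dump recent long-running ops with a per-event timeline (in a containerized deployment this needs to be run inside, or exec'd into, the OSD container):

$ ceph daemon osd.2 ops                     # ops currently in flight
$ ceph daemon osd.2 dump_historic_ops       # recently completed long ops with event timestamps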

I also noticed that the MDS cache appears to be tiny (315 MB just now, compared to 4 GB on our other, similar test system; they have the same MDS cache config, although the other system is running Octopus):

$ ceph daemon mds.albans_sc1 cache status
{
    "pool": {
        "items": 7045301,
        "bytes": 315787403
    }
}
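
The cache status output reflects current usage rather than the configured limit; to compare the effective limit and memory accounting on both systems, something like this should work against the MDS admin socket:

$ ceph daemon mds.albans_sc1 config get mds_cache_memory_limit
$ ceph daemon mds.albans_sc1 perf dump mds_mem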

Otherwise the cluster reports:

  cluster:
    id:     29924e01-c131-4457-b252-e7a48200b925
    health: HEALTH_WARN
            52 large omap objects

  services:
    mon: 3 daemons, quorum albans_sc0,albans_sc1,albans_sc2 (age 44h)
    mgr: albans_sc2(active, since 44h), standbys: albans_sc0, albans_sc1
    mds: cephfs:1 {0=albans_sc2=up:active} 2 up:standby
    osd: 4 osds: 4 up (since 3h), 4 in (since 3w)
    rgw: 6 daemons active (albans_sc0.pubsub, albans_sc0.rgw0, albans_sc1.pubsub, albans_sc1.rgw0, albans_sc2.pubsub, albans_sc2.rgw0)

  data:
    pools:   14 pools, 140 pgs
    objects: 4.90M objects, 129 GiB
    usage:   724 GiB used, 676 GiB / 1.4 TiB avail
    pgs:     139 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   226 KiB/s rd, 9.2 KiB/s wr, 31 op/s rd, 18 op/s wr

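The "52 large omap objects" warning is often the RGW bucket index (or usage/sync logs); a rough way to narrow it down, assuming default RGW settings:

$ ceph health detail                  # names the pool and PGs holding the large omap objects
$ radosgw-admin bucket limit check    # per-bucket object counts versus index shard limits
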
My client is reporting the slow S3 requests, and in the RGW logs I can also see things like:

Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj completing
Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj op status=0
Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700 2 req 132980 22.823s s3:get_obj http status=200
Oct 28 14:12:03 albans_sc2 container_name/ceph-rgw-albans_sc2-rgw0[1001]: 2021-10-28 14:12:03.234 7fcd16962700 1 ====== req done req=0x55bca6b625f0 op status=0 http_status=200 latency=22.823s ======

I've turned the RGW logs up all the way, but I'm failing to identify what is causing such a long delay.
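
One thing that can help while a request is stuck is dumping the RGW's in-flight RADOS operations, which shows which OSD each one is waiting on. This assumes you can reach the RGW admin socket inside its container; the socket name below is a guess based on the daemon names and the exact .asok filename will differ:

$ ceph daemon /var/run/ceph/ceph-client.rgw.albans_sc2.rgw0.*.asok objecter_requests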

What is the cause of the "ERROR: rgw_obj_remove(): cls_cxx_remove returned -2" message? How can I investigate further into these slow S3 requests?

Any advice/guidance on how to debug this further is much appreciated.

Thanks,
Alex
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


