Slow OSD ops on large ARM cluster


 



Hello,

We are having issues with slow ops on our large ARM HPC Ceph cluster.

The cluster runs Ceph 18.2.0 on Ubuntu 20.04.
MONs, MGRs and MDSs had to be moved to Intel servers because of poor single-core performance on our ARM servers. Our main CephFS data pool spans 54 servers in 9 racks with 1458 HDDs in total (OSDs without block.db on SSD). The CephFS data pool is an erasure-coded pool with k=6, m=2 and rack-level failure domain. The pool has about 16k PGs, with an average of ~90 PGs per OSD.
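
For context, the profile and pool were created roughly along these lines (the profile and pool names below are placeholders, not our actual ones):

    # sketch of the EC profile and pool setup; 16384 matches the ~16k PGs mentioned above
    ceph osd erasure-code-profile set hpc_ec62 k=6 m=2 crush-failure-domain=rack
    ceph osd pool create cephfs_data_ec 16384 16384 erasure hpc_ec62
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true
    ceph fs add_data_pool cephfs cephfs_data_ec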

We have had good experience with EC CephFS on a 3.5 times smaller Intel Ceph cluster, but this ARM deployment is becoming problematic. We started experiencing issues when one of the users began generating sequential read/write traffic at about 5 GiB/s. A single OSD with slow ops was enough to create a laggy PG and crash the application generating this traffic. We've even had a case where an OSD with slow ops stayed laggy for 6 hours and required a manual restart.

Now we are experiencing slow ops even at much lower, read-only traffic of ~400 MiB/s.

Here is an example of a slow op on an OSD:
{
    "ops": [
        {
"description": "osd_op(client.255949991.0:92728602 4.d22s0 4:44b3390a:::1000b640ddc.0000039b:head [read 3633152~8192] snapc 0=[] ondisk+read+known_if_redirected e1117246)",
            "initiated_at": "2024-07-08T10:19:58.469537+0000",
            "age": 507.242936848,
            "duration": 507.24298854800003,
            "type_data": {
                "flag_point": "started",
                "client_info": {
                    "client": "client.255949991",
                    "client_addr": "x.x.x.x:0/887459214",
                    "tid": 92728602
                },
                "events": [
                    {
                        "event": "initiated",
                        "time": "2024-07-08T10:19:58.469537+0000",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2024-07-08T10:19:58.469537+0000",
                        "duration": 0
                    },
                    {
                        "event": "header_read",
                        "time": "2024-07-08T10:19:58.469535+0000",
                        "duration": 4294967295.9999981
                    },
                    {
                        "event": "all_read",
                        "time": "2024-07-08T10:19:58.469571+0000",
                        "duration": 3.5859999999999999e-05
                    },
                    {
                        "event": "dispatched",
                        "time": "2024-07-08T10:19:58.469573+0000",
                        "duration": 2.08e-06
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2024-07-08T10:19:58.469586+0000",
                        "duration": 1.2721000000000001e-05
                    },
                    {
                        "event": "reached_pg",
                        "time": "2024-07-08T10:19:58.485132+0000",
                        "duration": 0.015546048999999999
                    },
                    {
                        "event": "started",
                        "time": "2024-07-08T10:19:58.485147+0000",
                        "duration": 1.5160000000000001e-05
                    }
                ]
            }
        },
The HDD backing this OSD is not busy. The ARM cores on these servers are slow, but no process reaches 100% usage of a single core.
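
In case it's useful, this is roughly how I've been checking disk and op state on the affected host (osd.N below stands for the OSD with slow ops):

    # per-disk utilization on the OSD host
    iostat -x 1

    # in-flight and recent slow ops via the OSD admin socket
    ceph daemon osd.N dump_ops_in_flight
    ceph daemon osd.N dump_historic_slow_ops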

I think we may have the same issue as the one described here: https://www.mail-archive.com/ceph-users@xxxxxxx/msg13273.html

I've tried reducing osd_pool_default_read_lease_ratio from 0.8 to 0.2 and osd_heartbeat_grace from 20 to 10. That should lower the read_lease_interval from 16 s to 2 s, but it didn't help; we still see a lot of slow ops.
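
For reference, the changes were applied roughly like this (and verified on one of the affected OSDs):

    # read_lease_interval = osd_heartbeat_grace * osd_pool_default_read_lease_ratio
    # defaults: 20 * 0.8 = 16 s; after the change: 10 * 0.2 = 2 s
    ceph config set osd osd_heartbeat_grace 10
    ceph config set global osd_pool_default_read_lease_ratio 0.2

    # check what a given OSD actually sees
    ceph daemon osd.N config get osd_heartbeat_grace
    ceph daemon osd.N config get osd_pool_default_read_lease_ratio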

Could you give me some tips on what I could tune to fix this issue?

Could this be an issue with a large number of EC PGs on a large cluster with weak CPUs?
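
In case it helps narrow that down, these are the knobs and per-thread checks I'm planning to look at next (option and thread names as I understand them on 18.2, so corrections welcome):

    # OSD op worker sharding currently in effect for HDD OSDs
    ceph daemon osd.N config get osd_op_num_shards_hdd
    ceph daemon osd.N config get osd_op_num_threads_per_shard_hdd

    # per-thread CPU usage of the OSD worker threads (tp_osd_tp);
    # replace <pid> with the ceph-osd process ID for osd.N
    top -H -p <pid>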

Best regards
Adam Prycki
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


