OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

"Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx> · Thu, 28 Oct 2021 11:19:15 +0000

Hi,

I found on the mail thread couple of emails that might be related to this issue, but there it came after osd restart or recovery.
My cluster is on octopus 15.2.14 at the moment.

When slow ops started to come I can see laggy pgs and slowly the 3 rados gateway starts to die behind the haproxy and when the slow ops gone it will restart, but it causes user interruption.

Digging deeper on the affected osd slow_ops historical log I can see couple of slow ops, where 32 seconds spent on waiting for readable event. Is there anything to do with this? To avoid osd restart I'm running with
osd_op_thread_suicide_timeout=2000
osd_op_thread_timeout=90
enabled on all osd at the moment.

Might be I'm using too small amount of pg?
Data pool is on 4:2 ec host based and have 7 nodes, currently the pg number is 128 based on the balancer suggestion.

An example slow ops:

{
            "description": "osd_op(client.141400841.0:290235021 28.17s0 28:e94c28ab:::9213182a-14ba-48ad-bde9-289a1c0c0de8.6034919.1_%2fWHITELABEL-1%2fPAGETPYE-5%2fDEVICE-4%2fLANGUAGE-38%2fSUBTYPE-0%2f148341:head [create,setxattr user.rgw.idtag (56) in=14b,setxattr user.rgw.tail_tag (56) in=17b,writefull 0~17706,setxattr user.rgw.manifest (375) in=17b,setxattr user.rgw.acl (123) in=12b,setxattr user.rgw.content_type (10) in=21b,setxattr user.rgw.etag (32) in=13b,setxattr user.rgw.x-amz-meta-storagetimestamp (40) in=36b,call rgw.obj_store_pg_ver in=19b,setxattr user.rgw.source_zone (4) in=20b] snapc 0=[] ondisk+write+known_if_redirected e37602)",
            "initiated_at": "2021-10-28T16:52:44.652426+0700",
            "age": 3258.4773889160001,
            "duration": 32.445993113,
            "type_data": {
                "flag_point": "started",
                "client_info": {
                    "client": "client.141400841",
                    "client_addr": "10.118.199.3:0/462844935",
                    "tid": 290235021
                },
                "events": [
                    {
                        "event": "initiated",
                        "time": "2021-10-28T16:52:44.652426+0700",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2021-10-28T16:52:44.652426+0700",
                        "duration": 6.1143999999999996e-05
                    },
                    {
                        "event": "header_read",
                        "time": "2021-10-28T16:52:44.652487+0700",
                        "duration": 2.2340000000000001e-06
                    },
                    {
                        "event": "all_read",
                        "time": "2021-10-28T16:52:44.652489+0700",
                        "duration": 1.1519999999999999e-06
                    },
                    {
                        "event": "dispatched",
                        "time": "2021-10-28T16:52:44.652490+0700",
                        "duration": 3.146e-06
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2021-10-28T16:52:44.652493+0700",
                        "duration": 4.0936000000000002e-05
                    },
                    {
                        "event": "reached_pg",
                        "time": "2021-10-28T16:52:44.652534+0700",
                        "duration": 1.9236e-05
                    },
                    {
                        "event": "waiting for readable",
                        "time": "2021-10-28T16:52:44.652554+0700",
                        "duration": 32.430086821000003
                    },
                    {
                        "event": "reached_pg",
                        "time": "2021-10-28T16:53:17.082640+0700",
                        "duration": 0.00010452500000000001
                    },
                    {
                        "event": "started",
                        "time": "2021-10-28T16:53:17.082745+0700",
                        "duration": 0.015673919000000001
                    },
                    {
                        "event": "done",
                        "time": "2021-10-28T16:53:17.098419+0700",
                        "duration": 32.445993113
                    }
                ]
            }
        }

Thanks in advance.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx