Hi ceph-users,

A few weeks ago, I had an OSD node -- ceph02 -- lock up hard with no indication why. I reset the system and everything came back OK, except that I now get intermittent warnings about slow/blocked requests: OSDs on the other nodes are waiting for a "subop" to complete on one of ceph02's OSDs. Each blocked request persists for a few (5-8?) minutes, then completes. (I see this by using the admin socket to run "dump_ops_in_flight" and "dump_historic_slow_ops".)

I have tried several things to fix the issue, including rebuilding ceph02 completely: wiping and reinstalling the OS, then purging and re-creating its OSDs. All of its disks report "OK" for SMART health status. The only effective intervention so far has been to mark all of ceph02's OSDs "out". At this point I strongly suspect a hardware or firmware issue. Two questions for you folks while I dig into that:

1. Are there any more diagnostics I should try to troubleshoot the delayed subops in Ceph -- perhaps to identify what is causing the delay?

2. When an OSD is complaining about a slow/blocked request (waiting for subops), do RBD clients actually notice the delay, or does it appear to the client that the write has completed?

Thank you! Information about my cluster and example warning messages follow.

Chris Martin

About my cluster: Luminous (12.2.4), 5 nodes, each with 12 OSDs (one rotary HDD per OSD), plus a shared SSD in each node with 24 partitions holding all the RocksDB databases and WALs. Systems are Supermicro 6028R-E1CR12T with the RAID controller (LSI SAS 3108) set to JBOD mode. Deployed with ceph-ansible and using BlueStore. Bonded 10 Gbps links throughout (20 Gbps each for the client network and cluster network).
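In case it's useful context for question 1: here is the quick-and-dirty script I've been using to diff the event timestamps inside a single "dump_historic_slow_ops" entry and see which replica is the straggler. It's just a sketch -- it assumes the event names and timestamp format that my 12.2.4 OSDs emit, and the `events` list is abbreviated from the real op dump below:

```python
#!/usr/bin/env python
# Sketch: measure per-replica commit latency from one dump_historic_slow_ops
# entry. Assumes the event names / timestamp format emitted by Luminous OSDs.
import re
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def subop_latencies(events):
    """Return {peer_osd: seconds} measured from the 'waiting for subops'
    event to each 'sub_op_commit_rec from N' event."""
    start = None
    lat = {}
    for ev in events:
        t = datetime.strptime(ev["time"], FMT)
        if ev["event"].startswith("waiting for subops"):
            start = t
        m = re.match(r"sub_op_commit_rec from (\d+)", ev["event"])
        if m and start is not None:
            lat[int(m.group(1))] = (t - start).total_seconds()
    return lat

# Abbreviated from the op dump below:
events = [
    {"time": "2018-08-10 14:21:20.508331", "event": "waiting for subops from 12,21,60"},
    {"time": "2018-08-10 14:21:20.510475", "event": "sub_op_commit_rec from 12"},
    {"time": "2018-08-10 14:21:20.510526", "event": "sub_op_commit_rec from 21"},
    {"time": "2018-08-10 14:26:14.850653", "event": "sub_op_commit_rec from 60"},
]

for osd, secs in sorted(subop_latencies(events).items(), key=lambda kv: -kv[1]):
    print("osd.%d: %.3fs" % (osd, secs))
# -> osd.60: 294.342s (osd.12 and osd.21 commit in ~2 ms)
```

For the op below, this makes it obvious the write sat waiting on osd.60 (one of ceph02's OSDs) while the other two replicas committed almost instantly.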
```
HEALTH_WARN 2 slow requests are blocked > 32 sec
REQUEST_SLOW 2 slow requests are blocked > 32 sec
    2 ops are blocked > 262.144 sec
    osd.2 has blocked requests > 262.144 sec
```

```
{
    "description": "osd_op(client.84174831.0:45611220 10.1e0 10:07b8635b:::rbd_data.d091f474b0dc51.0000000000006084:head [write 716800~4096] snapc 3c3=[3c3] ondisk+write+known_if_redirected e7305)",
    "initiated_at": "2018-08-10 14:21:20.507929",
    "age": 317.226205,
    "duration": 294.342909,
    "type_data": {
        "flag_point": "commit sent; apply or cleanup",
        "client_info": {
            "client": "client.84174831",
            "client_addr": "10.140.120.206:0/2228066036",
            "tid": 45611220
        },
        "events": [
            { "time": "2018-08-10 14:21:20.507929", "event": "initiated" },
            { "time": "2018-08-10 14:21:20.508035", "event": "queued_for_pg" },
            { "time": "2018-08-10 14:21:20.508102", "event": "reached_pg" },
            { "time": "2018-08-10 14:21:20.508192", "event": "started" },
            { "time": "2018-08-10 14:21:20.508331", "event": "waiting for subops from 12,21,60" },
            { "time": "2018-08-10 14:21:20.509890", "event": "op_commit" },
            { "time": "2018-08-10 14:21:20.509895", "event": "op_applied" },
            { "time": "2018-08-10 14:21:20.510475", "event": "sub_op_commit_rec from 12" },
            { "time": "2018-08-10 14:21:20.510526", "event": "sub_op_commit_rec from 21" },
            { "time": "2018-08-10 14:26:14.850653", "event": "sub_op_commit_rec from 60" },
            { "time": "2018-08-10 14:26:14.850728", "event": "commit_sent" },
            { "time": "2018-08-10 14:26:14.850838", "event": "done" }
        ]
    }
}
```

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
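P.S. In case it helps anyone chasing a similar problem: the same timestamp-diffing idea can be applied across every op in a dump to confirm the straggler is always the same OSD. A rough sketch -- it assumes each op carries the "type_data"/"events" structure shown above, and the two-op list here is synthetic sample data, not real dump output:

```python
#!/usr/bin/env python
# Sketch: tally which peer OSD was the last to commit across many slow ops,
# to confirm a single straggler. Assumes each op dict has the
# type_data.events structure shown in the dump above.
import re
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def last_committer(op):
    """Return the peer OSD whose sub_op_commit_rec arrived last, or None."""
    latest = None
    for ev in op["type_data"]["events"]:
        m = re.match(r"sub_op_commit_rec from (\d+)", ev["event"])
        if m:
            t = datetime.strptime(ev["time"], FMT)
            if latest is None or t > latest[0]:
                latest = (t, int(m.group(1)))
    return latest[1] if latest else None

def straggler_counts(ops):
    """Count, per peer OSD, how often it was the last replica to commit."""
    return Counter(c for c in (last_committer(op) for op in ops) if c is not None)

# Synthetic two-op sample (first op abbreviated from the real dump above):
ops = [
    {"type_data": {"events": [
        {"time": "2018-08-10 14:21:20.510475", "event": "sub_op_commit_rec from 12"},
        {"time": "2018-08-10 14:26:14.850653", "event": "sub_op_commit_rec from 60"}]}},
    {"type_data": {"events": [
        {"time": "2018-08-10 14:30:00.000000", "event": "sub_op_commit_rec from 21"},
        {"time": "2018-08-10 14:33:00.000000", "event": "sub_op_commit_rec from 60"}]}},
]

print(straggler_counts(ops).most_common())
# -> [(60, 2)]
```

If one OSD number dominates the tally the way osd.60 does here, that disk (or its HBA slot, cable, or firmware) is the place to look.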