Nautilus upgrade causes spike in MDS latency

Josh Haft <paccrap@xxxxxxxxx> · Mon, 13 Apr 2020 15:32:56 -0500

Hi,

I upgraded from 13.2.5 to 14.2.6 last week and am now seeing
significantly higher latency on various MDS operations. For example,
the 2min rate of ceph_mds_server_req_create_latency_sum /
ceph_mds_server_req_create_latency_count for an 8hr window last Monday
prior to the upgrade was an average of 2ms. Today, however, the same
stat shows 869ms. Other operations including open, readdir, rmdir,
etc. are also taking significantly longer.
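
For reference, the ratio above can be reproduced by hand from two samples of the counters. This is only a minimal sketch: the sample values are hypothetical, the helper name average_latency_ms is made up for illustration, and it assumes the _sum counter is reported in seconds.

```python
def average_latency_ms(prev, curr):
    """Average per-op latency between two (latency_sum, latency_count) samples.

    Assumes latency_sum is cumulative and expressed in seconds
    (an assumption; check what your exporter reports).
    """
    d_sum = curr[0] - prev[0]
    d_count = curr[1] - prev[1]
    if d_count <= 0:
        # No ops completed in the window, or the counters were reset.
        return None
    return d_sum / d_count * 1000.0


# Hypothetical samples taken two minutes apart: (sum, count)
before = (120.0, 60000)
after = (294.0, 60200)
print(average_latency_ms(before, after))
```

Dividing the deltas rather than the raw totals is what keeps the result specific to the window instead of averaging over the daemon's whole uptime.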

Here's a partial example of an op from dump_ops_in_flight:
        {
            "description": "client_request(client.342513090:334359409 create #...)",
            "initiated_at": "2020-04-13 15:30:15.707637",
            "age": 0.19583208099999999,
            "duration": 0.19767626299999999,
            "type_data": {
                "flag_point": "submit entry: journal_and_reply",
                "reqid": "client.342513090:334359409",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.342513090",
                    "tid": 334359409
                },
                "events": [
                    {
                        "time": "2020-04-13 15:30:15.707637",
                        "event": "initiated"
                    },
                    {
                        "time": "2020-04-13 15:30:15.707637",
                        "event": "header_read"
                    },
                    {
                        "time": "2020-04-13 15:30:15.707638",
                        "event": "throttled"
                    },
                    {
                        "time": "2020-04-13 15:30:15.707640",
                        "event": "all_read"
                    },
                    {
                        "time": "2020-04-13 15:30:15.781935",
                        "event": "dispatched"
                    },
                    {
                        "time": "2020-04-13 15:30:15.785086",
                        "event": "acquired locks"
                    },
                    {
                        "time": "2020-04-13 15:30:15.785507",
                        "event": "early_replied"
                    },
                    {
                        "time": "2020-04-13 15:30:15.785508",
                        "event": "submit entry: journal_and_reply"
                    }
                ]
            }
        }

This, along with every other 'create' op I've seen, has a 50ms+ delay
between the all_read and dispatched events (about 74ms in the op above) -
what is happening during this time? I'm not sure what I should be
looking for in the MDS debug logs.
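
To show where the time goes, here is a minimal sketch that computes the gap between consecutive events of one op. The events list is copied from the dump above; the helper name event_gaps is made up for illustration, and the timestamp format is assumed to match what is shown in that dump.

```python
from datetime import datetime

# Timestamp layout as it appears in the dump above (assumed to be stable).
FMT = "%Y-%m-%d %H:%M:%S.%f"


def event_gaps(events):
    """Return (previous_event, event, gap_seconds) for consecutive events."""
    gaps = []
    for prev, cur in zip(events, events[1:]):
        t0 = datetime.strptime(prev["time"], FMT)
        t1 = datetime.strptime(cur["time"], FMT)
        gaps.append((prev["event"], cur["event"], (t1 - t0).total_seconds()))
    return gaps


events = [
    {"time": "2020-04-13 15:30:15.707637", "event": "initiated"},
    {"time": "2020-04-13 15:30:15.707637", "event": "header_read"},
    {"time": "2020-04-13 15:30:15.707638", "event": "throttled"},
    {"time": "2020-04-13 15:30:15.707640", "event": "all_read"},
    {"time": "2020-04-13 15:30:15.781935", "event": "dispatched"},
    {"time": "2020-04-13 15:30:15.785086", "event": "acquired locks"},
    {"time": "2020-04-13 15:30:15.785507", "event": "early_replied"},
    {"time": "2020-04-13 15:30:15.785508", "event": "submit entry: journal_and_reply"},
]

for a, b, gap in event_gaps(events):
    print(f"{a} -> {b}: {gap * 1000:.3f} ms")
```

Running the same helper over many ops would show whether the all_read to dispatched gap dominates consistently or only for some request types.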

We have a mix of clients from 12.2.x through 14.2.8; my plan was to
upgrade those pre-Nautilus clients this week. There is only a single
MDS rank with one standby. Other functions of this cluster - RBDs and RGW
- do not appear impacted so this looks limited to the MDS. I did not
observe this behavior after upgrading a dev cluster last month.

Has anyone seen anything similar? Thanks for any assistance!
Josh