Hello folks,

We've been having issues with slow requests cropping up on practically idle Ceph clusters. From what I can tell, the requests are hanging waiting for subops, and the OSD on the other end receives the request minutes later! In the example below, the op started waiting for subops at 12:09:51 and the subop was not completed until 12:14:28:

    {
        "description": "osd_op(client.903117.0:569924 6.391 6:89ed76f2:::%2fraster%2fv5%2fes%2f16%2f36320%2f24112:head [writefull 0~2072] snapc 0=[] ondisk+write+known_if_redirected e5777)",
        "initiated_at": "2018-07-05 12:09:51.191419",
        "age": 326.651167,
        "duration": 276.977834,
        "type_data": {
            "flag_point": "commit sent; apply or cleanup",
            "client_info": {
                "client": "client.903117",
                "client_addr": "10.20.31.234:0/1433094386",
                "tid": 569924
            },
            "events": [
                { "time": "2018-07-05 12:09:51.191419", "event": "initiated" },
                { "time": "2018-07-05 12:09:51.191471", "event": "queued_for_pg" },
                { "time": "2018-07-05 12:09:51.191538", "event": "reached_pg" },
                { "time": "2018-07-05 12:09:51.191877", "event": "started" },
                { "time": "2018-07-05 12:09:51.192135", "event": "waiting for subops from 11" },
                { "time": "2018-07-05 12:09:51.192599", "event": "op_commit" },
                { "time": "2018-07-05 12:09:51.192616", "event": "op_applied" },
                { "time": "2018-07-05 12:14:28.169018", "event": "sub_op_commit_rec from 11" },
                { "time": "2018-07-05 12:14:28.169164", "event": "commit_sent" },
                { "time": "2018-07-05 12:14:28.169253", "event": "done" }
            ]
        }
    },

Below is what I assume is the corresponding request on osd.11; it seems to receive the network request roughly four minutes later:

    2018-07-05 12:14:28.058552 7fb75ee0e700 20 osd.11 5777 share_map_peer 0x562b61bca000 already has epoch 5777
    2018-07-05 12:14:28.167247 7fb75de0c700 10 osd.11 5777 new session 0x562cc23f0200 con=0x562baaa0e000 addr=10.16.15.28:6805/3218
    2018-07-05 12:14:28.167282 7fb75de0c700 10 osd.11 5777 session 0x562cc23f0200 osd.20 has caps osdcap[grant(*)] 'allow *'
    2018-07-05 12:14:28.167291 7fb75de0c700 0 -- 10.16.16.32:6817/3808 >> 10.16.15.28:6805/3218 conn(0x562baaa0e000 :6817 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 20 vs existing csq=19 existing_state=STATE_STANDBY
    2018-07-05 12:14:28.167322 7fb7546d6700 2 osd.11 5777 ms_handle_reset con 0x562baaa0e000 session 0x562cc23f0200
    2018-07-05 12:14:28.167546 7fb75de0c700 10 osd.11 5777 session 0x562b62195c00 osd.20 has caps osdcap[grant(*)] 'allow *'

This is an all-SSD cluster with minimal load, and all hardware checks return good values. The cluster is currently running the latest Ceph Mimic release (13.2.0), but we have also experienced this on Luminous 12.2.2 and 12.2.5.

I'm starting to think this is a potential network driver issue. We're currently running kernel 4.14.15, and when we updated to the latest 4.17 the slow requests seemed to occur more frequently. The network cards we run are 10G Intel X540.

Does anyone know how I can debug this further?

Thanks,
Pavel
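
P.S. Some context in case it helps anyone reproduce: the op dump above is the kind of output the OSD admin socket gives you, pulled roughly like this (run on the node hosting the OSD; "NN" is just a placeholder for the osd id):

    ceph daemon osd.NN dump_historic_ops      # recently completed ops, with per-event timestamps
    ceph daemon osd.NN dump_ops_in_flight     # ops currently in flight / stuck waiting

I'm also considering bumping messenger debugging on the OSDs the next time it happens, e.g.:

    ceph tell 'osd.*' injectargs '--debug_ms 1'

to try to see when the subop message actually leaves the primary and arrives on the other OSD, but any other ideas would be appreciated.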