Hello folks,

We've been having issues with slow requests cropping up on practically idle Ceph clusters. From what I can tell, the requests are hanging waiting for subops, and the OSD on the other end receives the request minutes later! In the example below, the op started waiting for subops at 12:09:51 and the subop was not completed until 12:14:28:

    {
        "description": "osd_op(client.903117.0:569924 6.391 6:89ed76f2:::%2fraster%2fv5%2fes%2f16%2f36320%2f24112:head [writefull 0~2072] snapc 0=[] ondisk+write+known_if_redirected e5777)",
        "initiated_at": "2018-07-05 12:09:51.191419",
        "age": 326.651167,
        "duration": 276.977834,
        "type_data": {
            "flag_point": "commit sent; apply or cleanup",
            "client_info": {
                "client": "client.903117",
                "client_addr": "10.20.31.234:0/1433094386",
                "tid": 569924
            },
            "events": [
                { "time": "2018-07-05 12:09:51.191419", "event": "initiated" },
                { "time": "2018-07-05 12:09:51.191471", "event": "queued_for_pg" },
                { "time": "2018-07-05 12:09:51.191538", "event": "reached_pg" },
                { "time": "2018-07-05 12:09:51.191877", "event": "started" },
                { "time": "2018-07-05 12:09:51.192135", "event": "waiting for subops from 11" },
                { "time": "2018-07-05 12:09:51.192599", "event": "op_commit" },
                { "time": "2018-07-05 12:09:51.192616", "event": "op_applied" },
                { "time": "2018-07-05 12:14:28.169018", "event": "sub_op_commit_rec from 11" },
                { "time": "2018-07-05 12:14:28.169164", "event": "commit_sent" },
                { "time": "2018-07-05 12:14:28.169253", "event": "done" }
            ]
        }
    },

Below is what I assume is the corresponding request on osd.11; it seems to receive the network request roughly four minutes later:

    2018-07-05 12:14:28.058552 7fb75ee0e700 20 osd.11 5777 share_map_peer 0x562b61bca000 already has epoch 5777
    2018-07-05 12:14:28.167247 7fb75de0c700 10 osd.11 5777 new session 0x562cc23f0200 con=0x562baaa0e000 addr=10.16.15.28:6805/3218
    2018-07-05 12:14:28.167282 7fb75de0c700 10 osd.11 5777 session 0x562cc23f0200 osd.20 has caps osdcap[grant(*)] 'allow *'
    2018-07-05 12:14:28.167291 7fb75de0c700 0 -- 10.16.16.32:6817/3808 >> 10.16.15.28:6805/3218 conn(0x562baaa0e000 :6817 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 20 vs existing csq=19 existing_state=STATE_STANDBY
    2018-07-05 12:14:28.167322 7fb7546d6700 2 osd.11 5777 ms_handle_reset con 0x562baaa0e000 session 0x562cc23f0200
    2018-07-05 12:14:28.167546 7fb75de0c700 10 osd.11 5777 session 0x562b62195c00 osd.20 has caps osdcap[grant(*)] 'allow *'

This is an all-SSD cluster with minimal load, and all hardware checks return good values. The cluster is currently running the latest Ceph Mimic release (13.2.0), but we have also experienced this on Luminous 12.2.2 and 12.2.5.

I'm starting to think this is a potential network driver issue. We're currently running kernel 4.14.15, and when we updated to the latest 4.17 the slow requests seemed to occur more frequently. The network cards we run are 10G Intel X540.

Does anyone know how I can debug this further?

Thanks,
Pavel
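
P.S. Some context in case it helps anyone reproduce: the op dump above is the kind of output the OSD admin socket gives you, pulled roughly like this (run on the node hosting the OSD; "NN" is just a placeholder for the osd id):

    ceph daemon osd.NN dump_historic_ops      # recently completed ops, with per-event timestamps
    ceph daemon osd.NN dump_ops_in_flight     # ops currently in flight / stuck waiting

I'm also considering bumping messenger debugging on the OSDs the next time it happens, e.g.:

    ceph tell 'osd.*' injectargs '--debug_ms 1'

to try to see when the subop message actually leaves the primary and arrives on the other OSD, but any other ideas would be appreciated.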