Quoting Patrick Donnelly (pdonnell@xxxxxxxxxx): > On Thu, Feb 28, 2019 at 12:49 PM Stefan Kooman <stefan@xxxxxx> wrote: > > > > Dear list, > > > > After upgrading to 12.2.11 the MDSes are reporting slow metadata IOs > > (MDS_SLOW_METADATA_IO). The metadata IOs would have been blocked for > > more that 5 seconds. We have one active, and one active standby MDS. All > > storage on SSD (Samsung PM863a / Intel DC4500). No other (OSD) slow ops > > reported. The MDSes are underutilized, only a handful of active clients > > and almost no load (fast hexacore CPU, 256 GB RAM, 20 Gb/s network). The > > cluster is also far from busy. > > > > I've dumped ops in flight on the MDSes but all ops that are printed are > > finished in a split second (duration: 0.000152), flag_point": "acquired > > locks". > > I believe you're looking at the wrong "ops" dump. You want to check > "objector_requests". I've done that, but not much timing info in there, i.e.: "ops": [ { "tid": 416231, "pg": "6.791ce4c5", "osd": 31, "object_id": "200.00024d0f", "object_locator": "@6", "target_object_id": "200.00024d0f", "target_object_locator": "@6", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "79790s", "attempts": 1, "snapid": "head", "snap_context": "0=[]", "mtime": "2019-03-01 16:39:29.0.41847s", "osd_ops": [ "write 798204~6249 [fadvise_dontneed]" ] } The MDS_SLOW_METADATA_IO warning has been added in commit https://github.com/ceph/ceph/commit/0f735f40315448560fde049ed3ea019a7d30d868#diff-74784f821b1aae68f768965680914268 Tracker issue: http://tracker.ceph.com/issues/24879 PR: 23022 It uses the "osd_op_complaint_time" as the "complaint_time". We have set the osd_op_complaint_time to 5 seconds. The tracker issue hints that slow metadata would be caused by a slow OSD (joural writes of MDS that take too long). However, we do not see any slow mds ops (data), nor slow ops. We've changed the parameter "mon_osd_warn_op_age" to 5 seconds as well so slow OSD ops would be reported as well. I do not think our OSDs are that slow, that often (~ at least every 15 minutes we would see that warning) ... so I really wonder what's going on. For now we've set the osd_op_complaint_time to "30" on the MDSes and do not see any SLOW METADATA warnings anymore. Anyone willing to set the osd_op_complaint_time to 5 seconds to see if those messages will start popping up? ceph tell mds.* config set osd_op_complaint_time 5 should do the trick ... you would need to be running 12.2.11 and /or mimic 13.2.2. Thanks, Stefan P.s. kudos to Wido den Hollander for helping us debug this. -- | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / info@xxxxxx _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com