On Wed, Jun 21, 2017 at 4:16 PM, Peter Maloney
<peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
> On 06/14/17 11:59, Dan van der Ster wrote:
>> Dear ceph users,
>>
>> Today we had O(100) slow requests which were caused by deep-scrubbing
>> of the metadata log:
>>
>> 2017-06-14 11:07:55.373184 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
>> deep-scrub starts
>> ...
>> 2017-06-14 11:22:04.143903 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
>> request 480.140904 seconds old, received at 2017-06-14
>> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
>> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
>> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
>> scrub
>> ...
>> 2017-06-14 11:22:06.729306 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
>> deep-scrub ok
>
> This looks just like my problem in my thread on ceph-devel, "another
> scrub bug? blocked for > 10240.948831 secs", except that your scrub
> eventually finished (mine ran for hours before I stopped it manually),
> and I'm not using rgw.
>
> Sage commented that it is likely related to snaps being removed at
> some point and interacting with scrub.
>
> Restarting the OSD mentioned there (osd.155 in your case) will fix it
> for now. And tuning the scrub settings changes the way it behaves
> (the defaults make it happen more rarely than what I had before).

In my case it's not related to snaps -- there are no snaps (or snap
trimming) in a normal set of rgw pools. My problem is with the cls_log
class, which tries to do a lot of work in a single op, timing out the
OSDs.

Well, the *real* problem in my case is the rgw mdlog, which can grow
unboundedly, eventually becoming un-scrubbable and leaving a huge
amount of cleanup to be done. (Commands for working around the stuck
scrub and trimming the mdlog are sketched below.)

-- dan
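
For anyone hitting the same symptom, a minimal sketch of the workaround
Peter describes above -- pause deep-scrubs, bounce the stuck OSD, and
throttle scrub. The OSD id (155) and the sleep value are examples for
this particular cluster, and the systemd unit name assumes a
systemd-based deployment:

    # stop scheduling new deep-scrubs cluster-wide while you recover
    ceph osd set nodeep-scrub

    # restart the OSD stuck in the long deep-scrub to release blocked ops
    systemctl restart ceph-osd@155

    # optionally make scrub yield more often to client I/O
    ceph tell osd.155 injectargs '--osd_scrub_sleep 0.1'

    # re-enable deep-scrubs once the slow requests have cleared
    ceph osd unset nodeep-scrub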
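
For the mdlog cleanup itself, something along these lines should work
with radosgw-admin. Shard 54 matches the meta.log.*.54 object in the
log excerpt above; 'MARKER' is a placeholder for a marker taken from
the list output, and the exact trim flags (marker- vs. time-based) vary
between versions, so check your radosgw-admin help first. Note that in
a multisite setup the mdlog is consumed by metadata sync, so don't trim
entries the other zones haven't caught up to:

    # inspect what has accumulated in one shard of the metadata log
    radosgw-admin mdlog list --shard-id=54

    # trim that shard up to a marker you've verified is safe
    radosgw-admin mdlog trim --shard-id=54 --end-marker=MARKER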