In my experience I have seen something like this happen twice.

The first time there were unclean PGs because Ceph was down to one replica of a PG. Ceph blocks IO to the remaining replicas when the number of replicas falls below the 'min_size' parameter, and that manifests as blocked ops.

The second time the disk was 'soft-failing': it was gaining many bad sectors, but SMART still reported the drive as OK. Maybe check osd.5 and osd.7 for low-level media errors with a tool like MegaCli, or whatever controller management tool comes with your hardware.

At any rate, restarting the problem-child OSD is probably troubleshooting step #1, which you have done.

On 7/14/15, 6:45 AM, "Deneau, Tom" <tom.deneau@xxxxxxx> wrote:

>I don't think there were any stale or unclean PGs, (when there are,
>I have seen "health detail" list them and it did not in this case).
>I have since restarted the 2 osds and the health went immediately to
>HEALTH_OK.
>
>-- Tom
>
>> -----Original Message-----
>> From: Will.Boege [mailto:Will.Boege@xxxxxxxxxx]
>> Sent: Monday, July 13, 2015 10:19 PM
>> To: Deneau, Tom; ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: slow requests going up and down
>>
>> Does the ceph health detail show anything about stale or unclean PGs, or
>> are you just getting the blocked ops messages?
>>
>> On 7/13/15, 5:38 PM, "Deneau, Tom" <tom.deneau@xxxxxxx> wrote:
>>
>> >I have a cluster where over the weekend something happened and successive
>> >calls to ceph health detail show things like below.
>> >What does it mean when the number of blocked requests goes up and down
>> >like this?
>> >Some clients are still running successfully.
>> >
>> >-- Tom Deneau, AMD
>> >
>> >
>> >
>> >HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
>> >20 ops are blocked > 536871 sec
>> >2 ops are blocked > 536871 sec on osd.5
>> >18 ops are blocked > 536871 sec on osd.7
>> >2 osds have slow requests
>> >
>> >HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
>> >4 ops are blocked > 536871 sec
>> >2 ops are blocked > 536871 sec on osd.5
>> >2 ops are blocked > 536871 sec on osd.7
>> >2 osds have slow requests
>> >
>> >HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
>> >27 ops are blocked > 536871 sec
>> >2 ops are blocked > 536871 sec on osd.5
>> >25 ops are blocked > 536871 sec on osd.7
>> >2 osds have slow requests
>> >
>> >HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
>> >34 ops are blocked > 536871 sec
>> >9 ops are blocked > 536871 sec on osd.5
>> >25 ops are blocked > 536871 sec on osd.7
>> >2 osds have slow requests
>> >_______________________________________________
>> >ceph-users mailing list
>> >ceph-users@xxxxxxxxxxxxxx
>> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
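
To see which OSD the blocked ops concentrate on as the count goes up and down, the per-OSD lines of each "ceph health detail" snapshot can be tallied. This is only a rough sketch in Python, assuming the output has the same line format as the snapshots quoted in this thread; the function name blocked_ops_by_osd and the regular expression are illustrative and not part of Ceph.

```python
import re
from collections import defaultdict

# Matches per-OSD lines such as "18 ops are blocked > 536871 sec on osd.7".
# The format is assumed from the snapshots in this thread; other Ceph
# releases may word these lines differently.
PER_OSD = re.compile(r"(\d+) ops are blocked > ([\d.]+) sec on (osd\.\d+)")


def blocked_ops_by_osd(health_detail_text):
    """Return {osd_name: blocked op count} for one health detail snapshot."""
    totals = defaultdict(int)
    for line in health_detail_text.splitlines():
        match = PER_OSD.search(line)
        if match:
            # Sum in case one OSD is listed under several age buckets.
            totals[match.group(3)] += int(match.group(1))
    return dict(totals)


# First snapshot from the thread, with the mail quoting prefixes removed.
sample = """HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
20 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
18 ops are blocked > 536871 sec on osd.7
2 osds have slow requests"""

print(blocked_ops_by_osd(sample))
```

Running this over successive snapshots and comparing the dictionaries shows whether the fluctuation comes from one OSD (osd.7 in most of the snapshots above) or is spread across several.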