2011/7/26 Sage Weil <sage@xxxxxxxxxxxx>:
> On Tue, 26 Jul 2011, Christian Brunner wrote:
>> OK I've solved this by myself.
>>
>> Since I knew that there is replication between
>>
>> osd001 and osd005,
>>
>> as well as
>>
>> osd001 and osd015,
>> osd001 and osd012,
>>
>> I decided to take osd005, osd012 and osd015 offline. After that ceph
>> started to rebuild the PGs on other nodes.
>
> At the same time you mean?  Or just restarted them?

At the same time.

> The usual way to debug these situations is:
>
>  - identify a stuck pg
>  - figure out what osds it maps to.  [15,1]
>  - turn on logs on those nodes:
>        ceph osd tell 15 injectargs '--debug-osd 20 --debug-ms 1'
>        ceph osd tell 1 injectargs '--debug-osd 20 --debug-ms 1'
>  - restart peering by toggling the primary (first osd, 15)
>        ceph osd down 15
>  - send us the resulting logs (for all nodes)
>
> Even better if you also include other (old) osds that include pg data
> (osd1 in your case) in this.
>
> We definitely want to fix the core issue, so any help gathering the logs
> would be appreciated!  It's also possible that the above will 'fix' it
> because the peering issue is hard to hit.  In that case, cranking up the
> debug level after the initial crash but before you restart everything
> might be a good idea.

I will turn on debugging next time. I think it is possible to hit the
issue when an osd that is the destination of a rebuild fails while the
rebuild is in progress. But I have not verified this.

Regards,
Christian
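
For reference, the log-gathering steps Sage outlines above can be wrapped in a
small shell script. This is only a minimal sketch of that procedure, not an
official tool: it assumes the stuck PG's acting set is [15,1] as in this
thread, and the osd ids and log location are placeholders you would adjust for
your own cluster.

    #!/bin/sh
    # Sketch of the log-gathering steps from Sage's mail, assuming the
    # stuck pg maps to osds [15,1] with 15 as the primary.

    PRIMARY=15          # first osd in the acting set
    OSDS="15 1"         # all osds the stuck pg maps to (plus old osds
                        # that still hold pg data, e.g. osd1 here)

    # Crank up osd and messenger debugging on the involved osds.
    for id in $OSDS; do
        ceph osd tell "$id" injectargs '--debug-osd 20 --debug-ms 1'
    done

    # Restart peering by marking the primary down; the running daemon
    # will be marked up again and re-peer, this time with verbose logs.
    ceph osd down "$PRIMARY"

    # Afterwards, collect the osd logs (typically under /var/log/ceph/)
    # from each of the involved nodes and send them in.

Note that "ceph osd down" only marks the osd down in the osdmap; the daemon
keeps running, reports itself, and is marked up again, which is what
re-triggers peering with the higher debug level in effect.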