On Thursday 17 May 2012 18:01:55 Sage Weil wrote:
> On Thu, 17 May 2012, Karol Jurak wrote:
> > Hi,
> >
> > During an ongoing recovery in one of my clusters, a couple of OSDs
> > complained about a too-small journal. For instance:
> >
> > 2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at
> > 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
> > start 863363072)
> > 2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL:
> > item 1693745152 > journal 1048571904 (usable)
> >
> > I was under the impression that the OSDs stopped participating in
> > recovery after this event. (ceph -w showed that the number of PGs in
> > the active+clean state no longer increased.) They resumed recovery
> > after I enlarged their journals (stop osd, --flush-journal,
> > --mkjournal, start osd).
> >
> > How serious is such a situation? Do the OSDs know how to handle it
> > correctly? Or could this result in data loss or corruption? After
> > the recovery finished (ceph -w showed that all PGs were in the
> > active+clean state), I noticed that a few rbd images were corrupted.
>
> The OSDs tolerate the full journal. There will be a big latency spike,
> but they'll recover without risking data. You should definitely
> increase the journal size if this happens regularly, though.

Thank you for the clarification and advice.

Karol
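For reference, the journal-enlargement procedure mentioned above (stop the OSD, flush and recreate its journal, restart) might look roughly like the sketch below. This is a hedged outline, not an exact transcript of what was run: the OSD id (0), the configuration edit, and the service-management commands are assumptions and will vary by distribution and Ceph version.

```shell
#!/bin/sh
# Sketch: enlarge the journal of a single OSD (here assumed to be osd.0).
# Assumes the journal size is controlled via "osd journal size" in ceph.conf
# and that the init-style "service ceph" wrapper is available.

service ceph stop osd.0            # stop the OSD so the journal is quiescent

ceph-osd -i 0 --flush-journal      # write all pending journal entries to the store

# Now raise the journal size, e.g. edit /etc/ceph/ceph.conf and set a larger
# "osd journal size" (in MB), or replace the journal file/partition with a
# bigger one. (Manual step; not shown here.)

ceph-osd -i 0 --mkjournal          # create a fresh, larger journal

service ceph start osd.0           # bring the OSD back into the cluster
```

After the OSD rejoins, `ceph -w` should show recovery resuming and PGs returning to active+clean.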