On Thu, 17 May 2012, Karol Jurak wrote:
> Hi,
>
> During an ongoing recovery in one of my clusters, a couple of OSDs
> complained about a too-small journal. For instance:
>
> 2012-05-12 13:31:04.034144 7f491061d700 1 journal check_for_full at
> 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
> start 863363072)
> 2012-05-12 13:31:04.034680 7f491061d700 0 journal JOURNAL TOO SMALL: item
> 1693745152 > journal 1048571904 (usable)
>
> I was under the impression that the OSDs stopped participating in recovery
> after this event. (ceph -w showed that the number of PGs in state
> active+clean no longer increased.) They resumed recovery after I enlarged
> their journals (stop osd, --flush-journal, --mkjournal, start osd).
>
> How serious is such a situation? Do the OSDs know how to handle it
> correctly? Or could this result in some data loss or corruption? After the
> recovery finished (ceph -w showed that all PGs were in the active+clean
> state), I noticed that a few rbd images were corrupted.

The OSDs tolerate a full journal. There will be a big latency spike, but
they'll recover without risking data. You should definitely increase the
journal size if this happens regularly, though.

sage

> The cluster runs v0.46. The OSDs use ext4. I'm pretty sure that during the
> recovery no clients were accessing the cluster.
>
> Best regards,
> Karol
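
The journal resize procedure Karol describes (stop osd, --flush-journal,
--mkjournal, start osd) looks roughly like the sketch below. This is only an
illustration: the osd id, the init invocation, and the exact config knobs
edited are assumptions for a v0.46-era sysvinit deployment, not a verified
recipe.

  # stop the OSD daemon so the journal is quiescent
  # (the init invocation may differ per distribution)
  service ceph stop osd.0

  # write any outstanding journal entries out to the object store
  ceph-osd -i 0 --flush-journal

  # raise "osd journal size" (in MB) for this OSD in ceph.conf, or point
  # "osd journal" at a larger file or device, then recreate the journal
  ceph-osd -i 0 --mkjournal

  # bring the OSD back up
  service ceph start osd.0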