Hi Sage,

I have a high-level idea about this, but I haven't fully traced the code, so please forgive me if it's too naive.

On Thu, Sep 23, 2010 at 11:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> One problem that keeps popping up is corruption in the PG logs. This
> usually manifests itself as a crash when the OSD restarts and is unable to
> parse the log. There are a couple of things to do here.
>
> First, we need to figure out where the corruption is coming from. Dumps
> of the corrupt pglog files will help. Are they zeroed out? Entirely? Is
> there valid data at the end of the file? etc.
>

I am thinking of adding a checksum to each log entry, so that when the OSD restarts and parses the log it can detect whether the data is corrupt.

> Second, we need to come up with a reasonable way to start up even when
> some PGs are corrupt. Deleting them is one option, but we want to avoid
> doing so unless we're sure we have a good copy elsewhere.
>

With a checksum on each log entry, we would be able to ignore the corrupted log, and hopefully it can be rebuilt. This is the part I am not certain about: we manually deleted the single PGinfo that caused the error and saw the OSD start successfully, but because of our limited knowledge of the current implementation we are not sure whether everything was rebuilt properly.

> Another option would be to make a 'corrupt' subdirectory on the OSD and
> move the log there. Without the log, the OSD will need to rebuild the
> object list to recover/resync with other PG copies, but at least it will
> start and (eventually) recover.
>
> http://tracker.newdream.net/issues/169
>
> Thoughts?
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
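
To make the per-entry checksum idea from my reply a bit more concrete, here is a rough sketch. It is only an illustration: the record framing (length, CRC32, payload), the function names, and the use of Python are all my own assumptions, not the actual PG log format or anything in the OSD code. The point is just that a parser can stop at the first bad record and report that the log is not clean, instead of crashing.

```python
import struct
import zlib

# Hypothetical framing, NOT Ceph's actual PG log format:
# each record is <u32 payload length><u32 CRC32 of payload><payload>.
HEADER = struct.Struct("<II")


def encode_entry(payload: bytes) -> bytes:
    """Frame one log entry with its length and CRC32."""
    if not payload:
        # Empty entries are never written, so a zero length on disk
        # can be treated as corruption (e.g. a zeroed-out region).
        raise ValueError("empty log entries are not allowed")
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def parse_log(data: bytes):
    """Return (entries, clean).

    Parsing stops at the first record that is truncated, has a zero
    length, or fails its checksum. Entries before that point are
    returned, and clean is False so the caller knows the rest of the
    log cannot be trusted.
    """
    entries = []
    offset = 0
    while offset < len(data):
        if offset + HEADER.size > len(data):
            return entries, False  # truncated header
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        end = start + length
        if length == 0 or end > len(data):
            return entries, False  # zeroed region or bogus length
        payload = data[start:end]
        if zlib.crc32(payload) != crc:
            return entries, False  # payload does not match its checksum
        entries.append(payload)
        offset = end
    return entries, True
```

When `clean` comes back False, the OSD could move the log aside (for example into the 'corrupt' subdirectory you suggested) and resync from the other PG copies. Whether the entries before the bad record are safe to keep is something I can't judge without knowing the code better.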