On Dec 20, 2013, "Yan, Zheng" <zheng.z.yan@xxxxxxxxx> wrote:

> On 12/20/2013 11:35 AM, Alexandre Oliva wrote:
>> On Dec 19, 2013, "Yan, Zheng" <zheng.z.yan@xxxxxxxxx> wrote:
>>
>>> next time you encountered log corruption, please open "new issues" at
>>> http://tracker.ceph.com/
>>
>> I have saved the dumped journal, but it contains too much information
>> I'm not supposed to or willing to share, especially with a US-based
>> company in the post-Snowden era :-(  (something like btrfs-image -s
>> might be good to have)
>>
>> I'd be glad to try to dig info that might be useful out of it, if told
>> how to do so, but uploading the journal as taken from the cluster is
>> not an option.

> can you send out the mds log (the latest few hundreds of lines) when
> the mds crashed.

It didn't quite crash.  Here's what it spit out:

mds.0.cache creating system inode with ino:1
mds.0.journaler(ro) try_read_entry got 0 len entry at offset 1248878082748
mds.0.log _replay journaler got error -22, aborting
mds.0.1307 boot_start encountered an error, failing
mds.0.1307 suicide. wanted down:dne, now up:replay

Before this, the mds had been brought down along with the rest of the
cluster:

mds.0.1273 *** got signal Terminated ***
mds.0.1273 suicide. wanted down:dne, now up:active

The --reset-journal session, which aborted because of the assertion
failure I mentioned in the patch, added the recent message log, but
there's nothing useful there: the mds asked the osds for each of the
log segments, got successful responses for all of them, and then
proceeded to reset the journal.  The only oddity I found was this:

client.411305.journaler(ro) _finish_probe_end write_pos = 1248890415494 (header had 1248890413881). recovered.

For reference, here's the complete loghead info:

client.411305.journaler(ro) _finish_read_head loghead(trim 1248761741312, expire 1248761744503, write 1248890413881). probing for end of log (from 1248890413881)...

Note how the offset at which we got the zero-len entry is not close to
either end of the (non-trimmed portion of) the log; a sketch of the
offset-to-object arithmetic is at the end of this message.

200.00048b1b, the object where the zero-len entry was, was not only
read successfully: the corresponding file in the osds had the correct
4MB size, and its md5sum was the same on all 4 osds that contained it.
So whatever corrupted the log happened long before the mds shutdown.

200.00048b1b's timestamp is 1387397117, while 200.00048b1e's (the tail
before the journal reset) is 1387430417; the 2 other files in between
are spaced at a nearly constant rate of 10000 seconds per 4MB.  The
mds log has nothing special for that time range, but from the mon and
osd logs I can tell that was about the time I rolled back many of the
osds to recent snapshots of theirs, from which I'd cleaned all traces
of the user.ceph._parent xattr.  I intended to roll back only 2 of the
4 replicas of each pg at a time, and to wait for them to recover, but
I recall that during this process two osds came down before they were
meant to, because of timeouts caused by the metadata-heavy recursive
setfattr.  I guess this may have caused me to lose changes that had
already been committed, by rolling back all the osds that had them.

So I think by now I'm happy to announce that it was an IO error (where
IO stands for Incompetence of the Operator ;-)

Sorry about the disturbance, and thanks for asking me to investigate
it further; that's how I found a probable cause that involves no fault
of Ceph's.
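P.S. In case anyone wants to double-check the object names above, here
is a minimal sketch (plain Python, not Ceph code) of the arithmetic I
used to map journal byte offsets to rados object names.  It assumes
the 4 MiB journal object size I observed on the osds and the
200.<8-hex-digit index> naming seen in the listings; journal_object()
is just an illustrative helper, not a Ceph API.

# Map a journal byte offset to the rados object that holds it,
# assuming 4 MiB journal objects and the 200.%08x naming above.
OBJECT_SIZE = 4 * 1024 * 1024   # 4 MiB, the size observed on the osds

def journal_object(offset, ino_hex="200"):
    """Name of the journal object containing the given byte offset."""
    return "%s.%08x" % (ino_hex, offset // OBJECT_SIZE)

print(journal_object(1248878082748))  # 200.00048b1b -- the zero-len entry
print(journal_object(1248890413881))  # 200.00048b1e -- header write_pos (the tail)
print(journal_object(1248761741312))  # 200.00048b00 -- trim position

Running it shows where the zero-len entry sits relative to the trim
and write positions reported in the loghead.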
-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer