Hmm, okay. Unfortunately the logs don't have any clues. I'm giving up on
solving the mystery this time around. You can go ahead and stop all the
daemons and re-run mkcephfs.

I've added a 'scripts/check_pglog.sh $osddatadir' script that will just
look for any corruption. Running that periodically will let you verify
that you haven't hit the same corruption without having to restart cosd.

Ideally we can figure out what kind of workload is triggering the
problem, and then reproduce it with sufficient logging enabled to find
where the race is taking place. If you have any details about the
workload, or about any failure/recovery activity that may have been
going on at the time, that may shed some light on it...

FYI, this is http://tracker.newdream.net/issues/114

Thanks!
sage

On Tue, 8 Jun 2010, Andre Noll wrote:

> On Mon, Jun 07, 10:20, Sage Weil wrote:
>
> > Do you have osd logs?
>
> Should all be there. At least I did not remove anything.
>
> > This is the same corruption I've seen previously, but I've just
> > reaudited the code I suspect and it looks ok. Some insight into what
> > happened to the cluster would help. Which osd is it?
>
> It's osd6 running on node142. I was running osd from ceph-v0.20.2 but
> also tried the osd compiled from the "testing" branch of the git repo.
>
> > Do you still have the logs (/var/log/ceph/osd$n*)? The osdmap
> > sequence (tarball of $mon_data/osdmap) would be helpful too.
>
> I created a tarball of the full /var/log/ceph directory of node142
> (which is only a storage node) and will send it to you off-line.
>
> We have three monitors. I'll send a tarball of /var/ceph/mon0/monmap/
> of mon0 (node141) as well.
>
> Thanks
> Andre
> --
> The only person who always got his work done by Friday was Robinson Crusoe
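
A minimal sketch of how the periodic check mentioned above might be
wired up from cron. The checkout path, the osd data directory, and the
assumption that check_pglog.sh exits non-zero when it detects corruption
are guesses for illustration, not details from the thread; adjust them
to your own setup.

    #!/bin/sh
    # check_pglog_cron.sh (hypothetical wrapper, not part of the ceph tree)
    #
    # Assumed locations; change these to match your installation.
    CEPH_SRC=/usr/src/ceph        # checkout containing scripts/check_pglog.sh
    OSD_DATA=/data/osd6           # osd data dir (osd6 on node142 in this thread)

    # Assumes check_pglog.sh exits non-zero when it finds corruption.
    if ! "$CEPH_SRC/scripts/check_pglog.sh" "$OSD_DATA"; then
        echo "pg log corruption detected in $OSD_DATA (see issue 114)" \
            | mail -s "check_pglog failure on $(hostname)" root
    fi

An hourly crontab entry for it could look like:

    0 * * * * /usr/local/sbin/check_pglog_cron.sh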