> > I've added a 'scripts/check_pglog.sh $osddatadir' script that will just
> > look for any corruption.  Running that periodically will let you verify
> > that you haven't hit the same corruption without having to restart cosd.
>
> Goodie. I'll set up a cron job to run this automatically.  Should I
> do this on each osd?

Yeah, that would be ideal.

> > Ideally we can figure out what kind of workloads are triggering the
> > problem, and then reproduce it with sufficient logging enabled to find
> > where the race is taking place.
> >
> > If you have any details about the workload or any failure/recovery
> > activity that may have been going on at the time that may shed some light
> > on it...
>
> Funny enough, the problem occurred while the fs was idle.  I ran
> some simple tests a few days earlier and these completed with no
> problems.  From that point on, nobody ever accessed it.
>
> I'll let you know if I can trigger it reliably.  If you have any
> suggestions for a workload that is likely to trigger the race, I'm
> happy to give it a try.  Otherwise, I'll just use "stress" again to
> write to the cephfs from a couple of clients simultaneously.

I think the corruption can happen when the osd map updates.  The problem
is that the corrupt data is never read until cosd is restarted, usually a
long time after the actual problem occurred, so adding that cron job
should help narrow it down.

I would just proceed with your usual testing and see what happens.

Thanks!
sage
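
A minimal sketch of what that periodic check might look like as an
/etc/cron.d entry, assuming the source tree is checked out under
/root/ceph and the osd data directories are /data/osd0 and /data/osd1
(all placeholder paths, not taken from this thread; adjust for the
actual layout), with one line per osd on the host:

    # /etc/cron.d/check-pglog -- run the pg log corruption check hourly
    # on each local osd data dir and keep the output for later review.
    0 * * * *  root  /root/ceph/scripts/check_pglog.sh /data/osd0 >> /var/log/check_pglog.log 2>&1
    0 * * * *  root  /root/ceph/scripts/check_pglog.sh /data/osd1 >> /var/log/check_pglog.log 2>&1

If the script only prints when it finds something, dropping the
redirection and letting cron mail the output would also flag a hit
without anyone having to restart cosd.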