On Tue, Jun 08, 22:54, Sage Weil wrote:
> Hmm, okay. Unfortunately the logs don't have any clues. I'm giving up on
> solving the mystery this time around. You can go ahead and stop all the
> daemons and re-run mkcephfs.

OK, there was only stress-test data on the cephfs anyway. I'll start from
scratch and run another set of tests. Fortunately we haven't told the users
about the shiny new file system yet, so we are in no hurry ;)

> I've added a 'scripts/check_pglog.sh $osddatadir' script that will just
> look for any corruption. Running that periodically will let you verify
> that you haven't hit the same corruption without having to restart cosd.

Goodie. I'll set up a cron job to run this automatically; a rough sketch of
what I have in mind is in the P.S. below. Should I do this on each osd?

> Ideally we can figure out what kind of workloads are triggering the
> problem, and then reproduce it with sufficient logging enabled to find
> where the race is taking place.
>
> If you have any details about the workload or any failure/recovery
> activity that may have been going on at the time that may shed some light
> on it...

Funnily enough, the problem occurred while the fs was idle. I ran some
simple tests a few days earlier and these completed with no problems. From
that point on, nobody ever accessed it. I'll let you know if I can trigger
it reliably.

If you have any suggestions for a workload that is likely to trigger the
race, I'm happy to give it a try. Otherwise, I'll just use "stress" again to
write to the cephfs from a couple of clients simultaneously (see the
P.P.S.).

Thanks
Andre

--
The only person who always got his work done by Friday was Robinson Crusoe
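
P.S. Here is roughly the cron job I have in mind, in case you spot a
problem with it. It's only a sketch: the paths (/data/osd*, /root/ceph,
/root/bin) are placeholders for my setup, and I'm assuming check_pglog.sh
reports anything suspicious on stdout/stderr so that cron mails it to me:

    #!/bin/sh
    # Run the pglog corruption check against every local osd data
    # directory. Adjust CEPH_SRC and the osd data path to your setup.
    CEPH_SRC=/root/ceph
    for dir in /data/osd*; do
        "$CEPH_SRC"/scripts/check_pglog.sh "$dir"
    done

with a crontab entry along the lines of

    0 * * * * /root/bin/check-pglogs.sh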
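
P.P.S. The stress run would look much like last time, started on each
client with the cephfs mounted (the mount point below is just an example);
--hdd spawns workers that keep writing and unlinking files in the current
directory:

    # per-client scratch directory on the cephfs mount
    mkdir -p /mnt/ceph/$(hostname)
    cd /mnt/ceph/$(hostname) && stress --hdd 4 --hdd-bytes 2G --timeout 1d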