On Tue, Jun 08, 22:54, Sage Weil wrote:
> Hmm, okay. Unfortunately the logs don't have any clues. I'm giving up on
> solving the mystery this time around. You can go ahead and stop all the
> daemons and re-run mkcephfs.

OK, there was only stress-test data on the cephfs anyway. I'll start from
scratch and run another set of tests. Fortunately we haven't told the users
about the shiny new file system yet, so we are in no hurry ;)

> I've added a 'scripts/check_pglog.sh $osddatadir' script that will just
> look for any corruption. Running that periodically will let you verify
> that you haven't hit the same corruption without having to restart cosd.

Goodie. I'll set up a cron job to run this automatically; a rough sketch of
what I have in mind is in the P.S. below. Should I do this on each osd?

> Ideally we can figure out what kind of workloads are triggering the
> problem, and then reproduce it with sufficient logging enabled to find
> where the race is taking place.
>
> If you have any details about the workload or any failure/recovery
> activity that may have been going on at the time that may shed some light
> on it...

Funnily enough, the problem occurred while the fs was idle. I ran some
simple tests a few days earlier and these completed with no problems. From
that point on, nobody ever accessed it. I'll let you know if I can trigger
it reliably.

If you have any suggestions for a workload that is likely to trigger the
race, I'm happy to give it a try. Otherwise, I'll just use "stress" again to
write to the cephfs from a couple of clients simultaneously (see the
P.P.S.).

Thanks
Andre

--
The only person who always got his work done by Friday was Robinson Crusoe
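
P.S. Here is roughly the cron job I have in mind, in case you spot a
problem with it. It's only a sketch: the paths (/data/osd*, /root/ceph,
/root/bin) are placeholders for my setup, and I'm assuming check_pglog.sh
reports anything suspicious on stdout/stderr so that cron mails it to me:

    #!/bin/sh
    # Run the pglog corruption check against every local osd data
    # directory. Adjust CEPH_SRC and the osd data path to your setup.
    CEPH_SRC=/root/ceph
    for dir in /data/osd*; do
        "$CEPH_SRC"/scripts/check_pglog.sh "$dir"
    done

with a crontab entry along the lines of

    0 * * * * /root/bin/check-pglogs.sh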
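
P.P.S. The stress run would look much like last time, started on each
client with the cephfs mounted (the mount point below is just an example);
--hdd spawns workers that keep writing and unlinking files in the current
directory:

    # per-client scratch directory on the cephfs mount
    mkdir -p /mnt/ceph/$(hostname)
    cd /mnt/ceph/$(hostname) && stress --hdd 4 --hdd-bytes 2G --timeout 1d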