> > I've added a 'scripts/check_pglog.sh $osddatadir' script that will just
> > look for any corruption.  Running that periodically will let you verify
> > that you haven't hit the same corruption without having to restart cosd.
>
> Goodie. I'll set up a cron job to run this automatically.  Should I
> do this on each osd?

Yeah, that would be ideal.

> > Ideally we can figure out what kind of workloads are triggering the
> > problem, and then reproduce it with sufficient logging enabled to find
> > where the race is taking place.
> >
> > If you have any details about the workload or any failure/recovery
> > activity that may have been going on at the time that may shed some light
> > on it...
>
> Funny enough, the problem occurred while the fs was idle.  I ran
> some simple tests a few days earlier and these completed with no
> problems.  From that point on, nobody ever accessed it.
>
> I'll let you know if I can trigger it reliably.  If you have any
> suggestions for a workload that is likely to trigger the race, I'm
> happy to give it a try.  Otherwise, I'll just use "stress" again to
> write to the cephfs from a couple of clients simultaneously.

I think the corruption can happen when the osd map updates.  The problem
is that the corrupt data is never read until cosd is restarted, usually a
long time after the actual problem occurred, so adding that cron job
should help narrow it down.

I would just proceed with your usual testing and see what happens.

Thanks!
sage
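
A minimal sketch of what that periodic check might look like as an
/etc/cron.d entry, assuming the source tree is checked out under
/root/ceph and the osd data directories are /data/osd0 and /data/osd1
(all placeholder paths, not taken from this thread; adjust for the
actual layout), with one line per osd on the host:

    # /etc/cron.d/check-pglog -- run the pg log corruption check hourly
    # on each local osd data dir and keep the output for later review.
    0 * * * *  root  /root/ceph/scripts/check_pglog.sh /data/osd0 >> /var/log/check_pglog.log 2>&1
    0 * * * *  root  /root/ceph/scripts/check_pglog.sh /data/osd1 >> /var/log/check_pglog.log 2>&1

If the script only prints when it finds something, dropping the
redirection and letting cron mail the output would also flag a hit
without anyone having to restart cosd.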