Re: Is Ceph recovery able to handle massive crash

Gregory Farnum <greg@xxxxxxxxxxx> · Tue, 8 Jan 2013 15:36:59 -0800

On Tue, Jan 8, 2013 at 11:44 AM, Denis Fondras <ceph@xxxxxxxxxxx> wrote:
> Hello,
>
>
>> What error message do you get when you try and turn it on? If the
>> daemon is crashing, what is the backtrace?
>
>
> The daemon is crashing. Here is the full log if you want to take a look:
> http://vps.ledeuns.net/ceph-osd.0.log.gz
>
> The RBD rebuild script helped to get the data back. I will now try to
> rebuild a Ceph cluster and do some more tests.
>
> Denis

It looks like writes are taking approximately forever to complete to
disk; the OSD is shutting itself down because its threads are going off
to write and not coming back. If you set "osd op thread timeout = 60"
(or 120), it might manage to churn through, but I'd look into why the
writes are taking so long: a bad disk, a fragmented btrfs filesystem,
or something else.
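For reference, a minimal sketch of where that setting could go, assuming the default /etc/ceph/ceph.conf location and that you only want to change osd.0 (the daemon whose log is linked above). Use a plain [osd] section instead to apply it to every OSD. The value is in seconds, and the OSD needs to be restarted to pick it up from the config file.

```ini
# /etc/ceph/ceph.conf (assumed default location)
[osd.0]
    # Give each op thread up to 60 s (try 120 if that is not enough)
    # before the OSD considers it stuck. This only buys time for slow
    # writes; it does not fix the underlying disk or filesystem problem.
    osd op thread timeout = 60
```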
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html