You're running 0.87-6. There were various fixes for this problem in Firefly. Were any of these snapshots created on an early version of Firefly?
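(If you need to confirm what the daemons themselves are running now -- it won't tell you which release a given snapshot was created under -- something like this should report each OSD's version:

    ceph tell osd.* version

or per-daemon, ceph tell osd.N version.)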
So far, every fix for this issue has gotten developers involved. I'd see if you can talk to some devs on IRC, or post to the ceph-devel mailing list.
My own experience is that I had to delete the affected PGs, and force create them. Hopefully there's a better answer now.
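Roughly, that meant something like the following; the OSD path and PG ID here are just placeholders, and force-creating a PG throws away whatever data it held, so double-check everything against your own cluster first:

    # with the crashing OSD stopped, remove the broken PG copy from its store
    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 3.1a --op remove

    # then tell the monitors to recreate the PG empty
    ceph pg force_create_pg 3.1a

(ceph_objectstore_tool ships with Giant; verify the exact tool name and options on your version before running anything destructive.)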
On Fri, Nov 7, 2014 at 8:10 PM, Chu Duc Minh <chu.ducminh@xxxxxxxxx> wrote:
Thank you!

After many retries and checking the logs, I guess reason (2) is the main cause, because if (1) were the main cause, the other OSDs (which also contain the buggy volume/snapshot) would crash too.

One of my OSDs has problems and can NOT be started. I tried to start it many times, but it always crashes a few minutes after starting. I can think of two reasons for the crash:
1. A read/write request hits this OSD and, due to the corrupted volume/snapshot/parent-image/..., it crashes.
2. The recovery process can NOT work properly due to the corrupted volumes/snapshot/parent-image/...

State of my ceph cluster (just a few seconds before crash time):
111/57706299 objects degraded (0.001%)
14918 active+clean
1 active+clean+scrubbing+deep
52 active+recovery_wait+degraded
2 active+recovering+degraded
PS: I have attached the crash-dump log of that OSD to this email for your information.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com