70+ OSD are DOWN and not coming up

On 5/22/14 00:26 , Craig Lewis wrote:
> On 5/21/14 21:15 , Sage Weil wrote:
>> On Wed, 21 May 2014, Craig Lewis wrote:
>>> If you do this over IRC, can you please post a summary to the mailing
>>> list?
>>>
>>> I believe I'm having this issue as well.
>> In the other case, we found that some of the OSDs were behind processing
>> maps (by several thousand epochs).  The trick here to give them a chance
>> to catch up is
>>
>>   ceph osd set noup
>>   ceph osd set nodown
>>   ceph osd set noout
>>
>> and wait for them to stop spinning on the CPU.  You can check which map
>> each OSD is on with
>>
>>   ceph daemon osd.NNN status
>>
>> to see which epoch they are on and compare that to
>>
>>   ceph osd stat
>>
>> Once they are within 100 epochs of the current map,
>>
>>   ceph osd unset noup
>>
>> and let them all start up.
>>
>> We haven't determined whether the original problem was caused by this or
>> the other way around; we'll see once they are all caught up.
>>
>> sage
>
> I was seeing the CPU spinning too, so I think it is the same issue.  
> Thanks for the explanation!  I've been pulling my hair out for weeks.
>
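The catch-up check in the quoted procedure comes down to comparing two
epoch numbers. A minimal sketch, assuming the JSON printed by
`ceph daemon osd.NNN status` carries a numeric `newest_map` field (verify
this on your version). The helper names `epoch_lag`, `caught_up`, and
`newest_map_of` are made up for illustration; the default threshold of 100
is the figure from the procedure above.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not part of Ceph.

# epoch_lag CLUSTER_EPOCH OSD_EPOCH
# Prints how many epochs the OSD is behind the cluster.
epoch_lag() {
    echo $(( $1 - $2 ))
}

# caught_up CLUSTER_EPOCH OSD_EPOCH [THRESHOLD]
# Succeeds when the lag is at most THRESHOLD (default 100).
caught_up() {
    [ "$(epoch_lag "$1" "$2")" -le "${3:-100}" ]
}

# newest_map_of
# Reads status JSON on stdin and prints the value of "newest_map".
# Assumption: the field exists and is a plain integer.
newest_map_of() {
    sed -n 's/.*"newest_map": *\([0-9][0-9]*\).*/\1/p'
}

# Possible use on an OSD host. Read the cluster epoch off `ceph osd stat`
# by eye, since its text format is not parsed here:
#   osd_epoch=$(ceph daemon osd.NNN status | newest_map_of)
#   caught_up "$cluster_epoch" "$osd_epoch" && echo "osd.NNN is close enough"
```

The arithmetic is kept separate from the ceph calls so it can be checked
without a cluster.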

This process solved my problem, with one caveat.  When I followed it, I 
filled up /var/log/ceph/ and the recovery failed.  I had to manually run 
each OSD in debugging mode until it completed the map update.  Aside 
from that, I followed your procedure.
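
Since a full /var/log/ceph/ is what broke the recovery here, a free-space
check before starting may be worth having. A minimal sketch using POSIX
`df`; the helper names `free_kb` and `enough_space` and the 1 GiB figure in
the usage comment are illustrative assumptions, not values from this thread.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not part of Ceph.

# free_kb DIR
# Prints the available space, in KiB, on the filesystem holding DIR.
# -P forces the POSIX one-line-per-filesystem layout; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_space DIR MIN_KB
# Succeeds when at least MIN_KB KiB are free under DIR.
enough_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Possible use before letting the OSDs catch up:
#   enough_space /var/log/ceph 1048576 || echo "under 1 GiB free for logs" >&2
```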

After that, I was able to start everything normally, and the cluster 
recovered within a couple of hours.


This has been keeping me awake at night.  So far, it only happened to my 
slave cluster.  I've been living in dread of seeing this happen to my 
master cluster.  Now I know why the master cluster has been safe.  When 
my master cluster had problems, I intervened quickly (usually rebooting 
the node).  When the slave had problems, I fixed it in the morning.  
That extra delay was enough time to cause this issue.

Thank you!



-- 

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email clewis at centraldesktop.com <mailto:clewis at centraldesktop.com>
