OSD suffers problems after filesystem crashed and recovered.

On 5/29/14 01:09 , Felix Lee wrote:
> Dear experts,
> Recently, a disk for one of our OSDs failed and took the OSD down.
> After I recovered the disk and filesystem, I noticed two problems:
>
> 1. Journal corruption, which prevents the OSD from starting:
>
>
>
> 2. I guess I may use ceph-osd with the "--mkjournal" option to fix the
> journal corruption, but another thing bothers me: the previous OSD
> daemon is stuck in the "D" state, so it can't be terminated. Usually,
> once the filesystem has recovered, a process should be able to leave
> the D state, so I am not sure what causes this or whether I can
> ignore it without bad consequences.
>
> In any case, I would be very grateful if you experts could shed some
> light on this.
>
> Our current Ceph version is ceph-0.72.2-0.el6.x86_64.
> The backend filesystem is XFS on fiber direct-attached storage.
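
On the "D" state: that is uninterruptible sleep, which usually means the
process is blocked in the kernel waiting on I/O, and signals (including
SIGKILL) are not delivered until it leaves that state. A minimal sketch
for listing which ceph-osd processes are stuck there, assuming a Linux
/proc filesystem; the function names are mine and purely illustrative:

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left].strip())
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def find_stuck(name="ceph-osd", proc_root="/proc"):
    """Return the sorted PIDs of processes called `name` that are in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if comm == name and state == "D":
            stuck.append(pid)
    return sorted(stuck)


if __name__ == "__main__":
    for pid in find_stuck():
        print(pid)
```

If the list stays non-empty after the filesystem is back, the process is
still waiting on something in the kernel, and a reboot is likely the only
way to clear it.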


I can't speak to the specific errors you're seeing, but it looks like 
you have a failing or corrupted disk.

Things I would investigate:

 1. Is the disk itself failing?  If this were a SATA disk, I'd check the
    SMART stats on the disk.  I haven't dealt with Fiber Channel disks
    since before SMART was standardized, so I can't tell you how to do
    that.
 2. Get rid of the old ceph-osd process.  Reboot the node if you have
    to.  If things come up cleanly, then you're done.
 3. Fsck the filesystem.  If the FS is clean, then you probably
    corrupted the OSD journal.
 4. How quickly do you need this fixed?  At this point, I'm out of
    suggestions, so I'd remove the osd, zap it, and add it back in. If
    you can wait, somebody might have a better suggestion.
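
For reference, the manual removal in step 4 usually follows the sequence
in the Ceph documentation: mark the OSD out, stop the daemon, remove it
from the CRUSH map, delete its auth key, then remove the OSD itself.
Below is a minimal sketch that only prints the commands for review and
runs nothing. The OSD id 12 is a made-up example and the function name is
illustrative; check the subcommands against the docs for your release.
Zapping the disk and re-adding the OSD are not covered here.

```python
def osd_removal_commands(osd_id):
    """Build the ceph CLI calls that remove one OSD from the cluster.

    Returns a list of argv lists. Nothing is executed here; stop the
    ceph-osd daemon yourself between "out" and "crush remove".
    """
    if not isinstance(osd_id, int) or osd_id < 0:
        raise ValueError("osd_id must be a non-negative integer")
    name = "osd.%d" % osd_id
    return [
        ["ceph", "osd", "out", str(osd_id)],       # stop placing data on it
        ["ceph", "osd", "crush", "remove", name],  # drop it from the CRUSH map
        ["ceph", "auth", "del", name],             # delete its auth key
        ["ceph", "osd", "rm", str(osd_id)],        # remove the OSD entry
    ]


if __name__ == "__main__":
    # Dry run: print the plan for a hypothetical osd.12.
    for argv in osd_removal_commands(12):
        print(" ".join(argv))
```

Printing the plan first makes it easy to double-check the OSD id before
anything touches the cluster.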


Fiber Channel hardware is much more complicated than SATA and SAS.  
There are a lot more parts involved, which leaves more room for bugs.

If you see this problem come back on the same disk, I'd replace the 
disk.  If you see this happen again to other disks, I would get your 
Fiber Channel vendor involved.  It wouldn't hurt to make sure you have 
the latest firmware on the disks, enclosure, and FC adapter.


-- 

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email clewis at centraldesktop.com
