Re: osd down

Most likely, the drive mapped to /dev/sdl1 is failing or has already
failed.  I suggest power cycling it to see whether the error clears.
If the drive comes back up, check the SMART stats to see whether
sectors are starting to get remapped.  It's also possible that this
was a transient error.
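
For reference, a quick way to pull those stats once the box is back up,
assuming smartmontools is installed and the disk still enumerates as
/dev/sdl, is:

  smartctl -H /dev/sdl    # overall health self-assessment
  smartctl -A /dev/sdl    # full attribute table

In the attribute table, climbing Reallocated_Sector_Ct,
Current_Pending_Sector, or Offline_Uncorrectable counts are the usual
signs that the drive is remapping sectors.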

Mike




On 11/6/14 5:06 PM, "Shain Miley" <SMiley@xxxxxxx> wrote:

>I tried restarting all the OSDs on that node; osd.70 was the only ceph
>process that did not come back online.
>
>There is nothing in the ceph-osd log for osd.70.
>
>However, I do see over 13,000 of these messages in kern.log:
>
>Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1):
>xfs_log_force: error 5 returned.
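>
>Error 5 is EIO, i.e. the block device itself returned an I/O error, and
>that message repeating usually means XFS has shut the filesystem down.
>A quick way to count the messages and look at the underlying device
>errors, assuming the drive is still /dev/sdl and the default kern.log
>path, is:
>
>  grep -c 'xfs_log_force' /var/log/kern.log
>  dmesg | grep -i sdl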
>
>Does anyone have any suggestions on how I might be able to get this HD
>back into the cluster (or whether it is even worth trying)?
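>
>A rough sequence for trying that, assuming the device and mount point
>above and that the osd.70 process is stopped, might be:
>
>  umount /var/lib/ceph/osd/ceph-70
>  xfs_repair -n /dev/sdl1     # dry run first: report, don't fix
>  mount /dev/sdl1 /var/lib/ceph/osd/ceph-70
>  service ceph start osd.70   # or the upstart equivalent on this host
>
>If the umount hangs or xfs_repair reports unrecoverable damage, the
>drive is probably not worth fighting.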
>
>Thanks,
>
>Shain
>
>Shain Miley | Manager of Systems and Infrastructure, Digital Media |
>smiley@xxxxxxx | 202.513.3649
>
>________________________________________
>From: Shain Miley [smiley@xxxxxxx]
>Sent: Tuesday, November 04, 2014 3:55 PM
>To: ceph-users@xxxxxxxxxxxxxx
>Subject: osd down
>
>Hello,
>
>We are running ceph version 0.80.5 with 108 OSDs.
>
>Today I noticed that one of the OSDs is down:
>
>root@hqceph1:/var/log/ceph# ceph -s
>     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>      health HEALTH_WARN crush map has legacy tunables
>      monmap e1: 3 mons at
>{hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,
>hqceph3=10.35.1.205:6789/0},
>election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>      osdmap e7119: 108 osds: 107 up, 107 in
>       pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
>             216 TB used, 171 TB / 388 TB avail
>                 3204 active+clean
>                    4 active+clean+scrubbing
>   client io 4079 kB/s wr, 8 op/s
>
>
>Using ceph osd dump, I determined that the down OSD is osd.70:
>
>osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
>last_clean_interval [488,2665) 10.35.1.217:6814/22440
>10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
>autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568
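>
>A shorter way to pick out just the down OSDs from an admin node is
>either of the following; the -w keeps grep from matching the down_at
>field in the dump output:
>
>  ceph osd dump | grep -w down
>  ceph osd tree | grep -w down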
>
>
>Looking at that node: the drive is still mounted, I did not see any
>errors in any of the system logs, and the RAID status shows the drive
>as up and healthy.
>
>
>root@hqosd6:~# df -h |grep 70
>/dev/sdl1       3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70
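>
>Note that XFS can shut a filesystem down after I/O errors while the
>mount still shows up in df, so a quick write test on the mount point
>(the file name here is arbitrary) says more than df does:
>
>  touch /var/lib/ceph/osd/ceph-70/write-test
>
>If that fails with an input/output error, the filesystem is shut down
>even though it still looks mounted.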
>
>
>I was hoping that someone might be able to advise me on the next course
>of action (can I add the OSD back in, or should I replace the drive
>altogether?).
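>
>For reference, if the drive does turn out to need replacing, the usual
>sequence for removing a dead OSD so a new one can take its place is
>roughly (with the ceph-osd daemon for osd.70 stopped first):
>
>  ceph osd out 70
>  ceph osd crush remove osd.70
>  ceph auth del osd.70
>  ceph osd rm 70
>
>after which the new drive can be prepared and added back as a fresh OSD.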
>
>I have attached the osd log to this email.
>
>Any suggestions would be great.
>
>Thanks,
>
>Shain
>
>--
>Shain Miley | Manager of Systems and Infrastructure, Digital Media |
>smiley@xxxxxxx | 202.513.3649

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



