Re: osd down

Yes, removing an OSD before re-creating it will give you the same OSD ID.  That's my preferred method, because it keeps the crushmap the same.  Only PGs that existed on the replaced disk need to be backfilled.

I don't know if adding the replacement to the same host then removing the old OSD gives you the same CRUSH map as the reverse.  I suspect not, because the OSDs are re-ordered on that host.
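
For reference, a rough sketch of the removal sequence I mean, assuming the dead disk is osd.70 (adjust the id, and stop the daemon however your init system does it):

ceph osd out 70                  # no-op if it's already marked out
stop ceph-osd id=70              # upstart; or: service ceph stop osd.70
ceph osd crush remove osd.70     # drop it from the CRUSH map
ceph auth del osd.70             # delete its cephx key
ceph osd rm 70                   # remove it from the osdmap; id 70 is free again

Once id 70 is free, the next ceph-deploy osd create (or ceph-disk prepare) should come up as osd.70 again, since the lowest free id gets handed out.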


On Mon, Nov 10, 2014 at 1:29 PM, Shain Miley <SMiley@xxxxxxx> wrote:
Craig,

Thanks for the info.

I ended up doing a zap and then a create via ceph-deploy.

One question that I still have is surrounding adding the failed osd back into the pool.

In this example...osd.70 was bad...when I added it back in via ceph-deploy...the disk was brought up as osd.108.

Only after osd.108 was up and running did I think to remove osd.70 from the crush map etc.

My question is this...had I removed it from the crush map prior to my ceph-deploy create...should/would Ceph have reused the osd number 70?

I would prefer to replace a failed disk with a new one and keep the old osd assignment...if possible that is why I am asking.

Anyway...thanks again for all the help.

Shain

Sent from my iPhone

On Nov 7, 2014, at 2:09 PM, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:

I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition.
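
Roughly like this, assuming the data partition really is /dev/sdl1 as in your kern.log (error 5 there is EIO, so the disk itself may be on its way out; smartctl -a /dev/sdl is worth a look too):

stop ceph-osd id=70              # or: service ceph stop osd.70
umount /var/lib/ceph/osd/ceph-70
xfs_repair -n /dev/sdl1          # -n = check only, report problems without fixing
xfs_repair /dev/sdl1             # then repair for real if it looks sane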

If you repair anything, you should probably force a deep-scrub on all the PGs on that disk.  I think ceph osd deep-scrub <osdid> will do that, but you might have to manually grep the output of ceph pg dump.
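
Something along these lines, roughly (the grep is a sketch that matches 70 anywhere in the bracketed up/acting sets; double-check the column layout of ceph pg dump on your version before trusting it):

ceph osd deep-scrub 70

# or per-PG, pulling the PG ids out of ceph pg dump:
ceph pg dump | grep -E '\[([0-9]+,)*70(,[0-9]+)*\]' | awk '{print $1}' | \
  while read pg; do ceph pg deep-scrub $pg; done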


Or you could just treat it like a failed disk, but re-use the disk. ceph-disk-prepare --zap-disk should take care of you.
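
Something like this, assuming the OSD's data disk really is /dev/sdl as in your kern.log (triple-check the device name first, since --zap-disk wipes the whole drive):

ceph-disk prepare --zap-disk /dev/sdl

# or the equivalent from the admin host with ceph-deploy:
ceph-deploy disk zap hqosd6:sdl
ceph-deploy osd create hqosd6:sdl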


On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley <SMiley@xxxxxxx> wrote:
I tried restarting all the OSDs on that node; osd.70 was the only ceph process that did not come back online.

There is nothing in the ceph-osd log for osd.70.

However, I do see over 13,000 of these messages in kern.log:

Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_force: error 5 returned.

Does anyone have any suggestions on how I might be able to get this HD back into the cluster (or whether it is even worth trying)?

Thanks,

Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smiley@xxxxxxx | 202.513.3649

________________________________________
From: Shain Miley [smiley@xxxxxxx]
Sent: Tuesday, November 04, 2014 3:55 PM
To: ceph-users@xxxxxxxxxxxxxx
Subject: osd down

Hello,

We are running Ceph version 0.80.5 with 108 OSDs.

Today I noticed that one of the OSDs is down:

root@hqceph1:/var/log/ceph# ceph -s
     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
      health HEALTH_WARN crush map has legacy tunables
      monmap e1: 3 mons at
{hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0},
election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
      osdmap e7119: 108 osds: 107 up, 107 in
       pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
             216 TB used, 171 TB / 388 TB avail
                 3204 active+clean
                    4 active+clean+scrubbing
   client io 4079 kB/s wr, 8 op/s


Using osd dump, I determined that it is osd number 70:

osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
last_clean_interval [488,2665) 10.35.1.217:6814/22440
10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
autoout,exists
5dbd4a14-5045-490e-859b-15533cd67568
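
A grep against ceph osd tree would have shown the same thing a bit more quickly (rough sketch; the exact output format varies by version):

ceph osd tree | grep -w down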


Looking at that node, the drive is still mounted, I did not see any errors in any of the system logs, and the RAID status shows the drive as up and healthy, etc.


root@hqosd6:~# df -h |grep 70
/dev/sdl1       3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70


I was hoping that someone might be able to advise me on the next course of action (can I add the osd back in, should I replace the drive altogether, etc.).

I have attached the osd log to this email.

Any suggestions would be great.

Thanks,

Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media |
smiley@xxxxxxx | 202.513.3649
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


