Re: osd down

Yes, removing an OSD before re-creating it will give you the same OSD ID.  That's my preferred method, because it keeps the crushmap the same.  Only PGs that existed on the replaced disk need to be backfilled.

I don't know if adding the replacement to the same host then removing the old OSD gives you the same CRUSH map as the reverse.  I suspect not, because the OSDs are re-ordered on that host.
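
For reference, a rough sketch of the removal sequence I mean, assuming the dead disk is osd.70 (adjust the id, and stop the daemon however your init system does it):

ceph osd out 70                  # no-op if it's already marked out
stop ceph-osd id=70              # upstart; or: service ceph stop osd.70
ceph osd crush remove osd.70     # drop it from the CRUSH map
ceph auth del osd.70             # delete its cephx key
ceph osd rm 70                   # remove it from the osdmap; id 70 is free again

Once id 70 is free, the next ceph-deploy osd create (or ceph-disk prepare) should come up as osd.70 again, since the lowest free id gets handed out.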


On Mon, Nov 10, 2014 at 1:29 PM, Shain Miley <SMiley@xxxxxxx> wrote:
Craig,

Thanks for the info.

I ended up doing a zap and then a create via ceph-deploy.

One question that I still have is surrounding adding the failed osd back into the pool.

In this example...osd.70 was bad...when I added it back in via ceph-deploy...the disk was brought up as osd.108.

Only after osd.108 was up and running did I think to remove osd.70 from the crush map etc.

My question is this...had I removed it from the crush map prior to my ceph-deploy create...should/would Ceph have reused the osd number 70?

I would prefer to replace a failed disk with a new one and keep the old osd assignment...if possible that is why I am asking.

Anyway...thanks again for all the help.

Shain

Sent from my iPhone

On Nov 7, 2014, at 2:09 PM, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:

I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition.
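
Roughly like this, assuming the data partition really is /dev/sdl1 as in your kern.log (error 5 there is EIO, so the disk itself may be on its way out; smartctl -a /dev/sdl is worth a look too):

stop ceph-osd id=70              # or: service ceph stop osd.70
umount /var/lib/ceph/osd/ceph-70
xfs_repair -n /dev/sdl1          # -n = check only, report problems without fixing
xfs_repair /dev/sdl1             # then repair for real if it looks sane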

If you repair anything, you should probably force a deep-scrub on all the PGs on that disk.  I think ceph osd deep-scrub <osdid> will do that, but you might have to manually grep the output of ceph pg dump.
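
Something along these lines, roughly (the grep is a sketch that matches 70 anywhere in the bracketed up/acting sets; double-check the column layout of ceph pg dump on your version before trusting it):

ceph osd deep-scrub 70

# or per-PG, pulling the PG ids out of ceph pg dump:
ceph pg dump | grep -E '\[([0-9]+,)*70(,[0-9]+)*\]' | awk '{print $1}' | \
  while read pg; do ceph pg deep-scrub $pg; done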


Or you could just treat it like a failed disk, but re-use the disk. ceph-disk-prepare --zap-disk should take care of you.
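
Something like this, assuming the OSD's data disk really is /dev/sdl as in your kern.log (triple-check the device name first, since --zap-disk wipes the whole drive):

ceph-disk prepare --zap-disk /dev/sdl

# or the equivalent from the admin host with ceph-deploy:
ceph-deploy disk zap hqosd6:sdl
ceph-deploy osd create hqosd6:sdl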


On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley <SMiley@xxxxxxx> wrote:
I tried restarting all the OSDs on that node; osd.70 was the only ceph process that did not come back online.

There is nothing in the ceph-osd log for osd.70.

However, I do see over 13,000 of these messages in kern.log:

Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_force: error 5 returned.

Does anyone have any suggestions on how I might be able to get this HD back into the cluster (or whether it is even worth trying)?

Thanks,

Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | smiley@xxxxxxx | 202.513.3649

________________________________________
From: Shain Miley [smiley@xxxxxxx]
Sent: Tuesday, November 04, 2014 3:55 PM
To: ceph-users@xxxxxxxxxxxxxx
Subject: osd down

Hello,

We are running Ceph version 0.80.5 with 108 OSDs.

Today I noticed that one of the OSDs is down:

root@hqceph1:/var/log/ceph# ceph -s
     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
      health HEALTH_WARN crush map has legacy tunables
      monmap e1: 3 mons at
{hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0},
election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
      osdmap e7119: 108 osds: 107 up, 107 in
       pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
             216 TB used, 171 TB / 388 TB avail
                 3204 active+clean
                    4 active+clean+scrubbing
   client io 4079 kB/s wr, 8 op/s


Using osd dump, I determined that it is osd number 70:

osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
last_clean_interval [488,2665) 10.35.1.217:6814/22440
10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
autoout,exists
5dbd4a14-5045-490e-859b-15533cd67568
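
A grep against ceph osd tree would have shown the same thing a bit more quickly (rough sketch; the exact output format varies by version):

ceph osd tree | grep -w down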


Looking at that node, the drive is still mounted, I did not see any errors in any of the system logs, and the RAID status shows the drive as up and healthy, etc.


root@hqosd6:~# df -h |grep 70
/dev/sdl1       3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70


I was hoping that someone might be able to advise me on the next course of action (can I add the osd back in, should I replace the drive altogether, etc.).

I have attached the osd log to this email.

Any suggestions would be great.

Thanks,

Shain

--
Shain Miley | Manager of Systems and Infrastructure, Digital Media |
smiley@xxxxxxx | 202.513.3649
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


