Re: Give up on backfill, remove slow OSD

On 22. sep. 2016 09:16, Iain Buclaw wrote:
Hi,

I currently have an OSD that has been backfilling data off it for a
little over two days now, and it's gone from approximately 68 PGs to
63.

As data is still being read from and written to it by clients whilst
I'm trying to get it out of the cluster, this is not helping at all.
I figured it's probably best to cut my losses and just force it out
entirely, so that all new writes and reads to those PGs get redirected
elsewhere to a functional disk, and the rest of the recovery can
proceed without being heavily blocked by this one disk.

Granted that objects and files have a 1:1 relationship, I can just
rsync the data to a new server and write it back into ceph afterwards.

Now, I know that as soon as I bring down this OSD, the entire cluster
will stop operating.  So what's the swiftest method of telling the
cluster to forget about this disk and everything that may be stored
on it?

Thanks



It should normally not be getting new writes if you are trying to remove it from the cluster, so I assume something went wrong here. How did you take the OSD out of the cluster?


Generally, my procedure for removing a working OSD goes something like this:
1. ceph osd crush reweight osd.X 0

2. ceph osd tree
   Check that the OSD in question actually has a weight of 0 (the first
   number after the ID) and that the host weight has been reduced accordingly.
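
   For example, a quick way to double-check just that one entry (a rough
   sketch, assuming the OSD id is 12; substitute your own id):

      ceph osd tree | grep 'osd\.12 '    # the WEIGHT column for osd.12 should now read 0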


3. ls /var/lib/ceph/osd/ceph-X/current ; periodically
   Wait for the OSD to drain: when it is empty there should be no PG
   directories (n.xxx_head or n.xxx_TEMP) left. This will take a while
   depending on the size of the OSD. In practice I just wait until the
   disk usage graph settles, then double-check with ls.
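
   If you do not have a disk usage graph handy, a small loop can do the
   waiting for you; a rough sketch, assuming OSD id 12 and the default
   /var/lib/ceph/osd/ceph-12 mount point:

      # poll until no PG directories (*_head / *_TEMP) remain on osd.12
      while ls /var/lib/ceph/osd/ceph-12/current | grep -qE '_(head|TEMP)$'; do
          echo "osd.12 still holds PGs, waiting..."
          sleep 300
      done
      echo "osd.12 looks drained"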

4. Once it is empty, I mark the OSD out, stop the process (both sketched
   below the command list), and remove the OSD from the cluster as described
   in the documentation:
 - ceph auth del osd.X
 - ceph osd crush remove osd.X
 - ceph osd rm osd.X
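
The mark-out and stop part is roughly the following (again a sketch for OSD
id 12; the stop command assumes a systemd host, use your init scripts
otherwise):

   ceph osd out 12               # mark it out so it stays out across restarts
   systemctl stop ceph-osd@12    # stop the daemon before removing the osd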



PS: if your cluster stops operating when an OSD goes down, something else is fundamentally wrong; you should look into that as a separate issue as well.
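
If you want to dig into that, a starting point could be checking the
replication settings of your pools and the cluster health while an OSD is
down; a sketch, assuming a pool named "data" (substitute your own pool names):

   ceph osd pool get data size       # number of replicas for the pool
   ceph osd pool get data min_size   # replicas needed before the pool accepts I/O
   ceph -s                           # overall cluster health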

kind regards
Ronny Aasen







