On 3 October 2016 at 07:30, Ronny Aasen <ronny+ceph-users@xxxxxxxx> wrote:
> On 22. sep. 2016 09:16, Iain Buclaw wrote:
>>
>> Hi,
>>
>> I currently have an OSD that has been backfilling data off it for a
>> little over two days now, and it's gone from approximately 68 PGs to
>> 63.
>>
>> As data is still being read from and written to it by clients whilst
>> I'm trying to get it out of the cluster, this is not helping at all.
>> I figured that it's probably best just to cut my losses and force it
>> out entirely, so that all new writes and reads to those PGs get
>> redirected elsewhere to a functional disk, and the rest of the
>> recovery can proceed without being blocked heavily by this one disk.
>>
>> Given that objects and files have a 1:1 relationship, I can just
>> rsync the data to a new server and write it back into ceph afterwards.
>>
>> Now, I know that as soon as I bring down this OSD, the entire cluster
>> will stop operating. So what's the swiftest method of telling the
>> cluster to forget about this disk and everything that may be stored
>> on it?
>>
>> Thanks
>>
>
> It should normally not get new writes if you want to remove it from
> the cluster. I assume you did something wrong here. How did you take
> the OSD out of the cluster?
>
> Generally my procedure for a working OSD is something like:
>
> 1. ceph osd crush reweight osd.X 0
>
> 2. ceph osd tree
>    Check that the OSD in question actually has 0 weight (first number
>    after ID) and that the host weight has been reduced accordingly.
>

This is what was done. However, it seems to take a very long time for
ceph to backfill millions of tiny objects, and the slow/bad SATA disk
only exacerbated the situation.

> 3. ls /var/lib/ceph/osd/ceph-X/current periodically.
>    Wait for the OSD to drain; there should be no PG directories
>    n.xxx_head or n.xxx_TEMP left. This will take a while depending on
>    the size of the OSD. In reality I just wait until the disk usage
>    graph settles, then double-check with ls.
>

With some of the OSDs, there were some PGs still left - probably orphaned
somehow in the confusion when rebalancing away from full disks. It's not
a problem for me though, as I just scanned the directories and rewrote
the files back into ceph (see the sketch further down). It's rather nice
to see that they all got written into the same PGs that I recovered them
from. So ceph is predictable in where it writes data; I wonder if I could
use that to my advantage somehow. :-)

> 4. Once empty, I mark the OSD out, stop the process, and remove the OSD
>    from the cluster as written in the documentation:
>    - ceph auth del osd.x
>    - ceph osd crush remove osd.x
>    - ceph osd rm osd.x
>

This is how to remove an OSD, not how to remove and recreate a PG. ;-)

> PS: if your cluster stops operating when an OSD goes down, you have
> something else fundamentally wrong. You should look into this as well,
> as a separate case.
>

osd pool default size = 1

I'm still trying to work out the best method of handling this. As I
understand it, if an OSD goes down, all requests to it get stuck in a
queue, and that slows down operation latency to the functional OSDs.

In any case, it eventually finished backfilling just over a week later,
and I managed to speed up the backfilling of the SSD disks by starting a
balance on the btrfs disk metadata, which freed up around 1.5 TB back to
ceph.
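
For reference, the balance was nothing exotic - roughly along these lines,
where the mount point and the usage filter are placeholders rather than
the exact invocation:

    # Rewrite only metadata block groups; the usage filter restricts it to
    # block groups that are at most half full, so it finishes faster.
    btrfs balance start -musage=50 /var/lib/ceph/osd/ceph-X

    # Watch progress and confirm how much space was handed back.
    btrfs balance status /var/lib/ceph/osd/ceph-X
    btrfs filesystem df /var/lib/ceph/osd/ceph-X

Compacting under-used metadata block groups returns their allocation to
the unallocated pool, which is the space ceph then gets to see as free.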
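
As for rewriting the orphaned objects mentioned earlier, it was
essentially a loop over the leftover files with rados. Treat this as a
sketch rather than a recipe: the pool name and PG path below are
placeholders, and recovering the object name from the on-disk filename is
an approximation - FileStore escapes names and appends a __head_<hash>
suffix, so the sed is a best guess at undoing that:

    POOL=mypool
    PGDIR=/var/lib/ceph/osd/ceph-X/current/0.1c3_head

    for f in "$PGDIR"/*; do
        # Strip the FileStore __head_<hash> suffix to get the object name back.
        obj=$(basename "$f" | sed 's/__head_.*//')

        # Write the object back into the pool.
        rados -p "$POOL" put "$obj" "$f"

        # Show which PG and OSDs CRUSH maps it to - this is how I noticed
        # they all landed back in the PGs they were recovered from.
        ceph osd map "$POOL" "$obj"
    done
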
Being blocked by backfill_toofull probably didn't help overall recovery
either, as the cluster had to juggle going from 30 full disks, to adding
15 temporary disks, then adding a further 8 when proper servers were made
available to handle the overflow, and finally removing the 15 temporaries.

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';