On 3 October 2016 at 07:30, Ronny Aasen <ronny+ceph-users@xxxxxxxx> wrote:
> On 22. sep. 2016 09:16, Iain Buclaw wrote:
>>
>> Hi,
>>
>> I currently have an OSD that has been backfilling data off it for a
>> little over two days now, and it's gone from approximately 68 PGs to
>> 63.
>>
>> As data is still being read from and written to it by clients whilst
>> I'm trying to get it out of the cluster, this is not helping at all.
>> I figured that it's probably best just to cut my losses and force it
>> out entirely, so that all new writes and reads to those PGs get
>> redirected elsewhere to a functional disk, and the rest of the
>> recovery can proceed without being blocked heavily by this one disk.
>>
>> Given that objects and files have a 1:1 relationship, I can just
>> rsync the data to a new server and write it back into ceph afterwards.
>>
>> Now, I know that as soon as I bring down this OSD, the entire cluster
>> will stop operating. So what's the swiftest method of telling the
>> cluster to forget about this disk and everything that may be stored
>> on it?
>>
>> Thanks
>>
>
> It should normally not get new writes if you want to remove it from
> the cluster. I assume you did something wrong here. How did you take
> the OSD out of the cluster?
>
> Generally my procedure for a working OSD is something like:
>
> 1. ceph osd crush reweight osd.X 0
>
> 2. ceph osd tree
>    Check that the OSD in question actually has 0 weight (first number
>    after ID) and that the host weight has been reduced accordingly.
>

This is what was done. However, it seems to take a very long time for
ceph to backfill millions of tiny objects, and the slow/bad SATA disk
only exacerbated the situation.

> 3. ls /var/lib/ceph/osd/ceph-X/current periodically.
>    Wait for the OSD to drain; there should be no PG directories
>    n.xxx_head or n.xxx_TEMP left. This will take a while depending on
>    the size of the OSD. In reality I just wait until the disk usage
>    graph settles, then double-check with ls.
>

With some of the OSDs, there were some PGs still left - probably orphaned
somehow in the confusion when rebalancing away from full disks. It's not
a problem for me though, as I just scanned the directories and rewrote
the files back into ceph (see the sketch further down). It's rather nice
to see that they all got written into the same PGs that I recovered them
from. So ceph is predictable in where it writes data; I wonder if I could
use that to my advantage somehow. :-)

> 4. Once empty, I mark the OSD out, stop the process, and remove the OSD
>    from the cluster as written in the documentation:
>    - ceph auth del osd.x
>    - ceph osd crush remove osd.x
>    - ceph osd rm osd.x
>

This is how to remove an OSD, not how to remove and recreate a PG. ;-)

> PS: if your cluster stops operating when an OSD goes down, you have
> something else fundamentally wrong. You should look into this as well,
> as a separate case.
>

osd pool default size = 1

I'm still trying to work out the best method of handling this. As I
understand it, if an OSD goes down, all requests to it get stuck in a
queue, and that slows down operation latency to the functional OSDs.

In any case, it eventually finished backfilling just over a week later,
and I managed to speed up the backfilling of the SSD disks by starting a
balance on the btrfs disk metadata, which freed up around 1.5 TB back to
ceph.
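
For reference, the balance was nothing exotic - roughly along these lines,
where the mount point and the usage filter are placeholders rather than
the exact invocation:

    # Rewrite only metadata block groups; the usage filter restricts it to
    # block groups that are at most half full, so it finishes faster.
    btrfs balance start -musage=50 /var/lib/ceph/osd/ceph-X

    # Watch progress and confirm how much space was handed back.
    btrfs balance status /var/lib/ceph/osd/ceph-X
    btrfs filesystem df /var/lib/ceph/osd/ceph-X

Compacting under-used metadata block groups returns their allocation to
the unallocated pool, which is the space ceph then gets to see as free.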
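
As for rewriting the orphaned objects mentioned earlier, it was
essentially a loop over the leftover files with rados. Treat this as a
sketch rather than a recipe: the pool name and PG path below are
placeholders, and recovering the object name from the on-disk filename is
an approximation - FileStore escapes names and appends a __head_<hash>
suffix, so the sed is a best guess at undoing that:

    POOL=mypool
    PGDIR=/var/lib/ceph/osd/ceph-X/current/0.1c3_head

    for f in "$PGDIR"/*; do
        # Strip the FileStore __head_<hash> suffix to get the object name back.
        obj=$(basename "$f" | sed 's/__head_.*//')

        # Write the object back into the pool.
        rados -p "$POOL" put "$obj" "$f"

        # Show which PG and OSDs CRUSH maps it to - this is how I noticed
        # they all landed back in the PGs they were recovered from.
        ceph osd map "$POOL" "$obj"
    done
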
Being blocked by backfill_toofull probably didn't help overall recovery
either, as the cluster had to juggle going from 30 full disks, to adding
15 temporary disks, then adding a further 8 when proper servers were made
available to handle the overflow, and finally removing the 15 temporaries.

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';