replacing an OSD or crush map sensitivity

Could I have a critique of this approach, please: how could I have done it better, or does what I experienced simply reflect work still to be done?

This is with Ceph 0.61.2 on a quite slow test cluster (logs shared with OSDs, no separate journals, using CephFS).

I knocked the power cord out of a storage node, taking down the 4 OSDs hosted there; all but one came back OK. That is one OSD out of a total of 12, so 1/12 of the storage.

Losing an OSD put the cluster into recovery, so far so good. The next task was getting the missing (downed) OSD back online.
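
For reference, this is roughly how I identified which OSD was still down (osd.10 is just the id of the bad OSD in my case):

    # overall status plus details of which OSDs/PGs are unhappy
    ceph health detail
    # shows the OSD marked "down" under its host in the tree
    ceph osd tree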

The OSD was xfs-based, and I had to throw away the xfs log to get the filesystem to mount. Having done that and re-mounted it, the OSD then started hitting issue #4855 (I added dmesg output and logs to that issue in case they help - I wonder whether throwing away the xfs log caused an internal inconsistency in the OSD, and whether that is what triggers issue #4855). Since I could not "recover" this OSD as far as Ceph is concerned, I decided to delete and rebuild it.
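
For the record, "throw away the xfs log" means I ran something along these lines against the OSD's data partition (device and mount point are just examples from my setup):

    # zero the xfs log, discarding whatever was in it, then remount
    xfs_repair -L /dev/sdX1
    mount /dev/sdX1 /var/lib/ceph/osd/ceph-10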

Several hours later the cluster was back to HEALTH_OK, and I proceeded to remove and re-add the bad OSD, following the documented steps.
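
Roughly, the sequence I followed (from the add/remove OSD docs, reconstructed from my shell history, so treat the details as approximate - 10 is the id of the bad OSD):

    # remove the dead OSD from the cluster
    ceph osd out 10
    ceph osd crush remove osd.10
    ceph auth del osd.10
    ceph osd rm 10

    # recreate it ("ceph osd create" hands back the next free id)
    ceph osd create
    ceph-osd -i 10 --mkfs --mkkey
    ceph auth add osd.10 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-10/keyring
    # ...then add it back into the crush map (see the host= note further down)
    # and start the daemon again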

The problem is that each change caused a slight change to the crush map, which sent the cluster back into recovery and added several hours of waiting per step (I chose to wait until the cluster was back to HEALTH_OK before doing the next step). Overall it has taken a few days to finally get a single OSD back into the cluster.

At one point during recovery the full threshold was triggered on a single OSD, causing recovery to stop; running "ceph pg set_full_ratio 0.98" did not help. I was not planning to add data to the cluster while doing recovery operations, and I did not understand the suggestion that PGs could be deleted to make space on a "full" OSD, so I expected raising the threshold to be the best option - but it had no (immediate) effect.
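
For anyone else who hits the full threshold mid-recovery, this is what I used to see which OSD had tripped it, and what I ran to try to raise it (with no immediate effect, as noted above):

    # reports which OSD is full/near-full and at what percentage
    ceph health detail
    # raise the cluster-wide full threshold
    ceph pg set_full_ratio 0.98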

I am now back to having all 12 OSDs in, with the (hopefully) final recovery under way while the data re-balances across the OSDs. I note I am still getting the full-OSD warning, but I expect it to disappear soon now that the 12th OSD is back online.

During this recovery the degraded percentage has been a little confusing. While the 12th OSD was offline the figure was around 15-20% IIRC, but now I see it at 35% and slowly dropping. I am not sure I understand the ratios, or why they are so high with only a single OSD missing.

A few documentation errors caused confusion too.

This page still contains errors in the steps to create a new OSD (manually):

http://eu.ceph.com/docs/wip-3060/cluster-ops/add-or-rm-osds/#adding-an-osd-manual

"ceph osd create {osd-num}" should be "ceph osd create"


and this:

http://eu.ceph.com/docs/wip-3060/cluster-ops/crush-map/#addosd

On this page, I had to add host= before the command would be accepted.
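
The form that was eventually accepted for me was along these lines (names are from my setup, and the exact argument order seems to vary between releases, so double-check against your version):

    # the important part is the trailing host= (and a weight)
    ceph osd crush set osd.10 1.0 host=mystoragenode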

Suggestions and questions:

1. Is there a way to get documentation pages fixed, or at least to put health warnings on them ("This page badly needs updating since it is wrong/misleading")?

2. We need a small set of definitive, succinct recipes that provide the steps to recover from common failures, with a narrative around what to expect at each step ("your cluster will be in recovery here...").

3. Some commands throw errors that are actually benign: "ceph-osd -i 10 --mkfs --mkkey" complains about failures that are expected, since the OSD is initially empty.

4. An easier way to capture the state of the cluster for analysis. I don't feel confident that, when asked for "logs", I am giving the most useful snippets or the complete story. It seems we need a tool that gathers all of this into a neat bundle for later dissection or forensics.

5. Is there a more straightforward (faster) way of getting an OSD back online? It almost seems worth having a standby OSD ready to step in and assume duties (a hot spare?).

6. Is there a way to make the crush map less sensitive to changes during recovery operations? I would have liked to stall/slow recovery while I replaced the OSD, then let it run at full speed afterwards (see the sketch after this list for what I plan to try).
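
For (6), the closest thing I have found so far is the noout flag, which as I understand it stops a down OSD from being marked out (and so avoids the resulting data movement) while it is being worked on - untested by me so far, but this is what I plan to try next time:

    # before taking the OSD down / while rebuilding it
    ceph osd set noout
    # ... replace the OSD ...
    ceph osd unset noout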

Excuses:

I'd be happy to act on suggestions, but my current level of Ceph understanding is still limited enough that effort on my part would be unproductive; I am prodding the community to see if there is consensus on the need.







