On 02/07/2012 06:54 PM, Henry C Chang wrote:
Hi all,

I did some experiments on the osd and have some questions about it. I removed one object directly from the osd data store. As expected, the osd didn't notice it until I manually scrubbed the pg. However, the scrubbing doesn't trigger recovery automatically; I had to run 'ceph pg repair' to fix it. So, my first question is: can recovery be triggered automatically once scrubbing has detected the inconsistency?
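[For readers following along: the detection step described above amounts to comparing the set of objects a pg is expected to hold against what is actually on disk. A minimal sketch of that idea, with illustrative names rather than Ceph's real data structures:]

```python
# Sketch of scrub-style detection (illustrative, not Ceph code):
# diff the objects a pg should hold against what is on disk.

def scrub(expected_objects, on_disk_objects):
    """Return the objects that are expected but missing on disk."""
    return set(expected_objects) - set(on_disk_objects)

# 'obj2' was removed out-of-band, so scrub flags it as inconsistent.
missing = scrub({"obj1", "obj2", "obj3"}, {"obj1", "obj3"})
assert missing == {"obj2"}
```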
It's possible to do what the current repair code does automatically, but this would be a bad idea, since it just takes the first osd to have the object (primary before replicas) as authoritative and copies its version to all the relevant osds. If the primary has a corrupt copy, this corruption will spread to the other osds. In your case, since you removed the object entirely, repair could correct it. In general, though, if an object is corrupted, there's no way right now to tell which copy is correct. You could use btrfs checksumming underneath the osd to protect against this, but the osds don't checksum the objects themselves. Scrub/repair could certainly be a lot smarter. It's been on the todo list for a while, but we haven't gotten to it yet.
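[To make the hazard concrete, here is a small sketch, not Ceph code, contrasting the "first holder wins" behavior described above with what per-object checksums would allow. All names here are hypothetical:]

```python
import hashlib

def naive_repair(copies):
    """Current behavior described above: the first holder of the
    object (primary before replicas) is treated as authoritative,
    even if its copy is corrupt."""
    return copies[0]

def checksummed_repair(copies, expected_digest):
    """If a checksum were stored per object, repair could pick any
    copy whose digest matches, regardless of which osd holds it."""
    for data in copies:
        if hashlib.sha256(data).hexdigest() == expected_digest:
            return data
    return None  # no intact copy survives; repair must fail loudly

good = b"object payload"
digest = hashlib.sha256(good).hexdigest()
copies = [b"corrupted!!", good, good]  # primary's copy is corrupt

assert naive_repair(copies) == b"corrupted!!"      # corruption would spread
assert checksummed_repair(copies, digest) == good  # checksum finds a good copy
```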
Then, I tried again and removed another object. This time, I didn't scrub the pg; I just restarted the osd. As expected, the osd didn't notice it either. My second question is: is it possible to check for the existence of the objects when scanning the pg during osd startup? Does it make sense to do so?
Detecting missing objects on startup is possible by looking at the pg log and comparing it to the objects on disk, but this can be a pretty expensive operation. The osd might also be out of date, so its log might be useless (for example, it could have divergent history that was never acked). It can't know which objects that should be there are missing until it goes through peering (to get an up-to-date and authoritative log) and recovery (to get the missing data the logs say should be there). This is why scrub skips pgs that aren't active+clean. More details on peering can be found at http://ceph.newdream.net/docs/latest/dev/peering/.
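[The startup check described above would mean replaying the pg log to derive the expected object set and diffing it against disk. A sketch with hypothetical structures (not Ceph's), which also shows why the result is only trustworthy if the log is authoritative:]

```python
# Sketch (hypothetical, not Ceph code): derive the expected object
# set from a pg log and diff it against what is on disk. A freshly
# restarted osd cannot trust this result before peering, because its
# log may be stale or contain divergent, never-acked history.

def expected_from_log(log_entries):
    """Replay (op, object) entries to compute which objects should
    exist at the head of the log."""
    expected = set()
    for op, obj in log_entries:
        if op == "write":
            expected.add(obj)
        elif op == "delete":
            expected.discard(obj)
    return expected

log = [("write", "a"), ("write", "b"), ("delete", "a"), ("write", "c")]
on_disk = {"b"}  # 'c' was removed out-of-band

missing = expected_from_log(log) - on_disk
assert missing == {"c"}
```

Note that the cost comes from walking the entire log and listing every object in the pg, which is why doing this on every startup would be expensive.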