OK, I have now managed to return the cluster to a healthy state, but by
using the version of the object from OSDs 0 and 2 rather than the one
from OSD 1 that I had originally wanted to keep. I set the "noout" flag
and shut down OSD 1. That appears to have made the cluster happy to use
the version of the object that was present on the other two OSDs. Then,
after starting OSD 1 back up, their version was replicated back onto
OSD 1, so there are no more inconsistencies or unfound objects.
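For the record, the sequence I used was roughly the following; the exact
way of stopping and starting the OSD daemon depends on the init system,
so the stop/start lines below are just an example rather than a recipe:

# ceph osd set noout
# /etc/init.d/ceph stop osd.1        (on the host running OSD 1)
  ... wait for pg 2.3b to settle with OSD 1 out of the acting set ...
# /etc/init.d/ceph start osd.1
# ceph osd unset noout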
I had noticed that the object in question corresponded to the first 4 MB
of a logical volume within the VM that is used for its root filesystem
(which is btrfs). After comparing the content with the equivalent
location on disk on some other, similar VMs, I started suspecting that
the "extra data" in OSD 1's copy of the object was superfluous anyway. I
have now restarted the VM that owns the RBD, and it was at least quite
happy to mount the filesystem, so I'm hoping all is well...
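In case it helps anyone else, mapping the object name to a location in
the image is straightforward, assuming the default 4 MB object size: the
suffix of the object name, 0000000000000100, is the object index in hex,
i.e. 0x100 = 256, so the object covers byte offset 256 * 4 MB = 1 GB
into the RBD image. Inside a comparable VM the same region can then be
read with something like this (the device name is just an example):

# dd if=/dev/vda bs=4M skip=256 count=1 | md5sum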
Alex
On 03/05/2015 12:55 PM, Alex Moore wrote:
Hi all, I need some help getting my 0.87.1 cluster back into a healthy
state...
Overnight, a deep scrub detected an inconsistent object in one of my
PGs. "ceph health detail" said the following:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 2.3b is active+clean+inconsistent, acting [1,2,0]
2 scrub errors
And these were the corresponding errors from the log:
2015-05-03 02:47:27.804774 6a8bc3f1e700 -1 log_channel(default) log
[ERR] : 2.3b shard 1: soid
c886da7b/rbd_data.25212ae8944a.0000000000000100/head//2 digest
1859582522 != known digest 2859280481, size 4194304 != known size 1642496
2015-05-03 02:47:44.099475 6a8bc3f1e700 -1 log_channel(default) log
[ERR] : 2.3b deep-scrub stat mismatch, got 655/656 objects, 0/0
clones, 655/656 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
2685746176/2689940480 bytes,0/0 hit_set_archive bytes.
2015-05-03 02:47:44.099496 6a8bc3f1e700 -1 log_channel(default) log
[ERR] : 2.3b deep-scrub 0 missing, 1 inconsistent objects
2015-05-03 02:47:44.099501 6a8bc3f1e700 -1 log_channel(default) log
[ERR] : 2.3b deep-scrub 2 errors
I located the inconsistent object on disk on the 3 OSDs (and have saved
copies of all three). The copies on OSDs 0 and 2 match each other and
have the supposedly "known size" of 1642496. The copy on OSD 1 (the
primary) has additional data appended and a size of 4194304. The content
of the portion of the file that exists on OSDs 0 and 2 is identical on
OSD 1; OSD 1's copy just has extra data on the end as well.
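(For anyone wondering: the object is just a regular file under each
OSD's FileStore data directory, so finding it is a matter of something
like the following on each OSD host, although the exact path and the
escaping in the on-disk filename may differ on your setup:

# find /var/lib/ceph/osd/ceph-1/current/2.3b_head -name '*25212ae8944a.0000000000000100*'

and then ls -l / md5sum on the result to compare sizes and content.)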
As this is part of an RBD (used by a Linux VM, with a filesystem on
top), I reasoned that if the "extra data" in OSD 1's copy of the object
is not supposed to be there, then it almost certainly maps to an
unallocated part of the filesystem within the VM, so having the extra
data isn't going to do any harm. I therefore want to stick with the
version on OSD 1 (the primary).
I then ran "ceph pg repair 2.3b", as my understanding was that this
should replace the copies of the object on OSDs 0 and 2 with the one
from the primary OSD, achieving what I want and removing the
inconsistency. However, that doesn't seem to be what happened!
Instead, I now have 1 unfound object (the same object that had
previously been reported as inconsistent), and some IO is now being
blocked:
# ceph health detail
HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; 1 requests are
blocked > 32 sec; 1 osds have slow requests; recovery -1/1747956
objects degraded (-0.000%); 1/582652 unfound (0.000%)
pg 2.3b is stuck unclean for 533.238307, current state
active+recovering, last acting [1,2,0]
pg 2.3b is active+recovering, acting [1,2,0], 1 unfound
1 ops are blocked > 524.288 sec
1 ops are blocked > 524.288 sec on osd.1
1 osds have slow requests
recovery -1/1747956 objects degraded (-0.000%); 1/582652 unfound (0.000%)
# ceph pg 2.3b list_missing
{ "offset": { "oid": "",
"key": "",
"snapid": 0,
"hash": 0,
"max": 0,
"pool": -1,
"namespace": ""},
"num_missing": 1,
"num_unfound": 1,
"objects": [
{ "oid": { "oid": "rbd_data.25212ae8944a.0000000000000100",
"key": "",
"snapid": -2,
"hash": 3364280955,
"max": 0,
"pool": 2,
"namespace": ""},
"need": "1216'8088646",
"have": "0'0",
"locations": []}],
"more": 0}
However, all 3 OSDs do still have the corresponding file on disk, with
the same content it had when I first looked at it. I can only assume
that, because the data in the object on the primary OSD didn't match the
"known size", Ceph decided when I issued the "repair" to invalidate the
copy of the object on the primary OSD rather than use it as the
authoritative version, and now believes it has no good copies of the
object.
How can I persuade Ceph to just go ahead and use the version of
rbd_data.25212ae8944a.0000000000000100 that is already on-disk on OSD
1, and push it out to OSDs 0 and 2? Surely there is a way to do that!
Thanks in advance!
Alex
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com