OK, I have now managed to return the cluster to a healthy state, but by
using the version of the object from OSDs 0 and 2 rather than the one
from OSD 1 that I had originally wanted to keep. I set the "noout" flag
and shut down OSD 1. That appears to have made the cluster happy to use
the version of the object that was present on the other two OSDs. Then,
after starting OSD 1 back up, their version was replicated back onto
OSD 1, so there are no more inconsistencies or unfound objects.
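For the record, the sequence I used was roughly the following; the exact
way of stopping and starting the OSD daemon depends on the init system,
so the stop/start lines below are just an example rather than a recipe:

# ceph osd set noout
# /etc/init.d/ceph stop osd.1        (on the host running OSD 1)
  ... wait for pg 2.3b to settle with OSD 1 out of the acting set ...
# /etc/init.d/ceph start osd.1
# ceph osd unset noout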
I had noticed that the object in question corresponded to the first 4 MB
of a logical volume within the VM that is used for its root filesystem
(which is btrfs). After comparing the content with the equivalent
location on disk on some other, similar VMs, I started suspecting that
the "extra data" in OSD 1's copy of the object was superfluous anyway. I
have now restarted the VM that owns the RBD, and it was at least quite
happy to mount the filesystem, so I'm hoping all is well...
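In case it helps anyone else, mapping the object name to a location in
the image is straightforward, assuming the default 4 MB object size: the
suffix of the object name, 0000000000000100, is the object index in hex,
i.e. 0x100 = 256, so the object covers byte offset 256 * 4 MB = 1 GB
into the RBD image. Inside a comparable VM the same region can then be
read with something like this (the device name is just an example):

# dd if=/dev/vda bs=4M skip=256 count=1 | md5sum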
Alex
On 03/05/2015 12:55 PM, Alex Moore wrote:
Hi all, I need some help getting my 0.87.1 cluster back into a healthy
state...
Overnight, a deep scrub detected an inconsistent object in one of my
PGs. "ceph health detail" said the following:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 2.3b is active+clean+inconsistent, acting [1,2,0]
2 scrub errors
And these were the corresponding errors from the log:
2015-05-03 02:47:27.804774 6a8bc3f1e700 -1 log_channel(default) log
[ERR] : 2.3b shard 1: soid
c886da7b/rbd_data.25212ae8944a.0000000000000100/head//2 digest
1859582522 != known digest 2859280481, size 4194304 != known size 1642496
2015-05-03 02:47:44.099475 6a8bc3f1e700 -1 log_channel(default) log
[ERR] : 2.3b deep-scrub stat mismatch, got 655/656 objects, 0/0
clones, 655/656 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
2685746176/2689940480 bytes,0/0 hit_set_archive bytes.
2015-05-03 02:47:44.099496 6a8bc3f1e700 -1 log_channel(default) log
[ERR] : 2.3b deep-scrub 0 missing, 1 inconsistent objects
2015-05-03 02:47:44.099501 6a8bc3f1e700 -1 log_channel(default) log
[ERR] : 2.3b deep-scrub 2 errors
I located the inconsistent object on disk on the 3 OSDs (and have saved
copies of all three). The copies on OSDs 0 and 2 match each other and
have the supposedly "known size" of 1642496. The copy on OSD 1 (the
primary) has additional data appended and a size of 4194304. The content
of the portion of the file that exists on OSDs 0 and 2 is identical on
OSD 1; OSD 1's copy just has extra data on the end as well.
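(For anyone wondering: the object is just a regular file under each
OSD's FileStore data directory, so finding it is a matter of something
like the following on each OSD host, although the exact path and the
escaping in the on-disk filename may differ on your setup:

# find /var/lib/ceph/osd/ceph-1/current/2.3b_head -name '*25212ae8944a.0000000000000100*'

and then ls -l / md5sum on the result to compare sizes and content.)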
As this is part of an RBD (used by a Linux VM, with a filesystem on
top), I reasoned that if the "extra data" in OSD 1's copy of the object
is not supposed to be there, then it almost certainly maps to an
unallocated part of the filesystem within the VM, so having the extra
data isn't going to do any harm. I therefore want to stick with the
version on OSD 1 (the primary).
I then ran "ceph pg repair 2.3b", as my understanding was that this
should replace the copies of the object on OSDs 0 and 2 with the one
from the primary OSD, achieving what I want and removing the
inconsistency. However, that doesn't seem to be what happened!
Instead, I now have 1 unfound object (the same object that had
previously been reported as inconsistent), and some IO is now being
blocked:
# ceph health detail
HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; 1 requests are
blocked > 32 sec; 1 osds have slow requests; recovery -1/1747956
objects degraded (-0.000%); 1/582652 unfound (0.000%)
pg 2.3b is stuck unclean for 533.238307, current state
active+recovering, last acting [1,2,0]
pg 2.3b is active+recovering, acting [1,2,0], 1 unfound
1 ops are blocked > 524.288 sec
1 ops are blocked > 524.288 sec on osd.1
1 osds have slow requests
recovery -1/1747956 objects degraded (-0.000%); 1/582652 unfound (0.000%)
# ceph pg 2.3b list_missing
{ "offset": { "oid": "",
"key": "",
"snapid": 0,
"hash": 0,
"max": 0,
"pool": -1,
"namespace": ""},
"num_missing": 1,
"num_unfound": 1,
"objects": [
{ "oid": { "oid": "rbd_data.25212ae8944a.0000000000000100",
"key": "",
"snapid": -2,
"hash": 3364280955,
"max": 0,
"pool": 2,
"namespace": ""},
"need": "1216'8088646",
"have": "0'0",
"locations": []}],
"more": 0}
However, all 3 OSDs do still have the corresponding file on disk, with
the same content it had when I first looked at it. I can only assume
that, because the data in the object on the primary OSD didn't match the
"known size", Ceph decided when I issued the "repair" to invalidate the
copy of the object on the primary OSD rather than use it as the
authoritative version, and now believes it has no good copies of the
object.
How can I persuade Ceph to just go ahead and use the version of
rbd_data.25212ae8944a.0000000000000100 that is already on-disk on OSD
1, and push it out to OSDs 0 and 2? Surely there is a way to do that!
Thanks in advance!
Alex
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com