Re: Fixing inconsistency

Hi!

>I think the only time we've seen this was when there was some kind of
>XFS corruption that accidentally extended the size of the file on
>disk, and the object info was correct with its shorter size. But
>perhaps not, in which case I've no idea how this could have happened.

We use the ext4 filesystem, so it is not the issue you wrote about.
And I think I know what led us to the consistency errors: some
experiments we ran during performance optimization.
While trying to improve performance, especially for 4k random access,
we changed some parameters. The 'async' messenger greatly improves
latency and lowers CPU load, but in Hammer it is rather buggy and not suited
for production. In a test environment it works for some time, though.
We enabled the async messenger on one OSD (osd.45). Unfortunately, it crashed with
a core dump after ~10 seconds of running, because the cluster is constantly
under a ~3 kIOPS load. The next deep scrub found 4 errors in 3 PGs;
for all of them the object size was smaller than the on-disk file size,
and for all of the broken PGs osd.45 was the primary OSD.


>I don't think RBD stores any data about individual data objects that
>could be broken this way, unless there are any new xattrs you need to
>keep consistent.

Yes, you're right, RBD is quite fixable ;)

We successfully repaired our cluster using the method described below:

1. Gather the bad object names and their sizes (in metadata and on disk) from the ceph logs.
For example, from this log line (split into three parts for readability):

    2015-11-14 10:56:59.649777 osd.45 node48:6801/3022 137 : 
    cluster [ERR] deep-scrub 6.5ed d6d595ed/rbd_data.15524c22ae8944a.0000000000018313/head//6 
    on disk size (4194304) does not match object info size (1921024) adjusted for ondisk to (1921024)

we can get the object name (rbd_data.15524c22ae8944a.0000000000018313), the PG (6.5ed),
the primary OSD (45), the host (node48), the on-disk size (4194304) and the object size
stored in the ceph metadata (1921024).
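
As a rough example of collecting all such error lines at once (assuming the cluster log
lives at /var/log/ceph/<clustername>.log on a monitor host; the exact path may differ):

    # hedged sketch: scan the cluster log for deep-scrub size mismatches
    grep 'does not match object info size' /var/log/ceph/<clustername>.log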

2. From 'ceph health detail' we can get the other OSDs which contain the replicas:

    pg 6.5ed is active+clean+inconsistent, acting [45,75,87]

In our case, the second and third replicas are on osd.75 and osd.87. From the CRUSH map
we find the nodes that contain osd.75 and osd.87; in our case these are node26 and node20.
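
If there are many inconsistent PGs, listing them all at once may be easier; a small sketch
(the cluster name is the same placeholder we use throughout):

    ceph --cluster <clustername> health detail | grep inconsistent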

3. Check which RBD volume can be affected by issuing the 'rbd info' command.
From its output we can get the RBD prefix and compare it with our corrupted objects.
Sadly, we could not find a direct way to get the RBD volume by object prefix, so we had
to write a simple script that runs 'rbd ls' for the pool and then iterates through the
volumes, running 'rbd info' and comparing the object prefix of each RBD volume with the
prefix of our corrupted objects (a sketch is shown below). In our case, the prefix
'rbd_data.15524c22ae8944a' belongs to a 3TB rbd volume containing user profiles and file
storage for Windows; it was a GPT disk, formatted with NTFS. To exclude the possibility of
filesystem corruption in Windows, due to a conflict between the VM's cache and the actual
RBD volume content we will change during the repair/rewrite, we powered off the VM, so
nobody used the RBD volume during the repair.
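
A minimal sketch of such a script (the pool name is a placeholder; we assume 'rbd info'
prints a 'block_name_prefix:' line containing the prefix, as it does in our Hammer release):

    #!/bin/sh
    # find which RBD volume owns a given object prefix
    PREFIX=rbd_data.15524c22ae8944a      # prefix taken from the corrupted object names
    POOL=<poolname>
    for vol in $(rbd -p "$POOL" ls); do
        if rbd -p "$POOL" info "$vol" | grep -q "$PREFIX"; then
            echo "prefix $PREFIX belongs to volume $vol"
        fi
    done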

4. Gather all three replicas as files from the hosts. We use 'find' to search for files under
'/var/ceph/osd/<clustername>-<osdnum>' on the previously found hosts and OSDs.
Note that the character representing the underscore differs between the object name (in the
logs) and the file name, so it is better to search by a partial object name, omitting the
'rbd_data' part. To ensure consistency we need to compare the objects (or compare their
md5/sha1 hashes). In our case all three replicas were the same.
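
For example, on node48 (osd.45) the search and the hash check look roughly like this; the
OSD data path is the one from our setup, and <replica-file> is a placeholder:

    find /var/ceph/osd/<clustername>-45/ -name '*15524c22ae8944a.0000000000018313*'
    md5sum <replica-file>      # run on each of the three hosts and compare the hashes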

5. Get the object from ceph into the file 'object-before-repair' via rados and check its size:

    rados -p <poolname> get rbd_data.15524c22ae8944a.0000000000018313 ./object-before-repair

Copy one of the files from step 4, truncate the copy to the size from the metadata (1921024)
and compare it with the object we just got from ceph. In our case they are the same,
and that's good! So only the ceph metadata is inconsistent and the object content
is not damaged.
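
A sketch of this check (the copy name and <replica-file> are our own placeholders; 1921024
is the object size from the ceph metadata):

    cp <replica-file> ./replica-truncated
    truncate -s 1921024 ./replica-truncated
    cmp ./replica-truncated ./object-before-repair && echo "contents match"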

6. Put one of the files ('file-object') we got at step 4 back into ceph as an object with the same name:

    rados -p <poolname> put rbd_data.15524c22ae8944a.0000000000018313 ./file-object

7. Get the object back again and check its size and content:

    rados -p <poolname> get rbd_data.15524c22ae8944a.0000000000018313 ./object-after-repair

The size of ./object-after-repair has to be 4194304 and its content has to be the same as ./file-object.
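
For example:

    ls -l ./object-after-repair       # size should be 4194304
    cmp ./object-after-repair ./file-object && echo "contents match"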

8. Run 'ceph --cluster <clustername> pg repair 6.5ed' and check results. 
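
One way to check the results is to re-run a deep scrub on the PG and look at the health again:

    ceph --cluster <clustername> pg deep-scrub 6.5ed
    ceph --cluster <clustername> health detail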

We repeated this procedure for all damaged objects and the cluster became healthy!


Megov Igor
CIO, Yuterra

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


