Re: ZFS or BTRFS for performance?

Hi,

On 20/03/2016 15:23, Francois Lafont wrote:
> Hello,
>
> On 20/03/2016 04:47, Christian Balzer wrote:
>
>> That's not protection, that's an "uh-oh, something is wrong, you better
>> check it out" notification, after which you get to spend a lot of time
>> figuring out which is the good replica 
> In fact, I have never been confronted with this case so far and I have a
> couple of questions.
>
> 1. When it happens (i.e. a deep scrub fails), is it mentioned in the output
> of the "ceph status" command and, in this case, can you confirm to me
> that the health of the cluster in the output is different from "HEALTH_OK"?

Yes. This is obviously a threat to your data so the cluster isn't
HEALTH_OK (HEALTH_WARN IIRC).
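
For reference, "ceph health detail" lists the affected PGs. I'm not sure
off-hand whether it shows up as HEALTH_WARN or HEALTH_ERR on your version,
and the exact wording changes between releases, but it looks roughly like
this (reusing the PG/OSD ids from your example):

    ~# ceph health detail
    HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
    pg 19.10 is active+clean+inconsistent, acting [1,6,12]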

>
> 2. For instance, suppose it happens with PG id == 19.10 and I have 3 OSDs
> for this PG (because my pool has replica size == 3), say the OSDs with
> id == 1, 6 and 12. Can you tell me if this "naive" method is valid to
> solve the problem (and, if not, why)?
>
>     a) I ssh into the node which hosts osd-1 and launch this command:
>         ~# id=1 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
>         055b0fd18cee4b158a8d336979de74d25fadc1a3  -
>
>     b) I ssh into the node which hosts osd-6 and launch this command:
>         ~# id=6 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
>         055b0fd18cee4b158a8d336979de74d25fadc1a3  -
>
>     c) I ssh into the node which hosts osd-12 and launch this command:
>         ~# id=12 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
>         3f786850e387550fdab836ed7e6dc881de23001b -

You may get 3 different hashes simply because of concurrent writes on the
PG. So you may have to rerun your commands, ideally launching them at the
same time on all nodes to limit this problem. If you have constant heavy
writes on all your PGs this will probably never give a useful result.
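
For what it's worth, a rough sketch of launching them at (nearly) the same
time from an admin box, assuming passwordless ssh and made-up hostnames
host1/host2/host3 for the nodes carrying osd.1, osd.6 and osd.12:

    ~# for o in 1:host1 6:host2 12:host3; do
         id=${o%%:*}; host=${o#*:}
         ssh $host "sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* \
           | sed 's|/ceph-$id/|/ceph-id/|' | sha1sum" | sed "s/^/osd.$id: /" &
       done; wait

The "osd.N:" prefix just tells you which line came from which OSD; the
order of the output lines may vary.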

>
>     I notice that the result is different for osd-12, so it's the "bad" osd.
>     So, on the node which hosts osd-12, I launch this command:
>
>         id=12 && rm /var/lib/ceph/osd/ceph-$id/current/19.10_head/*

You should stop the OSD, flush its journal, and only then delete the
files, before restarting the OSD.
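
Something like this, assuming osd.12 (the service commands depend on your
distro/init system, so adapt them; setting noout avoids data movement
while the OSD is down):

    ~# ceph osd set noout
    ~# stop ceph-osd id=12        # upstart; with systemd: systemctl stop ceph-osd@12
    ~# ceph-osd -i 12 --flush-journal
    ~# rm /var/lib/ceph/osd/ceph-12/current/19.10_head/*
    ~# start ceph-osd id=12       # with systemd: systemctl start ceph-osd@12
    ~# ceph osd unset noout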

>     And now I can safely launch this command:
>
>         ceph pg repair 19.10
>
> Is there a problem with this "naive" method?

It is probably overkill (and may not work, see above). Usually you can
find the exact file in this directory that differs and should be deleted
(see the link in my previous post). I believe that if the offending file
isn't on the primary OSD you can launch the repair command directly.
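
The deep-scrub errors are logged by the OSDs (the primary reports the
mismatching shard), so something along these lines should point you at the
offending object; the log path and message format may differ on your
version:

    ~# ceph pg deep-scrub 19.10      # re-run the scrub if the old log entries are gone
    ~# grep ERR /var/log/ceph/ceph-osd.1.log | grep '19\.10'

Once you know which object differs, delete only that file under
19.10_head on the bad OSD (using the stop / flush-journal / restart
sequence above) and then run "ceph pg repair 19.10".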

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



