Re: PGs inconsistent, do I fear data loss?

Hello,

First of all, I would recommend that you use ceph pg repair wherever you can.
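For example (the pg id 2.7 below is just a placeholder, substitute your own from ceph health detail):

```shell
# List the pgs currently flagged inconsistent
ceph health detail | grep inconsistent

# Inspect which object/shard the scrub found bad (pg id is a placeholder)
rados list-inconsistent-obj 2.7 --format=json-pretty

# Ask the primary osd to repair the pg
ceph pg repair 2.7
```

The list-inconsistent-obj output shows per-shard errors (checksum mismatch, missing, size mismatch), which helps you judge whether the repair picked a sane copy.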


When you have size=3, the cluster can compare three instances, so it is easier for it to spot which two are good and which one is bad.

When you use size=2 the case is harder in oh-so-many ways:

- According to the documentation it is harder to determine which object is the faulty one.

- If an osd dies, the increased load (caused by the missing osd) plus the extra io from the recovery process hits the other osd much harder, increasing the chance that another osd dies too (a disk failure caused by the sudden spike of extra load), and then you lose your data.

- If there is bitrot in the one remaining replica, then you do not have any valid copy of your data.

So, to summarize: the experts say that it is MUCH safer to have size=3 min_size=2 (I'm far from an expert, I'm just quoting :))


So, back to the task at hand:

If you have repaired all the pgs that you could with ceph pg repair, there is a manual recovery process (written for filestore, unfortunately):

http://ceph.com/geen-categorie/ceph-manually-repair-object/

The good news is that there is a fuse client for bluestore too, so you can mount it by hand and repair it as per the linked document.
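A sketch of how that might look using ceph-objectstore-tool's fuse mode (I have not tested this myself; the osd id and paths below are examples, adjust them to your cluster):

```shell
# Stop the osd that holds the bad copy first
systemctl stop ceph-osd@12

# Mount the osd's object store (bluestore or filestore) via fuse
mkdir -p /mnt/osd12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --op fuse --mountpoint /mnt/osd12

# ...examine / remove the bad object as per the linked document...

# Unmount and bring the osd back, then repair the pg again
umount /mnt/osd12
systemctl start ceph-osd@12
```

Once the osd is back up and in, ceph pg repair on the affected pg should copy the good replica over.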


I think you could use ceph osd pool set [pool] size 3 to increase the copy count, but before that you should be certain that you have enough free space and that you will not hit the osd pg count limits.
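Something like this (the pool name "mypool" is a placeholder; check capacity before raising the size):

```shell
# Check free space and per-osd utilization first
ceph df
ceph osd df

# Then raise the replica count on the pool
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2
```

Raising size triggers backfill for every pg in the pool, so expect extra io until the third copies are in place.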


[DISCLAIMER]:
I have never done this, and I too have questions about this topic:

[Questions to the list]
How is it possible that the cluster cannot repair itself with ceph pg repair?
No good copies are remaining?
Can it not decide which copy is valid or up to date?
If so, why not, when there is a checksum and an mtime for everything?
In this inconsistent state, which object does the cluster serve when it doesn't know which one is valid?


Isn't there a way to do a more "online" repair?

A way to examine and remove objects while the osd is running?

Or better yet, a way to tell the cluster which copy should be used when repairing?

There is a command, ceph pg force-recovery, but I cannot find documentation for it.
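For the record, in Luminous it seems to take pg ids, e.g. (ids are placeholders):

```shell
# Move these pgs to the front of the recovery queue
ceph pg force-recovery 2.7 3.1a
```

As far as I can tell it only changes recovery priority; it does not resolve inconsistencies by itself.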


Kind regards,

Denes Dolhay.



On 10/28/2017 01:05 PM, Mario Giammarco wrote:
Hello,
we recently upgraded two clusters to Ceph Luminous with bluestore, and we discovered that we have many more pgs in the state active+clean+inconsistent. (Possible data damage, xx pgs inconsistent)
 
This is probably due to the checksums in bluestore, which uncover more errors.

We have some pools with replica 2 and some with replica 3.

I have read past forum threads and I have seen that Ceph does not repair inconsistent pgs automatically.

Even manual repair sometimes fails.

I would like to understand if I am losing my data:

- with replica 2 I hope that ceph chooses the right replica by looking at checksums
- with replica 3 I hope that there are no problems at all

How can I tell ceph to simply create the second replica in another place?

Because I suppose that with replica 2 and inconsistent pgs, I have only one copy of the data.

Thank you in advance for any help.

Mario







_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

