Konstantin,
Thanks for your answer, I will run a "ceph pg repair".
Could you maybe elaborate a bit on how this repair process works? Does it just try to re-read the object from the OSD that reported the read_error?
IIRC there was a time when "ceph pg repair" wasn't considered 'safe' because it simply copied the primary OSD's shard contents to the other OSDs.
Since when did this change?
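For context, this is roughly how I've been looking at the two PGs and what I intend to run next (the <pgid> below is just a placeholder for our actual PG ids):

  ceph health detail                                         # shows which PGs are inconsistent
  rados list-inconsistent-obj <pgid> --format=json-pretty    # shows which shard reported the read_error
  ceph pg repair <pgid>                                      # the repair itself, per your advice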
Btw, I woke up this morning with only one active+clean+inconsistent PG left, so the other one apparently already triggered a new (deep) scrub, re-read the primary OSD and found it good.
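If the remaining PG doesn't clear up by itself, I assume I can also kick off a deep scrub manually before deciding on the repair:

  ceph pg deep-scrub <pgid>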
I noticed these read_errors tend to occur on this installation when available RAM gets low (we still have to reboot the cluster nodes once in a while to free up RAM).
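Next time it happens I'll try to check memory pressure on the node first, e.g. with free -h on the host and, if I read the docs correctly, the per-OSD mempool stats from the admin socket (osd.0 is just an example id):

  ceph daemon osd.0 dump_mempools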
Furthermore, we will upgrade to 12.2.12 soon.
Caspar Smit
Systemengineer
SuperNAS
Dorsvlegelstraat 13
1445 PA Purmerend
t: (+31) 299 410 414
e: casparsmit@xxxxxxxxxxx
w: www.supernas.eu
On Thu, 5 Dec 2019 at 07:26, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
Yes, you should call pg repair. Also it's better to upgrade to 12.2.12.

> I tried to dig in the mailing list archives but couldn't find a clear answer to the following situation:
> Ceph encountered a scrub error resulting in HEALTH_ERR. Two PGs are active+clean+inconsistent.
> When investigating the PG I see a "read_error" on the primary OSD. Both PGs are replicated PGs with 3 copies.
> I'm on Luminous 12.2.5 on this installation, is it safe to just run "ceph pg repair" on those PGs or will it then overwrite the two good copies with the bad one from the primary?
> If the latter is true, what is the correct way to resolve this?
k