Re: Inconsistent PG, repair doesn't work

My usual workaround for that is to set the noscrub and nodeep-scrub flags and wait (sometimes as long as 3 hours) until all the scheduled scrubs finish. Then a manually issued scrub or repair starts immediately. After that I unset the scrub-blocking flags.
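For reference, that sequence looks roughly like this (the pg id 75.302 is taken from later in this thread; the flags are cluster-wide):

```shell
# Block new scheduled scrubs cluster-wide, then wait for in-flight scrubs to drain
ceph osd set noscrub
ceph osd set nodeep-scrub

# Once "ceph -s" no longer shows active scrubs, the manual repair starts immediately
ceph pg repair 75.302

# Re-enable scheduled scrubbing afterwards
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```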

General advice regarding pg repair: do not run it without fully understanding what kind of data error has been discovered, how many replicas you have, how many of them are affected, etc. In some cases pg repair can be dangerous to the data.

On Thu, Oct 11, 2018 at 7:54 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
Yeah, improving that workflow is in the backlog (or maybe it's done in master? I forget). But it's complicated, so for now that's just how it goes. :(

On Thu, Oct 11, 2018 at 10:27 AM Brett Chancellor <bchancellor@xxxxxxxxxxxxxx> wrote:
This seems like a bug. If I'm kicking off a repair manually, it should take place immediately and ignore flags such as max scrubs or the minimum scrub window.

-Brett

On Thu, Oct 11, 2018 at 1:11 PM David Turner <drakonstein@xxxxxxxxx> wrote:
Part of a repair is queuing a deep scrub. As soon as the repair part is over, the deep scrub continues until it is done.

On Thu, Oct 11, 2018, 12:26 PM Brett Chancellor <bchancellor@xxxxxxxxxxxxxx> wrote:
Does the "repair" function use the same rules as a deep scrub? I couldn't get one to kick off until I temporarily increased osd_max_scrubs and lowered osd_scrub_min_interval on all 3 OSDs for that placement group. This ended up fixing the issue, so I'll leave this here in case somebody else runs into it.

sudo ceph tell 'osd.208' injectargs '--osd_max_scrubs 3'
sudo ceph tell 'osd.120' injectargs '--osd_max_scrubs 3'
sudo ceph tell 'osd.235' injectargs '--osd_max_scrubs 3'
sudo ceph tell 'osd.208' injectargs '--osd_scrub_min_interval 1.0'
sudo ceph tell 'osd.120' injectargs '--osd_scrub_min_interval 1.0'
sudo ceph tell 'osd.235' injectargs '--osd_scrub_min_interval 1.0'
sudo ceph pg repair 75.302
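If you use this workaround, remember that injectargs changes only the running daemons, and you'll probably want to put the values back once the repair finishes. The commands below assume the stock defaults (osd_max_scrubs = 1, osd_scrub_min_interval = 86400 seconds, i.e. one day); check your own configuration before applying them:

```shell
# Restore the assumed default scrub settings after the repair completes
sudo ceph tell 'osd.208' injectargs '--osd_max_scrubs 1 --osd_scrub_min_interval 86400'
sudo ceph tell 'osd.120' injectargs '--osd_max_scrubs 1 --osd_scrub_min_interval 86400'
sudo ceph tell 'osd.235' injectargs '--osd_max_scrubs 1 --osd_scrub_min_interval 86400'
```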

-Brett


On Thu, Oct 11, 2018 at 8:42 AM Maks Kowalik <maks_kowalik@xxxxxxxxx> wrote:
Imho moving was not the best idea (a copying attempt would have shown whether a read error was the cause here).
Scrubs might not start if there are many other scrubs ongoing.

On Thu, Oct 11, 2018 at 2:27 PM Brett Chancellor <bchancellor@xxxxxxxxxxxxxx> wrote:
I moved the file. But the cluster won't actually start any scrub/repair I manually initiate.

On Thu, Oct 11, 2018, 7:51 AM Maks Kowalik <maks_kowalik@xxxxxxxxx> wrote:
Based on the log output it looks like you have a damaged file on OSD 235, where the shard is stored.
To confirm that's the case, you should find the file (using 81d5654895863d as part of its name) and try to copy it to another directory.
If you get an I/O error while copying, the next steps would be to delete the file, run a scrub on 75.302, and take a deep look at OSD.235 for any other errors.
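On a FileStore OSD the object shards are plain files under the OSD's current/ directory, so the check could look roughly like the sketch below. The mount path assumes a default deployment; BlueStore OSDs have no such per-object files and would need ceph-objectstore-tool instead:

```shell
# Locate the suspect shard by the fragment of its object name
find /var/lib/ceph/osd/ceph-235/current -name '*81d5654895863d*'

# Try to copy the file the find command printed; an I/O error
# here points at a bad sector rather than a logical inconsistency
cp '<path printed by find>' /tmp/
```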

Kind regards,
Maks
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com