Thanks for that. Seeing 'health err' so frequently has led to
worrisome 'alarm fatigue', so yes, that's half of what I want
to do.

The number of copies of a pg in the crush map drives how
time-critical and human-intervention-critical the pg repair
process is. Having several copies makes automatic pg repair
reasonable, but only if there's a way to log the count of
repairs filed against pg's on the same osd since it was last
marked 'in'. I'd love to have looking at that list be a
periodic staffer chore for pro-active osd replacement.

Appreciate the lead for the setting.
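
The per-osd repair tally described above could be sketched roughly as follows. This is only an illustration: the record format (osd id, pg id, repair time), the `marked_in` mapping, and the threshold of 3 are all assumptions made up for the example, and how those records would be collected from a real cluster is not shown.

```python
from collections import Counter
from datetime import datetime


def repairs_since_in(repairs, marked_in):
    """Count repairs per osd that happened after the osd was last marked 'in'.

    repairs:   iterable of (osd_id, pg_id, repaired_at) tuples (hypothetical format)
    marked_in: dict of osd_id -> datetime the osd was last marked 'in'
    """
    counts = Counter()
    for osd_id, _pg_id, repaired_at in repairs:
        since = marked_in.get(osd_id)
        # Ignore repairs older than the last 'in' event; count all if unknown.
        if since is None or repaired_at >= since:
            counts[osd_id] += 1
    return counts


def replacement_candidates(counts, threshold=3):
    """Return osd ids whose repair count meets the threshold, worst first."""
    return [osd for osd, n in counts.most_common() if n >= threshold]


# Made-up sample data, for illustration only.
repairs = [
    (12, "2.1f", datetime(2019, 7, 1)),
    (12, "2.3a", datetime(2019, 7, 20)),
    (12, "5.07", datetime(2019, 8, 1)),
    (12, "5.10", datetime(2019, 8, 3)),
    (4, "2.1f", datetime(2019, 8, 2)),
]
marked_in = {12: datetime(2019, 7, 10), 4: datetime(2019, 6, 1)}

counts = repairs_since_in(repairs, marked_in)
print(replacement_candidates(counts))
```

The list returned by `replacement_candidates` is what a staffer would review periodically; the threshold would need tuning per cluster.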
On 8/4/19 10:47 AM, Brett Chancellor wrote:
If all you want to do is repair the pg when a scrub finds it
inconsistent, you could set osd_scrub_auto_repair to true.
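
One possible way to apply that setting is sketched below. It assumes a release with the centralized `ceph config set` command (Mimic or later); older clusters would need ceph.conf or injectargs instead. The helper names `auto_repair_commands` and `apply` are hypothetical, and the companion option `osd_scrub_auto_repair_num_errors` (the error count above which auto repair is skipped) is added here as an assumption about what one might also want to set.

```python
import subprocess


def auto_repair_commands(max_errors=5):
    """Build the ceph CLI calls that enable scrub auto repair for all osds."""
    return [
        ["ceph", "config", "set", "osd", "osd_scrub_auto_repair", "true"],
        # Auto repair is skipped when a scrub finds more errors than this.
        ["ceph", "config", "set", "osd",
         "osd_scrub_auto_repair_num_errors", str(max_errors)],
    ]


def apply(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


# Dry run by default, so nothing touches the cluster.
apply(auto_repair_commands())
```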
Question:
If you have enough osds, it seems an almost daily thing that
when you get to work in the morning there's a "ceph health
error", "1 pg inconsistent", arising from a 'scrub error'. Or
2, etc. Then, like most such mornings, you look and see there
are two or more valid instances of the pg and one with an
issue. So, like putting on socks, it just takes time every
day: there's the 'ceph pg repair xx' (making note of the
likely soon-to-fail osd), then hey presto, on with the day.

Am I missing some way to automate this and be notified only if
one attempt at pg repair has failed, with just a log entry for
successful repairs? I don't need calls about dashboard
"HEALTH ERR" warnings so often.

Ideas welcome!
Thanks
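
The morning routine described above (find the inconsistent pg, repair it, note the osd, alert only on failure) might be scripted along these lines. This is a sketch under assumptions: the regex reflects the usual shape of `ceph health detail` lines but is not guaranteed across releases, and `run_repair` and `notify` are hypothetical callables the operator would supply. Note that `ceph pg repair` only queues the repair, so a real `run_repair` would have to re-check the pg later to decide whether the attempt actually succeeded.

```python
import logging
import re

log = logging.getLogger("pg-repair")

# Assumed line shape from `ceph health detail`, e.g.
#   pg 2.1f is active+clean+inconsistent, acting [12,4,7]
PG_LINE = re.compile(
    r"pg (\d+\.[0-9a-f]+) is ([a-z+_]*inconsistent[a-z+_]*), acting \[([\d,]+)\]"
)


def inconsistent_pgs(health_detail_text):
    """Return [(pgid, [acting osd ids])] for pgs reported as inconsistent."""
    found = []
    for m in PG_LINE.finditer(health_detail_text):
        osds = [int(x) for x in m.group(3).split(",")]
        found.append((m.group(1), osds))
    return found


def repair_all(pgs, run_repair, notify):
    """Try one repair per pg; log successes, notify only on failure.

    run_repair(pgid) -> bool and notify(message) are supplied by the caller
    (both hypothetical); returns the list of pg ids whose repair failed.
    """
    failed = []
    for pgid, osds in pgs:
        if run_repair(pgid):
            log.info("pg %s repaired (acting osds %s)", pgid, osds)
        else:
            failed.append(pgid)
            notify(f"pg repair failed for {pgid} (acting osds {osds})")
    return failed


# Made-up sample output and stub callables, for illustration only.
sample = """HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
    pg 2.1f is active+clean+inconsistent, acting [12,4,7]
    pg 5.7 is active+clean+inconsistent, acting [3,12,9]
"""
pgs = inconsistent_pgs(sample)
alerts = []
failed = repair_all(pgs, run_repair=lambda pgid: pgid != "5.7",
                    notify=alerts.append)
```

Passing the repair and notification steps in as callables keeps the sketch testable without a cluster; the acting osd list in each log line is what would feed the "likely soon-to-fail osd" notes.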
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com