Great, thanks, that seems to be what I needed. The OSDs are running again
and the cluster has started its long road to recovery. It looks like I'm
left with a few unfound objects and three OSDs that won't start because
they crash while reading the osdmap, but I'll see if I can work through
that.
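
For anyone finding this thread later, a minimal sketch of the commands
usually used to chase down unfound objects; the pg ID 2.5 below is a
placeholder, not one from this cluster:

    # Show which PGs report unfound objects
    ceph health detail

    # For a specific PG, list exactly which objects are unfound
    ceph pg 2.5 list_unfound

    # See which down OSDs the PG is still hoping to probe for the copies
    ceph pg 2.5 query

    # Last resort, only after every candidate OSD has been ruled out:
    # revert unfound objects to a prior version (or delete them)
    ceph pg 2.5 mark_unfound_lost revert
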
On 11/29/2022 4:25 PM, Gregory Farnum wrote:
On Tue, Nov 29, 2022 at 1:18 PM Joshua Timmer
<mrjoshuatimmer@xxxxxxxxx> wrote:
I've got a cluster in a precarious state because several nodes
have run
out of memory due to extremely large pg logs on the osds. I came
across
the pglog_hardlimit flag which sounds like the solution to the issue,
but I'm concerned that enabling it will immediately truncate the
pg logs
and possibly drop some information needed to recover the pgs.
There are
many in degraded and undersized states right now as nodes are
down. Is
it safe to enable the flag in this state? The cluster is running
luminous 12.2.13 right now.
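
For reference, a minimal sketch of enabling the flag and checking the
resulting log lengths; note that the flag requires all OSDs to be on a
version that supports it (12.2.11 or newer on luminous) and is not meant
to be unset once enabled (osd.0 below is a placeholder):

    # Confirm every OSD runs a version that understands the flag
    ceph versions

    # Enable the hard cap on pg log length
    ceph osd set pglog_hardlimit

    # The LOG column shows per-PG log entry counts, which should now
    # stay bounded
    ceph pg dump pgs

    # On an OSD's host, inspect the configured limit itself
    ceph daemon osd.0 config get osd_max_pg_log_entries
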
The hard limit will truncate the log, but all the data goes into the
backing bluestore/filestore instance at the same time. The pglogs are
used for two things:
1) detecting replayed client operations and sending the same answer
back on replays, so shorter logs mean a shorter detection window but
shouldn't be an issue;
2) enabling log-based recovery of pgs, where OSDs with overlapping logs
can identify exactly which objects have been modified and move only
those.
So if you set the hard limit, it's possible you'll induce more
backfill as fewer logs overlap. But no data will be lost.
-Greg
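
Since the original problem here was memory pressure from the logs, one
way to verify the trim actually helped is the per-daemon mempool
accounting; a sketch (osd.0 is a placeholder, run on that OSD's host):

    # Per-pool memory accounting for the daemon; the "osd_pglog"
    # section reports the bytes and item count held by pg logs
    ceph daemon osd.0 dump_mempools

Comparing the osd_pglog figures before and after enabling
pglog_hardlimit shows how much memory the trim reclaimed.
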
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx