Re: My cluster is down. Two osd:s on different hosts uses all memory on boot and then crashes.

Stefan <slissm@xxxxxxxxxxxxxx> · Mon, 13 Jun 2022 20:55:41 +0000

Hello Mara,

Thank you so much, you are a lifesaver!

I'm not very skilled with docker; normally I just use docker containers with the provided docker run commands. So it took some time before I was able to run the command inside the container and have the container access the ceph osd disk. But after some trial and error I managed to fix everything, and now my cluster is healthy again!
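
For anyone who finds this thread later, here is a rough sketch of what such an invocation might look like. It is not the exact command that was run. The OSD id (12), the data path under /var/lib/ceph/osd, and the docker flags (--privileged, the /dev bind mount, overriding the entrypoint) are assumptions for a plain, non-cephadm host; only the image name, the CEPH_ARGS values and the ceph-objectstore-tool arguments come from Mara's mail below. The OSD service must be stopped first. The sketch only prints the command so it can be checked before anything is executed.

```shell
# Assumptions (not from the thread): OSD id 12, OSD directory at
# /var/lib/ceph/osd/ceph-12, and the OSD service already stopped.
osd_id=12
osd_path="/var/lib/ceph/osd/ceph-${osd_id}"

# From Mara's mail: image name and the pg log trim settings.
image="littlefox/ceph-daemon-base:2"
ceph_args="--osd_pg_log_trim_max=50000 --osd_max_pg_log_entries=2000"

# Compose the command as text so it can be reviewed before it is run.
# --privileged and -v /dev:/dev are assumed to be needed so the container
# can reach the block device that the OSD directory links to.
cmd="docker run --rm -it --privileged -v /dev:/dev -v ${osd_path}:${osd_path} -e CEPH_ARGS='${ceph_args}' --entrypoint ceph-objectstore-tool ${image} --data-path ${osd_path} --op trim-pg-log"

# Dry run: print only.
echo "$cmd"

# After checking the printed command, it could be executed with:
#   eval "$cmd"
```

Repeat per affected OSD by changing osd_id. On a cephadm-managed host the OSD directory lives elsewhere, so osd_path would need adjusting.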

Again, thank you!

I also want to take the opportunity to thank everyone else in the ceph community for a great project!

Best regards
Stefan Lissmats

------- Original Message -------
On Monday, June 13th, 2022 at 4:42 PM, Mara Sophie Grosch <littlefox@xxxxxxxxxx> wrote:

> Hi,
>
> as someone who went through that just last week: that sounds a lot
> like the symptoms of my cluster. In case you are comfortable with docker
> (or any other container runtime), I have pushed an image [1] with quincy
> from a few days ago, which includes the fix for pglog dups. I was able
> to successfully clean my OSD with the ceph-objectstore-tool in it.
>
> Something like `CEPH_ARGS="--osd_pg_log_trim_max=50000 --osd_max_pg_log_entries=2000" ceph-objectstore-tool --data-path $osd_path --op trim-pg-log` should help (command mostly from memory,
> check it before executing it - as always).
>
> Best of luck, Mara
>
> [1] littlefox/ceph-daemon-base:2, based on commit 5d47b8e21e77a57e51781f00021f77c7967ebbe2
>
> On Mon, Jun 13, 2022 at 02:10:42PM +0000, Stefan wrote:
>
> > Hello,
> >
> > I have been running Ceph for several years and everything has been rock solid until this weekend.
> > Due to some unfortunate events my cluster at home is down.
> >
> > I have two osd:s that don't boot and the reason seems to be this issue: https://tracker.ceph.com/issues/53729
> >
> > I'm currently running version 17.2.0, but when I hit the issue I was on 16.2.7. In an attempt to fix the issue I upgraded first to 16.2.9 and then to 17.2.0, but it didn't help.
> > I also tried giving it a huge swap, but it ended up crashing anyway.
> >
> > 1. There seems to be a fix for the issue in a GitHub branch: https://github.com/NitzanMordhai/ceph/tree/wip-nitzan-pglog-dups-not-trimmed/ I don't have very advanced Ceph/Linux skills and I'm not 100% sure that I understand exactly how I should use it.
> > Do I need to compile a complete Ceph installation and run that, or can I pinpoint ceph-objectstore-tool in some way to only compile and run that?
> > 2. The issue seems to be targeted for release in 17.2.1, is there any information when that will be released?
> >
> > Any advice would be very welcome since I was running a lot of different VM:s and didn't have them all backed up.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx