Re: Very bad behavior when

Sylvain Munaut <s.munaut@xxxxxxxxxxxxxxxxxxxx> · Tue, 4 Dec 2012 21:46:11 +0100

Hi,

> Sorry to let this drop for so long, but is this something you've seen
> happen before/again or otherwise reproduced? I'm not entirely sure how
> to best test for it (other than just jerking the time around), and
> while I can come up with scenarios where the OSD leaks memory, I've
> got nothing for how that happens to the monitors. We've also fixed a
> number of leaks recently that could account for part of the problem.

It happened very reliably at each attempt to restart the OSD and
stopped right when I fixed the clock.
Just take a working cluster, take an OSD out, let it rebalance, set
the clock of one of the OSDs 50 min too fast, and restart the OSD.

I had it occur twice with the same clock sync problems (once in a
test cluster with just 2 OSDs IIRC and once in the prod cluster).

I don't get it anymore because I patched the underlying problem that
was causing the clock to jump forward 50 min.

If you can't reproduce it locally, I can try to reproduce it again on
the test cluster tomorrow.

My best guess was that somehow the messages had a timestamp and it
refused to process messages too far in the future and maybe just
queued them while waiting (but 50 min worth of messages is a lot of
memory). But that's really a wild guess :p
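
To make that guess concrete, here is a toy sketch of the suspected
mechanism. It is not Ceph code: the Receiver class, the max_future
tolerance, the message rate and the payload size are all hypothetical
and only illustrate how deferring "future" messages would make memory
grow for as long as the sender's clock stays ahead.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Message:
    timestamp: float                       # sender's clock, in seconds
    payload: bytes = field(compare=False, default=b"")


class Receiver:
    """Hypothetical receiver that defers messages stamped too far ahead."""

    def __init__(self, max_future=1.0):
        self.max_future = max_future       # assumed tolerance, in seconds
        self.deferred = []                 # min-heap ordered by timestamp
        self.processed = 0

    def receive(self, msg, now):
        # Messages from "the future" are held back instead of processed.
        if msg.timestamp > now + self.max_future:
            heapq.heappush(self.deferred, msg)
        else:
            self.processed += 1

    def drain(self, now):
        # Release deferred messages once the local clock has caught up.
        while self.deferred and self.deferred[0].timestamp <= now + self.max_future:
            heapq.heappop(self.deferred)
            self.processed += 1

    def deferred_bytes(self):
        return sum(len(m.payload) for m in self.deferred)


def simulate(skew_seconds, rate_per_second, duration_seconds, payload_size=1024):
    """Feed a Receiver from a sender whose clock runs skew_seconds ahead."""
    rx = Receiver()
    for i in range(int(rate_per_second * duration_seconds)):
        now = i / rate_per_second
        rx.drain(now)
        rx.receive(Message(now + skew_seconds, b"x" * payload_size), now)
    return rx


if __name__ == "__main__":
    rx = simulate(skew_seconds=50 * 60, rate_per_second=10, duration_seconds=60)
    print(len(rx.deferred), "messages deferred,", rx.deferred_bytes(), "bytes held")
```

With a 50 min skew nothing is released until the local clock catches
up, so in this model the queue keeps growing for the whole 50 minutes.
With no skew the queue stays empty.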

Cheers,

    Sylvain