Re: Very bad behavior when

Gregory Farnum <greg@xxxxxxxxxxx> · Tue, 4 Dec 2012 12:54:57 -0800



On Tue, Dec 4, 2012 at 12:46 PM, Sylvain Munaut
<s.munaut@xxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
>> Sorry to let this drop for so long, but is this something you've seen
>> happen before/again or otherwise reproduced? I'm not entirely sure how
>> to best test for it (other than just jerking the time around), and
>> while I can come up with scenarios where the OSD leaks memory, I've
>> got nothing for how that happens to the monitors. We've also fixed a
>> number of leaks recently that could account for part of the problem.
>
> It happened very reliably at each attempt to restart the OSD and
> stopped right when I fixed the clock.
> Just take a working cluster, take an OSD out, let it rebalance, set
> the clock of one of the OSDs 50 min too fast, and restart the OSD.
>
> I had it occur twice with the same clock sync problems. (once in a
> test cluster with just 2 osd IIRC and once in the prod cluster).
>
> I don't get it anymore because I patched the underlying problem that
> was causing the clock to jump forward 50 min.
>
> If you can't reproduce it locally, I can try to reproduce it again on
> the test cluster tomorrow.
>
> My best guess was that somehow the messages had a timestamp and it
> refused to process messages too far in the future and maybe just
> queued them while waiting (but 50 min worth of messages is a lot of
> memory). But that's really a wild guess :p

No, there's no mechanism for anything like that. I suspect it's a bug
with trying to obtain not-yet-existent cephx keys, but unfortunately I
don't think anybody has the bandwidth to deal with it right now. I've
created a bug, feel free to update if there's anything else important:
http://tracker.newdream.net/issues/3569
-Greg
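
To make the "not-yet-existent keys" suspicion concrete, here is a toy
model of time-windowed keys. It is only a sketch under stated
assumptions and does not reproduce the actual cephx code: the 20-minute
key lifetime, the "previous/current/next" window, and every name in it
(KeyServer, epoch_for, client_request) are hypothetical and chosen for
illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical key lifetime in seconds; NOT a real Ceph default.
TTL = 20 * 60


def epoch_for(t: float, ttl: int = TTL) -> int:
    """Map a wall-clock time (seconds) to a key epoch number."""
    return int(t // ttl)


@dataclass
class KeyServer:
    """Toy key server holding only previous, current and next epoch keys."""
    now: float
    ttl: int = TTL

    def available_epochs(self) -> range:
        cur = epoch_for(self.now, self.ttl)
        return range(cur - 1, cur + 2)  # previous, current, next

    def get_key(self, epoch: int) -> Optional[str]:
        # A key outside the window does not exist (yet, or any more).
        if epoch in self.available_epochs():
            return f"key-{epoch}"
        return None


def client_request(server: KeyServer, client_clock: float) -> Tuple[int, Optional[str]]:
    """A client picks the epoch from its OWN clock and asks for that key."""
    wanted = epoch_for(client_clock, server.ttl)
    return wanted, server.get_key(wanted)


if __name__ == "__main__":
    server = KeyServer(now=1_000_000)
    print("in sync:    ", client_request(server, 1_000_000))
    print("+50 minutes:", client_request(server, 1_000_000 + 50 * 60))
```

In this model a client whose clock is in sync gets a key, while one
running 50 minutes fast asks for an epoch the server has not generated
yet and gets nothing back. What a real daemon does at that point (retry,
queue, leak) is exactly the open question in the bug.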