On Tue, Dec 4, 2012 at 12:46 PM, Sylvain Munaut <s.munaut@xxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
>> Sorry to let this drop for so long, but is this something you've seen
>> happen before/again or otherwise reproduced? I'm not entirely sure how
>> to best test for it (other than just jerking the time around), and
>> while I can come up with scenarios where the OSD leaks memory, I've
>> got nothing for how that happens to the monitors. We've also fixed a
>> number of leaks recently that could account for part of the problem.
>
> It happened very reliably at each attempt to restart the OSD and
> stopped right when I fixed the clock.
> Just take a working cluster, take an osd out, let it rebalance, set
> the clock of one of the OSDs 50 min too fast, and restart the OSD.
>
> I had it occur twice with the same clock sync problems (once in a
> test cluster with just 2 osd IIRC and once in the prod cluster).
>
> I don't get it anymore because I patched the underlying problem that
> was causing the clock to jump forward 50 min.
>
> If you can't reproduce it locally, I can try to reproduce it again on
> the test cluster tomorrow.
>
> My best guess was that somehow the messages had a timestamp and it
> refused to process messages too far in the future and maybe just
> queued them while waiting (but 50 min worth of messages is a lot of
> memory). But that's really a wild guess :p

No, there's no mechanism for anything like that. I suspect it's a bug
with trying to obtain not-yet-existent cephx keys, but unfortunately I
don't think anybody has the bandwidth to deal with it right now. I've
created a bug, feel free to update if there's anything else important:
http://tracker.newdream.net/issues/3569
-Greg
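
For anyone picking up the ticket, the suspicion above (a daemon whose clock runs 50 min fast asking for keys that do not exist yet) can be pictured with a toy model. This is only a sketch under stated assumptions, not Ceph's cephx code: ToyKeyServer, request_key and KEY_TTL_SECONDS are hypothetical names, and the one-hour key window is picked purely for illustration. It says nothing about where the memory actually goes.

```python
from dataclasses import dataclass

# Assumption for illustration only: keys are valid for fixed one-hour windows.
KEY_TTL_SECONDS = 3600


@dataclass
class ToyKeyServer:
    """Hypothetical stand-in for a service handing out time-windowed keys."""

    now: float  # the server's own clock, in seconds

    def current_epoch(self) -> int:
        # Index of the key window the server's clock currently falls in.
        return int(self.now // KEY_TTL_SECONDS)

    def get_key(self, epoch: int):
        # In this toy model the server only has keys up to its current window.
        if epoch > self.current_epoch():
            return None  # the requested key has not been generated yet
        return f"key-{epoch}"


def request_key(server: ToyKeyServer, skew_seconds: float):
    """Ask for the key matching the caller's (possibly skewed) clock."""
    caller_clock = server.now + skew_seconds
    wanted_epoch = int(caller_clock // KEY_TTL_SECONDS)
    return server.get_key(wanted_epoch)


if __name__ == "__main__":
    server = ToyKeyServer(now=37800.0)  # halfway through window 10
    print("in sync:     ", request_key(server, 0))
    print("50 min ahead:", request_key(server, 50 * 60))
```

In this model the in-sync caller gets a key for the current window, while the caller running 50 min ahead lands in the next window and gets nothing back. What the real daemon does after such a miss is exactly the open question in the ticket.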