I'm playing with Mark's cluster, where he is seeing high ceph-mon CPU utilization when he creates new big pools. I'm able to fairly reliably reproduce a livelock where the monitor is stuck checking is_readable on queued auth requests long enough that it times out on the election and has to start all over again.

I see two issues:

- The PaxosService stuff is pulling values directly out of leveldb, and that is slow in this case. Not completely sure why (compaction in the background? who knows). But it's also unnecessary, except that there is currently no notification to the PaxosService instances when the underlying data changes. That most easily plugs into a few places in the Paxos class and on startup, but the way the layering is structured it's not very clean. Not sure what the right way to fix this up is, but I think we do want some sort of PaxosService::refresh() that tells us whenever things changed; it can be the one to call the child's update_from_paxos().

- We should be able to discard those auth messages (and others!) if the original connection they came from has disconnected, which it normally will have, since the client disconnects after 3 seconds (by default).

There is also wip-mon-trim, which will lower the trim periodicity (and compaction) for the paxos states; that ought to help some as well, but I haven't tried it yet...

sage
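
To make the two ideas above concrete, here is a rough standalone sketch. It is not the real monitor code: every Toy* type, the `connected` flag, and discard_dead() are hypothetical stand-ins invented for illustration. Only the names refresh(), update_from_paxos(), and is_readable come from the discussion above. The Paxos stand-in calls refresh() on each service at registration (startup) and after every commit, so is_readable and value lookups are answered from memory instead of going back to the store. Queued requests whose connection has gone away are dropped before any work is done on them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the slow leveldb-backed store.
struct ToyStore {
  std::map<std::string, std::string> kv;
  uint64_t version = 0;
  int reads = 0;  // counts how often the slow read path is taken

  std::string get(const std::string& key) {
    ++reads;
    auto it = kv.find(key);
    return it == kv.end() ? std::string() : it->second;
  }
  void commit(const std::string& key, const std::string& val) {
    kv[key] = val;
    ++version;
  }
};

// Hypothetical service base class: caches state, refreshed on notification.
class ToyPaxosService {
public:
  ToyPaxosService(ToyStore& s, std::string k) : store(s), key(std::move(k)) {}
  virtual ~ToyPaxosService() = default;

  // Called by the Paxos layer at startup and whenever committed state changes.
  void refresh() {
    cached_version = store.version;
    update_from_paxos();
  }
  // Answered from the cached version; never touches the store.
  bool is_readable(uint64_t v) const { return cached_version >= v; }

protected:
  virtual void update_from_paxos() = 0;
  ToyStore& store;
  std::string key;
  uint64_t cached_version = 0;
};

class ToyAuthService : public ToyPaxosService {
public:
  using ToyPaxosService::ToyPaxosService;
  const std::string& value() const { return cached; }

protected:
  // The only place this service reads from the store.
  void update_from_paxos() override { cached = store.get(key); }

private:
  std::string cached;
};

// Hypothetical Paxos layer: notifies registered services after each commit.
class ToyPaxos {
public:
  explicit ToyPaxos(ToyStore& s) : store(s) {}
  void register_service(ToyPaxosService* svc) {
    assert(svc != nullptr);
    services.push_back(svc);
    svc->refresh();  // startup notification
  }
  void commit(const std::string& key, const std::string& val) {
    store.commit(key, val);
    for (auto* svc : services) svc->refresh();
  }

private:
  ToyStore& store;
  std::vector<ToyPaxosService*> services;
};

// Hypothetical connection and queued request types.
struct ToyConnection { bool connected = true; };
struct ToyRequest {
  std::shared_ptr<ToyConnection> con;
  std::string payload;
};

// Drop queued requests whose originating connection is gone.
// Returns the number of requests discarded.
std::size_t discard_dead(std::deque<ToyRequest>& q) {
  const std::size_t before = q.size();
  q.erase(std::remove_if(q.begin(), q.end(),
                         [](const ToyRequest& r) {
                           return !r.con || !r.con->connected;
                         }),
          q.end());
  return before - q.size();
}
```

In this sketch the number of store reads scales with the number of commits rather than with the number of queued requests, which is the property we want when a burst of auth requests is waiting on is_readable. Where exactly the real Paxos class should make the refresh() call is still the open layering question.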