There are lots of scenarios in a large cluster where an admin mistake or misconfiguration can cause a bajillion PGs to pile up on one OSD. This causes various problems, most of which are difficult to diagnose and get out of because things are so heavily loaded. We need a way to prevent people/clusters from shooting themselves in the foot in this particular way.

Here's a rough proposal:

- config option 'osd max pgs = 500' (or something similar)
- if the osd is at or above pg_max (pg_map.size()), it will silently drop any pg peering messages or requests for pgs it doesn't already have
- if the osd reaches pg_max, it will set a bit or flag in the osd_stat_t it reports to the monitor
- when the osd drops below pg_max, it will clear that bit

...and the tricky part...

- when the mon sees that osd bit clear, it will do something to the osdmap and issue a new epoch. That something will either trigger an interval change for the osd or otherwise induce any unpeered pgs that include that osd to restart peering or resend messages to that osd.

I'm not sure we want to simply trigger an interval change, as that will restart peering on already-peered PGs at a time when the OSD is under load. A more targeted way to kick peering (and resend the potentially dropped messages) just for unpeered PGs would be ideal.

Eventually I'd like to see us include this in the thrashing tests by setting pg_max to something reasonably low (2x or 3x the target/average pgs per osd) and making one of the thrashing operations skew the crush weights, ideally in a way that makes an overloaded OSD need to drain previous PGs before it can accept new ones.

Thoughts?

sage
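
To make the OSD-side check concrete, here is a minimal standalone sketch of the guard described above. It is illustrative only: the names (osd_max_pgs, PG_LIMIT_FLAG, handle_pg_peering_msg) and the simplified types are assumptions standing in for the real Ceph config option, osd_stat_t field, and message-handling path, not actual Ceph code.

  // Standalone sketch of the proposed per-OSD PG limit (hypothetical names).
  #include <cstdint>
  #include <map>
  #include <memory>

  struct pg_t { uint64_t id; bool operator<(const pg_t& o) const { return id < o.id; } };
  struct PG {};                                   // stand-in for the real PG class
  struct osd_stat_t { uint32_t flags = 0; };      // simplified stat struct

  constexpr uint32_t PG_LIMIT_FLAG = 1u << 0;     // hypothetical "at pg limit" bit

  struct OSDSketch {
    std::map<pg_t, std::shared_ptr<PG>> pg_map;   // PGs this OSD already has
    size_t osd_max_pgs = 500;                     // proposed config option
    osd_stat_t stat;                              // reported to the monitor

    // Returns false (message silently dropped) when the OSD is at or above
    // the limit and the message is for a PG it does not already have.
    bool handle_pg_peering_msg(pg_t pgid) {
      bool have_pg = pg_map.count(pgid);
      if (!have_pg && pg_map.size() >= osd_max_pgs) {
        stat.flags |= PG_LIMIT_FLAG;              // tell the mon we're saturated
        return false;                             // drop: do not instantiate the PG
      }
      if (!have_pg)
        pg_map[pgid] = std::make_shared<PG>();    // accept the new PG
      if (pg_map.size() < osd_max_pgs)
        stat.flags &= ~PG_LIMIT_FLAG;             // below the limit again: clear the bit
      return true;
    }
  };

The monitor side would then watch for the flag clearing in the reported osd_stat_t and react as described above (trigger something in the osdmap to re-kick peering for the affected, unpeered PGs).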