Can you share the actual map? I'm not sure exactly what "rack rules"
means here, but from your description so far I'm guessing/hoping that
you've accidentally restricted OSD choices in a way that limits the
number of peers each OSD is getting.
-Greg

On Thu, Mar 19, 2015 at 5:41 AM, Matt Conner <matt.conner@xxxxxxxxxxxxxx> wrote:
> In this case we are using rack rules with the firefly tunables. The
> testing was being done with a single, 3-copy pool with 4096 placement
> groups. This was calculated based on 10% data and 200 PGs per OSD using
> the calculator at http://ceph.com/pgcalc/.
>
> Thanks,
> Matt
>
> Matt Conner
> keepertechnology
> matt.conner@xxxxxxxxxxxxxx
> (240) 461-2657
>
> On Wed, Mar 18, 2015 at 4:11 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> On Wed, Mar 18, 2015 at 12:59 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Wed, 18 Mar 2015, Matt Conner wrote:
>> >> I'm working with a 6-rack, 18-server (3 racks of 2 servers, 3 racks
>> >> of 4 servers), 640-OSD cluster and have run into an issue when
>> >> failing a storage server or rack: the OSDs are not getting marked
>> >> down until the monitor timeout is reached, typically resulting in
>> >> all writes being blocked until the timeout.
>> >>
>> >> Each of our storage servers contains 36 OSDs, so in order to prevent
>> >> a single server from marking an OSD out (in case of network issues),
>> >> we have set our "mon_osd_min_down_reporters" value to 37. This value
>> >> works well in a smaller cluster, but unfortunately does not seem to
>> >> work so well in this large cluster. Tailing the monitor logs, I can
>> >> see that the monitor is only receiving failure reports from 9-10
>> >> unique OSDs per failed OSD.
>> >>
>> >> I've played around with "osd_heartbeat_min_peers", and it seems to
>> >> help, but I still run into issues where an OSD is not marked down.
>> >> Can anyone explain how the number of heartbeat peers is determined
>> >> and how, if necessary, I can use "osd_heartbeat_min_peers" to ensure
>> >> I have enough peering to detect failures in large clusters?
>> >
>> > The peers are normally determined by which other OSDs we share a PG
>> > with. Increasing pg_num for your pools will tend to increase this. It
>> > looks like osd_heartbeat_min_peers will do the same (to be honest I
>> > don't think I've ever had to use it for this), so I'm not sure why
>> > that isn't resolving the problem. Maybe make sure it is set
>> > significantly higher than 36? It may simply be that several of the
>> > random choices were within the same host, so that when it goes down
>> > there still aren't enough peers to mark things down.
>> >
>> > Alternatively, you can give up on the mon_osd_min_down_reporters
>> > tweak. It sounds like it creates more problems than it solves...
>>
>> This normally works out okay; in particular I'd expect a lot more than
>> 10 peer OSDs in a cluster of this size. What do your CRUSH map and
>> ruleset look like? How many pools and PGs?
>> -Greg
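
For reference, a minimal sketch of the pieces being discussed; the heartbeat
value of 72 and the file names below are illustrative assumptions, not values
taken from Matt's cluster. The pg_num falls out of the pgcalc inputs roughly as
640 OSDs x 200 PGs per OSD x 10% data / 3 copies ~= 4267, which rounds to the
power of two 4096. The CRUSH map and rules Greg is asking about can be dumped
with the standard getcrushmap/crushtool commands:

    # ceph.conf fragments (illustrative)
    [mon]
        # require failure reports from more OSDs than a single 36-OSD host can provide
        mon osd min down reporters = 37

    [osd]
        # assumed example value only; Sage suggests setting it "significantly higher than 36"
        osd heartbeat min peers = 72

    # Dump the compiled CRUSH map, decompile it to text, and list rules and pool PG counts:
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    ceph osd crush rule dump
    ceph osd pool get <poolname> pg_num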