On Thu, 19 Mar 2015, Gregory Farnum wrote:
> Can you share the actual map? I'm not sure exactly what "rack rules"
> means here but from your description so far I'm guessing/hoping that
> you've accidentally restricted OSD choices in a way that limits the
> number of peers each OSD is getting.

It may also be that the pg_num calculation is off.  If OSDs really
have ~200 PGs, you would expect ~200-400 peers, not ~10.  Perhaps you
plugged the number of OSD *hosts* into your calculation instead of the
number of disks?

Either way, the 'ceph osd dump' and 'ceph osd crush dump' output
should clear things up!

sage

> -Greg
>
> On Thu, Mar 19, 2015 at 5:41 AM, Matt Conner <matt.conner@xxxxxxxxxxxxxx> wrote:
> > In this case we are using rack rules with the firefly tunables. The
> > testing was done with a single 3-copy pool with pg_num 4096. This
> > was calculated based on 10% data and 200 PGs per OSD using the
> > calculator at http://ceph.com/pgcalc/.
> >
> > Thanks,
> > Matt
> >
> > Matt Conner
> > keepertechnology
> > matt.conner@xxxxxxxxxxxxxx
> > (240) 461-2657
> >
> > On Wed, Mar 18, 2015 at 4:11 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> >>
> >> On Wed, Mar 18, 2015 at 12:59 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> > On Wed, 18 Mar 2015, Matt Conner wrote:
> >> >> I'm working with a 6-rack, 18-server (3 racks of 2 servers, 3
> >> >> racks of 4 servers), 640-OSD cluster and have run into an issue
> >> >> when failing a storage server or rack where the OSDs are not
> >> >> getting marked down until the monitor timeout is reached -
> >> >> typically resulting in all writes being blocked until the
> >> >> timeout.
> >> >>
> >> >> Each of our storage servers contains 36 OSDs, so in order to
> >> >> prevent a single server from marking an OSD out (in case of
> >> >> network issues), we have set our "mon_osd_min_down_reporters"
> >> >> value to 37. This value worked great in a smaller cluster, but
> >> >> unfortunately does not seem to work so well in this large one.
> >> >> Tailing the monitor logs, I can see that the monitor is only
> >> >> receiving failure reports from 9-10 unique OSDs per failed OSD.
> >> >>
> >> >> I've played around with "osd_heartbeat_min_peers", and it seems
> >> >> to help, but I still run into issues where an OSD is not marked
> >> >> down. Can anyone explain how the number of heartbeat peers is
> >> >> determined and how, if necessary, I can use
> >> >> "osd_heartbeat_min_peers" to ensure I have enough peering to
> >> >> detect failures in large clusters?
> >> >
> >> > The peers are normally determined by which other OSDs we share a
> >> > PG with.  Increasing pg_num for your pools will tend to increase
> >> > this.  It looks like osd_heartbeat_min_peers will do the same (to
> >> > be honest I don't think I've ever had to use it for this), so I'm
> >> > not sure why that isn't resolving the problem.  Maybe make sure
> >> > it is set significantly higher than 36?  It may simply be that
> >> > several of the random choices were within the same host, so that
> >> > when it goes down there still aren't enough peers to mark things
> >> > down.
> >> >
> >> > Alternatively, you can give up on the mon_osd_min_down_reporters
> >> > tweak.  It sounds like it creates more problems than it solves...
> >>
> >> This normally works out okay; in particular I'd expect a lot more
> >> than 10 peer OSDs in a cluster of this size. What's your CRUSH map
> >> and ruleset look like? How many pools and PGs?
> >> -Greg
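
A quick sanity check of the arithmetic under discussion (a sketch
only: the formula is the one http://ceph.com/pgcalc/ is based on, and
the sizes are the ones given in this thread):

    #!/bin/sh
    # pgcalc-style estimate: (target PGs per OSD * number of OSDs *
    # fraction of data in the pool) / replica count, rounded to a
    # power of two.

    num_osds=640        # disks, NOT the 18 hosts
    target_per_osd=200
    pct_data=10         # this pool holds ~10% of the data
    replicas=3

    echo "pg_num from disks: $(( target_per_osd * num_osds * pct_data / 100 / replicas ))"
    # -> 4266, which rounds to the 4096 Matt reports

    echo "pg_num from hosts: $(( target_per_osd * 18 * pct_data / 100 / replicas ))"
    # -> 120; plugging in the host count, as Sage suspects, would be
    # roughly 35x too low

    # The maps Sage asks for, to see what the cluster actually has:
    ceph osd dump | grep pg_num        # per-pool pg_num/pgp_num
    ceph osd crush dump > crush.json   # CRUSH buckets and rules, as JSON

Note that using the 18 hosts instead of the 640 disks would undershoot
by a factor of ~35, which is the same order as the gap between the
expected (~200-400) and observed (~10) peer counts.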
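
For anyone tuning the two options discussed above, a minimal sketch of
where they live; the values mirror the thread (37 so that one 36-OSD
host cannot by itself satisfy the reporter threshold, and 100 as an
arbitrary illustrative choice for "significantly higher than 36"),
not tested recommendations:

    # In ceph.conf (persistent, takes effect on daemon restart):
    #   [mon]
    #   mon osd min down reporters = 37    # more than one host's 36 OSDs
    #   [osd]
    #   osd heartbeat min peers = 100      # "significantly higher than 36"

    # Or inject at runtime to experiment before committing to ceph.conf:
    ceph tell 'osd.*' injectargs '--osd_heartbeat_min_peers 100'
    ceph tell 'mon.*' injectargs '--mon_osd_min_down_reporters 37'

    # Then fail a host and watch whether its OSDs are marked down
    # promptly rather than waiting for the monitor timeout:
    ceph -w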