On Thu, 19 Mar 2015, Gregory Farnum wrote:
> Can you share the actual map? I'm not sure exactly what "rack rules"
> means here but from your description so far I'm guessing/hoping that
> you've accidentally restricted OSD choices in a way that limits the
> number of peers each OSD is getting.

It may also be that the pg_num calculation is off.  If OSDs really
have ~200 PGs, you would expect ~200-400 peers, not ~10.  Perhaps you
plugged the number of OSD *hosts* into your calculation instead of the
number of disks?

Either way, the 'ceph osd dump' and 'ceph osd crush dump' output
should clear things up!

sage

> -Greg
>
> On Thu, Mar 19, 2015 at 5:41 AM, Matt Conner <matt.conner@xxxxxxxxxxxxxx> wrote:
> > In this case we are using rack rules with the firefly tunables. The
> > testing was done with a single 3-copy pool with pg_num 4096. This
> > was calculated based on 10% data and 200 PGs per OSD using the
> > calculator at http://ceph.com/pgcalc/.
> >
> > Thanks,
> > Matt
> >
> > Matt Conner
> > keepertechnology
> > matt.conner@xxxxxxxxxxxxxx
> > (240) 461-2657
> >
> > On Wed, Mar 18, 2015 at 4:11 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> >>
> >> On Wed, Mar 18, 2015 at 12:59 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> > On Wed, 18 Mar 2015, Matt Conner wrote:
> >> >> I'm working with a 6-rack, 18-server (3 racks of 2 servers, 3
> >> >> racks of 4 servers), 640-OSD cluster and have run into an issue
> >> >> when failing a storage server or rack where the OSDs are not
> >> >> getting marked down until the monitor timeout is reached -
> >> >> typically resulting in all writes being blocked until the
> >> >> timeout.
> >> >>
> >> >> Each of our storage servers contains 36 OSDs, so in order to
> >> >> prevent a single server from marking an OSD out (in case of
> >> >> network issues), we have set our "mon_osd_min_down_reporters"
> >> >> value to 37. This value worked great in a smaller cluster, but
> >> >> unfortunately does not seem to work so well in this large one.
> >> >> Tailing the monitor logs, I can see that the monitor is only
> >> >> receiving failure reports from 9-10 unique OSDs per failed OSD.
> >> >>
> >> >> I've played around with "osd_heartbeat_min_peers", and it seems
> >> >> to help, but I still run into issues where an OSD is not marked
> >> >> down. Can anyone explain how the number of heartbeat peers is
> >> >> determined and how, if necessary, I can use
> >> >> "osd_heartbeat_min_peers" to ensure I have enough peering to
> >> >> detect failures in large clusters?
> >> >
> >> > The peers are normally determined by which other OSDs we share a
> >> > PG with.  Increasing pg_num for your pools will tend to increase
> >> > this.  It looks like osd_heartbeat_min_peers will do the same (to
> >> > be honest I don't think I've ever had to use it for this), so I'm
> >> > not sure why that isn't resolving the problem.  Maybe make sure
> >> > it is set significantly higher than 36?  It may simply be that
> >> > several of the random choices were within the same host, so that
> >> > when it goes down there still aren't enough peers to mark things
> >> > down.
> >> >
> >> > Alternatively, you can give up on the mon_osd_min_down_reporters
> >> > tweak.  It sounds like it creates more problems than it solves...
> >>
> >> This normally works out okay; in particular I'd expect a lot more
> >> than 10 peer OSDs in a cluster of this size. What's your CRUSH map
> >> and ruleset look like? How many pools and PGs?
> >> -Greg
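
A quick sanity check of the arithmetic under discussion (a sketch
only: the formula is the one http://ceph.com/pgcalc/ is based on, and
the sizes are the ones given in this thread):

    #!/bin/sh
    # pgcalc-style estimate: (target PGs per OSD * number of OSDs *
    # fraction of data in the pool) / replica count, rounded to a
    # power of two.

    num_osds=640        # disks, NOT the 18 hosts
    target_per_osd=200
    pct_data=10         # this pool holds ~10% of the data
    replicas=3

    echo "pg_num from disks: $(( target_per_osd * num_osds * pct_data / 100 / replicas ))"
    # -> 4266, which rounds to the 4096 Matt reports

    echo "pg_num from hosts: $(( target_per_osd * 18 * pct_data / 100 / replicas ))"
    # -> 120; plugging in the host count, as Sage suspects, would be
    # roughly 35x too low

    # The maps Sage asks for, to see what the cluster actually has:
    ceph osd dump | grep pg_num        # per-pool pg_num/pgp_num
    ceph osd crush dump > crush.json   # CRUSH buckets and rules, as JSON

Note that using the 18 hosts instead of the 640 disks would undershoot
by a factor of ~35, which is the same order as the gap between the
expected (~200-400) and observed (~10) peer counts.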
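
For anyone tuning the two options discussed above, a minimal sketch of
where they live; the values mirror the thread (37 so that one 36-OSD
host cannot by itself satisfy the reporter threshold, and 100 as an
arbitrary illustrative choice for "significantly higher than 36"),
not tested recommendations:

    # In ceph.conf (persistent, takes effect on daemon restart):
    #   [mon]
    #   mon osd min down reporters = 37    # more than one host's 36 OSDs
    #   [osd]
    #   osd heartbeat min peers = 100      # "significantly higher than 36"

    # Or inject at runtime to experiment before committing to ceph.conf:
    ceph tell 'osd.*' injectargs '--osd_heartbeat_min_peers 100'
    ceph tell 'mon.*' injectargs '--mon_osd_min_down_reporters 37'

    # Then fail a host and watch whether its OSDs are marked down
    # promptly rather than waiting for the monitor timeout:
    ceph -w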