Can you share the actual map? I'm not sure exactly what "rack rules"
means here, but from your description so far I'm guessing/hoping that
you've accidentally restricted OSD choices in a way that limits the
number of peers each OSD is getting.
-Greg

On Thu, Mar 19, 2015 at 5:41 AM, Matt Conner <matt.conner@xxxxxxxxxxxxxx> wrote:
> In this case we are using rack rules with the firefly tunables. The
> testing was being done with a single, 3-copy pool with 4096 placement
> groups. This was calculated based on 10% data and 200 PGs per OSD using
> the calculator at http://ceph.com/pgcalc/.
>
> Thanks,
> Matt
>
> Matt Conner
> keepertechnology
> matt.conner@xxxxxxxxxxxxxx
> (240) 461-2657
>
> On Wed, Mar 18, 2015 at 4:11 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> On Wed, Mar 18, 2015 at 12:59 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Wed, 18 Mar 2015, Matt Conner wrote:
>> >> I'm working with a 6-rack, 18-server (3 racks of 2 servers, 3 racks
>> >> of 4 servers), 640-OSD cluster and have run into an issue when
>> >> failing a storage server or rack: the OSDs are not getting marked
>> >> down until the monitor timeout is reached, typically resulting in
>> >> all writes being blocked until the timeout.
>> >>
>> >> Each of our storage servers contains 36 OSDs, so in order to prevent
>> >> a single server from marking an OSD out (in case of network issues),
>> >> we have set our "mon_osd_min_down_reporters" value to 37. This value
>> >> works well in a smaller cluster, but unfortunately does not seem to
>> >> work so well in this large cluster. Tailing the monitor logs, I can
>> >> see that the monitor is only receiving failure reports from 9-10
>> >> unique OSDs per failed OSD.
>> >>
>> >> I've played around with "osd_heartbeat_min_peers", and it seems to
>> >> help, but I still run into issues where an OSD is not marked down.
>> >> Can anyone explain how the number of heartbeat peers is determined
>> >> and how, if necessary, I can use "osd_heartbeat_min_peers" to ensure
>> >> I have enough peering to detect failures in large clusters?
>> >
>> > The peers are normally determined by which other OSDs we share a PG
>> > with. Increasing pg_num for your pools will tend to increase this. It
>> > looks like osd_heartbeat_min_peers will do the same (to be honest I
>> > don't think I've ever had to use it for this), so I'm not sure why
>> > that isn't resolving the problem. Maybe make sure it is set
>> > significantly higher than 36? It may simply be that several of the
>> > random choices were within the same host, so that when it goes down
>> > there still aren't enough peers to mark things down.
>> >
>> > Alternatively, you can give up on the mon_osd_min_down_reporters
>> > tweak. It sounds like it creates more problems than it solves...
>>
>> This normally works out okay; in particular I'd expect a lot more than
>> 10 peer OSDs in a cluster of this size. What do your CRUSH map and
>> ruleset look like? How many pools and PGs?
>> -Greg
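
For reference, a minimal sketch of the pieces being discussed; the heartbeat
value of 72 and the file names below are illustrative assumptions, not values
taken from Matt's cluster. The pg_num falls out of the pgcalc inputs roughly as
640 OSDs x 200 PGs per OSD x 10% data / 3 copies ~= 4267, which rounds to the
power of two 4096. The CRUSH map and rules Greg is asking about can be dumped
with the standard getcrushmap/crushtool commands:

    # ceph.conf fragments (illustrative)
    [mon]
        # require failure reports from more OSDs than a single 36-OSD host can provide
        mon osd min down reporters = 37

    [osd]
        # assumed example value only; Sage suggests setting it "significantly higher than 36"
        osd heartbeat min peers = 72

    # Dump the compiled CRUSH map, decompile it to text, and list rules and pool PG counts:
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    ceph osd crush rule dump
    ceph osd pool get <poolname> pg_num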