On 04/03/2013 12:25 PM, Sage Weil wrote:
> > > Sorry, guess I forgot some of the history since this piece at least is
> > > resolved now.  I'm surprised if 30-second timeouts are causing issues
> > > without those overloads you were seeing; have you seen this issue
> > > without your high debugging levels and without the bad PG commits (due
> > > to debugging)?
> >
> > I think so, because that's why I started with higher debugging
> > levels.
> >
> > But, as it turns out, I'm just in the process of returning to my
> > testing of next, with all my debugging back to 0.  So, I'll try
> > the default timeout of 30 seconds first.  If I have trouble starting
> > up a new file system, I'll turn up the timeout and try again, without
> > any extra debugging.  Either way, I'll let you know what happens.
>
> I would be curious to hear roughly what value between 30 and 300 is
> sufficient, if you can experiment just a bit.  We probably want to
> adjust the default.
>
> Perhaps more importantly, we'll need to look at the performance of the
> pg stat updates on the mon.  There is a refactor due in that code that
> should improve life, but it's slated for dumpling.

OK, here's some results, with all debugging at 0, using current next...

My testing is for 1 mon + 576 OSDs, 24/host.  All my storage cluster
hosts use 10 GbE NICs now.  The mon host uses an SSD for the mon data
store.

My test procedure is to start 'ceph -w', start all the OSDs, and once
they're all running, start the mon.  I report the time from starting
the mon to all PGs active+clean.

  # PGs    osd mon ack    startup    notes
           timeout        time
  -------  ------------   --------   -----
  55392    default        >30:00     1
  55392    300             18:36     2
  55392     60            >30:00     3
  55392    150            >30:00     4
  55392    240            >30:00     5
  55392    300            >30:00     2,6

notes:
1) lots of PGs marked stale, OSDs wrongly marked down before I gave up
   on this case
2) OSDs report lots of slow requests for "pg_notify(...) v4 currently
   wait for new map"
3) some OSDs wrongly marked down, OSDs report some slow requests for
   "pg_notify(...) v4 currently wait for new map" before I gave up on
   this case
4) appeared to be making progress; then an OSD was marked out at ~21
   minutes; many more marked out before I gave up on this case
5) some OSD reports of slow requests for "pg_notify", some OSDs wrongly
   marked down, appeared to be making progress, then stalled; then I
   gave up on this case
6) retried this case; appeared to be making progress, but after ~18 min
   stalled at 19701 active+clean, 35691 peering until I gave up

Hmmm, I didn't really expect the above results.  I ran out of time
before attempting an even longer osd mon ack timeout.  But either we're
on the wrong trail, or 300 is not sufficient.  Or, I'm doing something
wrong and haven't yet figured out what it is.

FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs, and my
memory is that a new filesystem started up in ~5 minutes.  For that
testing I had to increase 'paxos propose interval' to two or three
seconds to keep the monitor writeout rate (as measured by vmstat) down
to a sustained 50-70 MB/s during start-up.  That was with a 1 GbE NIC
in the mon; the reason I upgraded it was that a filesystem with 512K
PGs was taking too long to start, and I thought the mon might be
network-limited, since it had an SSD for the mon data store.

For the testing above I used the default 'paxos propose interval'.  I
don't know if it matters, but vmstat sees only a little data being
written on the mon system.
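In case the procedure above is ambiguous, each run looks roughly like
the sketch below.  The hostnames, the pdsh invocation, and the sysvinit
commands are only illustrative (my actual start-up scripting differs);
the per-run timeout override is whatever the table's "osd mon ack
timeout" column says for that run.

  # 0. Per-run ceph.conf override for the timeout column above,
  #    e.g. for run 2 (section placement per your own conventions):
  #      [osd]
  #          osd mon ack timeout = 300

  # 1. Watch cluster state; the clock starts when the mon starts below.
  ceph -w

  # 2. Start all 24 OSDs on each of the 24 storage hosts
  #    (illustrative hostnames; any parallel-shell tool works).
  pdsh -w storage[01-24] 'service ceph start osd'

  # 3. Once every OSD is running, start the mon; I record the time from
  #    here until 'ceph -w' reports all 55392 PGs active+clean.
  service ceph start mon

  # 4. On the mon host, watch the block writeout rate ('bo' column).
  vmstat 1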
-- Jim

>
> sage
>
> > -- Jim
> >
> > > -Greg
> > > Software Engineer #42 @ http://inktank.com | http://ceph.com