Re: Trouble getting a new file system to start, for v0.59 and newer

On 04/03/2013 04:51 PM, Gregory Farnum wrote:
> On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt <jaschut@xxxxxxxxxx> wrote:
>> On 04/03/2013 12:25 PM, Sage Weil wrote:
>>>>>>> Sorry, guess I forgot some of the history since this piece at least is
>>>>>>> resolved now. I'm surprised if 30-second timeouts are causing issues
>>>>>>> without those overloads you were seeing; have you seen this issue
>>>>>>> without your high debugging levels and without the bad PG commits (due
>>>>>>> to debugging)?
>>>>>
>>>>> I think so, because that's why I started with higher debugging
>>>>> levels.
>>>>>
>>>>> But, as it turns out, I'm just in the process of returning to my
>>>>> testing of next, with all my debugging back to 0.  So, I'll try
>>>>> the default timeout of 30 seconds first.  If I have trouble starting
>>>>> up a new file system, I'll turn up the timeout and try again, without
>>>>> any extra debugging.  Either way, I'll let you know what happens.
>>> I would be curious to hear roughly what value between 30 and 300 is
>>> sufficient, if you can experiment just a bit.  We probably want to adjust
>>> the default.
>>>
>>> Perhaps more importantly, we'll need to look at the performance of the pg
>>> stat updates on the mon.  There is a refactor due in that code that should
>>> improve life, but it's slated for dumpling.
>>
>> OK, here's some results, with all debugging at 0, using current next...
>>
>> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
>> use 10 GbE NICs now.  The mon host uses an SSD for the mon data store.
>> My test procedure is to start 'ceph -w', start all the OSDs, and once
>> they're all running start the mon.  I report the time from starting
>> the mon to all PGs active+clean.
>>
>> # PGs     osd mon ack    startup        notes
>>           timeout (sec)  time (mm:ss)
>> -------  --------------  -------------  -----
>>  55392      default      >30:00       1
>>  55392        300         18:36       2
>>  55392         60        >30:00       3
>>  55392        150        >30:00       4
>>  55392        240        >30:00       5
>>  55392        300        >30:00       2,6
>>
>> notes:
>> 1) lots of PGs marked stale, OSDs wrongly marked down
>>      before I gave up on this case
>> 2) OSDs report lots of slow requests for "pg_notify(...) v4
>>      currently wait for new map"
>> 3) some OSDs wrongly marked down, OSDs report some slow requests
>>      for "pg_notify(...) v4 currently wait for new map"
>>      before I gave up on this case
>> 4) appeared to be making progress; then an OSD was marked
>>      out at ~21 minutes; many more marked out before I
>>      gave up on this case
>> 5) some OSD reports of slow requests for "pg_notify",
>>      some OSDs wrongly marked down, appeared to be making
>>      progress, then stalled; then I gave up on this case
>> 6) retried this case, appeared to be making progress, but
>>      after ~18 min stalled at 19701 active+clean, 35691 peering
>>      until I gave up
>>
>> Hmmm, I didn't really expect the above results.  I ran out of
>> time before attempting an even longer osd mon ack timeout.
>>
>> But either we're on the wrong trail, or 300 is not sufficient.
>> Or, I'm doing something wrong and haven't yet figured out what
>> it is.
>>
>> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs,
>> and my memory is that a new filesystem started up in ~5 minutes.  For
>> that testing I had to increase 'paxos propose interval' to two
>> or three seconds to keep the monitor writeout rate (as measured
>> by vmstat) down to a sustained 50-70 MB/s during start-up.
>>
>> That was with a 1 GbE NIC in the mon; the reason I upgraded
>> it was a filesystem with 512K PGs was taking too long to start,
>> and I thought the mon might be network-limited since it had an
>> SSD for the mon data store.
>>
>> For the testing above I used the default 'paxos propose interval'.
>> I don't know if it matters, but vmstat sees only a little data
>> being written on the mon system.
> 
> That's odd; I'd actually expect to see more going to disk with v0.59
> than previously. Is vmstat actually looking at disk IO, or might it be
> missing DirectIO or something? (Not that I remember if LevelDB is
> using those.)

FWIW, 'dd oflag=direct' shows up in vmstat.  But, I don't know if
that is relevant to what LevelDB might be doing...
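
In case it's useful, here is roughly the check I did; the output path
and sizes below are only placeholders for wherever the mon data store
happens to be mounted:

  # terminal 1: watch block I/O; the 'bo' column is blocks written
  # to the block device per second
  vmstat 1

  # terminal 2: write to the mon's data disk with O_DIRECT
  # (/srv/mon/ddtest is only an example path)
  dd if=/dev/zero of=/srv/mon/ddtest oflag=direct bs=4M count=256

The direct writes from dd do register in the 'bo' column within a
second or two of starting, so at least vmstat isn't blind to O_DIRECT
traffic on this box.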

> However, I think you might want to increase your paxos propose
> interval to where it was before — your OSDs are having trouble keeping
> up with the number of maps that are being generated, based on the fact
> that you have a lot of pg notifies stuck waiting for newer maps.

OK, I'll try that.  But to clarify, in the past the default paxos
propose interval was good up to 128K PGs, or so.
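
For reference, here's the sort of ceph.conf change I have in mind.
It's only a sketch; exact section placement may differ, and the values
are just what I plan to try next, not a recommendation:

  [mon]
      # propose less often, so fewer map updates are generated for
      # the OSDs to chew through (I used 2-3 sec on v0.57/v0.58)
      paxos propose interval = 2

  [osd]
      # how long an OSD waits for the mon to ack pg stat updates
      # before resending; default is 30 sec, and the runs in the
      # table above went as high as 300
      osd mon ack timeout = 300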

Thanks -- Jim

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 





