On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt <jaschut@xxxxxxxxxx> wrote:
> On 04/03/2013 12:25 PM, Sage Weil wrote:
>>>> Sorry, guess I forgot some of the history since this piece at least is
>>>> resolved now. I'm surprised if 30-second timeouts are causing issues
>>>> without those overloads you were seeing; have you seen this issue
>>>> without your high debugging levels and without the bad PG commits (due
>>>> to debugging)?
>>>
>>> I think so, because that's why I started with higher debugging
>>> levels.
>>>
>>> But, as it turns out, I'm just in the process of returning to my
>>> testing of next, with all my debugging back to 0. So, I'll try
>>> the default timeout of 30 seconds first. If I have trouble starting
>>> up a new file system, I'll turn up the timeout and try again, without
>>> any extra debugging. Either way, I'll let you know what happens.
>>
>> I would be curious to hear roughly what value between 30 and 300 is
>> sufficient, if you can experiment just a bit. We probably want to adjust
>> the default.
>>
>> Perhaps more importantly, we'll need to look at the performance of the pg
>> stat updates on the mon. There is a refactor due in that code that should
>> improve life, but it's slated for dumpling.
>
> OK, here's some results, with all debugging at 0, using current next...
>
> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
> use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
> My test procedure is to start 'ceph -w', start all the OSDs, and once
> they're all running start the mon. I report the time from starting
> the mon to all PGs active+clean.
>
>  # PGs   osd mon ack   startup   notes
>            timeout       time
> -------  -----------   -------   -----
>  55392    default       >30:00     1
>  55392    300            18:36     2
>  55392     60           >30:00     3
>  55392    150           >30:00     4
>  55392    240           >30:00     5
>  55392    300           >30:00     2,6
>
> notes:
> 1) lots of PGs marked stale, OSDs wrongly marked down
>    before I gave up on this case
> 2) OSDs report lots of slow requests for "pg_notify(...) v4
>    currently wait for new map"
> 3) some OSDs wrongly marked down, OSDs report some slow requests
>    for "pg_notify(...) v4 currently wait for new map"
>    before I gave up on this case
> 4) appeared to be making progress; then an OSD was marked
>    out at ~21 minutes; many more marked out before I
>    gave up on this case
> 5) some OSD reports of slow requests for "pg_notify",
>    some OSDs wrongly marked down, appeared to be making
>    progress, then stalled; then I gave up on this case
> 6) retried this case, appeared to be making progress, but
>    after ~18 min stalled at 19701 active+clean, 35691 peering
>    until I gave up
>
> Hmmm, I didn't really expect the above results. I ran out of
> time before attempting an even longer osd mon ack timeout.
>
> But either we're on the wrong trail, or 300 is not sufficient.
> Or, I'm doing something wrong and haven't yet figured out what
> it is.
>
> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs,
> and my memory is a new filesystem started up in ~5 minutes. For
> that testing I had to increase 'paxos propose interval' to two
> or three seconds to keep the monitor writeout rate (as measured
> by vmstat) down to a sustained 50-70 MB/s during start-up.
>
> That was with a 1 GbE NIC in the mon; the reason I upgraded
> it was a filesystem with 512K PGs was taking too long to start,
> and I thought the mon might be network-limited since it had an
> SSD for the mon data store.
>
> For the testing above I used the default 'paxos propose interval'.
> I don't know if it matters, but vmstat sees only a little data
> being written on the mon system.

That's odd; I'd actually expect to see more going to disk with v0.59 than
previously. Is vmstat actually looking at disk IO, or might it be missing
DirectIO or something? (Not that I remember whether LevelDB uses those.)

However, I think you might want to increase your paxos propose interval to
where it was before: your OSDs are having trouble keeping up with the number
of maps being generated, judging by the many pg_notify requests stuck waiting
for newer maps.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
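P.S. In case it helps anyone reproducing this, a rough sketch of the ceph.conf
stanzas being discussed might look like the following. The option names are the
two mentioned in this thread; the values are just the ones from Jim's runs
(the 300 s ack timeout from the table, the 2-3 s propose interval he used on
v0.57/v0.58), not tested recommendations:

    [osd]
        # raised from the 30 s default while the mon works through pg stats
        osd mon ack timeout = 300

    [mon]
        # back to the 2-3 s used earlier, rather than the default
        paxos propose interval = 3

Either setting needs the corresponding daemons restarted (or the values
injected at runtime) before it takes effect.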