On Tue, 10 Apr 2012, Jim Schutt wrote:
> On 04/10/2012 10:39 AM, Sage Weil wrote:
> > On Tue, 10 Apr 2012, Jim Schutt wrote:
> > > On 03/09/2012 04:21 PM, Jim Schutt wrote:
> > > > On 03/09/2012 12:39 PM, Jim Schutt wrote:
> > > > > On 03/08/2012 05:26 PM, Sage Weil wrote:
> > > > > > On Thu, 8 Mar 2012, Jim Schutt wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I've been trying to scale up a Ceph filesystem to as big
> > > > > > > as I have hardware for - up to 288 OSDs right now.
> > > > > > >
> > > > > > > (I'm using commit ed0f605365e - tip of master branch from
> > > > > > > a few days ago.)
> > > > > > >
> > > > > > > My problem is that I cannot get a 288 OSD filesystem to go
> > > > > > > active (that's with 1 mon and 1 MDS). Pretty quickly I start
> > > > > > > seeing "mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
> > > > > > > Note that as this is happening all the OSDs and the MDS are
> > > > > > > essentially idle; only the mon is busy.
> > > > > > >
> > > > > > > While tailing the mon log I noticed there was a periodic pause;
> > > > > > > after adding a little more debug printing, I learned that the
> > > > > > > pause was due to encoding pg_stat_t before writing the pg_map
> > > > > > > to disk.
> > > > > > >
> > > > > > > Here's the result of a scaling study I did on startup time for
> > > > > > > a freshly created filesystem. I normally run 24 OSDs/server on
> > > > > > > these machines with no trouble, for small numbers of OSDs.
> > > > > > [snip]
> > > > >
> > > > I recompiled with -g -O2, and got this:
> > > >
> > > >   OSDs   size of        pg_stat_t
> > > >          pgmap/latest   encode time
> > > >
> > > >    48     2976461       0.052731
> > > >    72     4472477       0.107187
> > > >    96     5969477       0.194690
> > > >   120     7466021       0.311586
> > > >   144     8963141       0.465111
> > > >   168    10460317       0.680222
> > > >   192    11956709       0.713398
> > > >   240    14950437       1.159426
> > > >   288    17944413       1.714004
> > > >
> > > > It seems that encoding time still isn't proportional to the
> > > > size of pgmap/latest. However, things have improved enough
> > > > that my 288 OSD filesystem goes active pretty quickly (~90 sec),
> > > > so I can continue testing at that scale.
> >
> > A fix for this was just merged into master last night.
>
> That's great!
>
> > > I'm still having trouble at 288 OSDs under heavy write load
> > > (166 linux clients running dd simultaneously). I'm currently
> > > running with master branch from last week - commit e792cd938897.
> > >
> > > The symptom is that the cluster cycles between "up:active"
> > > and "up:active(laggy or crashed)". When the cluster goes into
> > > "laggy or crashed" the client caps go stale, and cluster throughput
> > > (as monitored by vmstat on OSD servers) trails off to zero. After a
> > > short idle period, the cluster goes back "up:active", clients
> > > renew their caps, and cluster throughput goes back to its maximum
> > > until the next cycle starts.
> > >
> > > I believe this is a scaling issue because when I use pg_bits = 5
> > > and pgp_bits = 5 (instead of the default 6) to build the filesystem,
> > > I can write >20 TB using the same test, with no instances of the
> > > cluster going "laggy or crashed". Perhaps it is related to
> > > the encoding time for pg_stat_t that I reported above?
> >
> > Yeah, that sounds like the culprit to me. Can you try with the latest
> > master?
>
> I'm 24 TB into a 25 TB run with pg_bits = 6, and the cluster
> has been up:active the whole time. Seems fixed to me!

Hooray!
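As a quick sanity check on the table above (a standalone Python sketch, not anything from the Ceph tree): if encoding were linear in the size of pgmap/latest, the time per byte would stay roughly flat, but it grows about fivefold between 48 and 288 OSDs, which is what "still isn't proportional" means here.

# Encode-time figures copied from the table above: (OSDs, bytes, seconds).
rows = [
    (48, 2976461, 0.052731),
    (72, 4472477, 0.107187),
    (96, 5969477, 0.194690),
    (120, 7466021, 0.311586),
    (144, 8963141, 0.465111),
    (168, 10460317, 0.680222),
    (192, 11956709, 0.713398),
    (240, 14950437, 1.159426),
    (288, 17944413, 1.714004),
]

for osds, size, secs in rows:
    # If encoding were linear in map size, ns/byte would stay constant;
    # instead it climbs from roughly 18 ns/byte at 48 OSDs to about
    # 96 ns/byte at 288 OSDs.
    print(f"{osds:4d} OSDs: {secs / size * 1e9:5.1f} ns/byte")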
> <checks running ceph -w output>
>
> Interesting. One of my servers just died.
>
> FWIW, when all 24 OSDs on that server dropped out, the cluster went
> "up:active(laggy or crashed)" for about 8 minutes while recovery started,
> then went back "up:active" while recovery is finishing.
>
> I thought you'd be interested to know - is that behavior unremarkable?
>
> FWIW, here's what happened on that server:
>
> [510342.517157] ------------[ cut here ]------------
> [510342.521855] kernel BUG at fs/btrfs/extent_io.c:3982!
> [510342.526894] invalid opcode: 0000 [#1] SMP
> [510342.531102] CPU 4
> [510342.533028] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm
> ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror
> dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap
> macvlan tun kvm uinput sg sd_mod joydev ata_piix libata button microcode
> mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_mad
> ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support
> ehci_hcd uhci_hcd ioatdma dm_mod i7core_edac edac_core nfs nfs_acl auth_rpcgss
> fscache lockd sunrpc tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
> [510342.587836]
> [510342.589412] Pid: 16609, comm: kworker/4:2 Not tainted 3.3.1-00162-gd8b2857 #15 Supermicro X8DTH-i/6/iF/6F/X8DTH
> [510342.599601] RIP: 0010:[<ffffffffa057924c>] [<ffffffffa057924c>] btrfs_release_extent_buffer_page.clone.0+0x2c/0x130 [btrfs]
> [510342.610893] RSP: 0018:ffff88015fb6ba10 EFLAGS: 00010202
> [510342.616277] RAX: 0000000000000004 RBX: ffff880ab81865a0 RCX: ffff880174bc0230
> [510342.623476] RDX: ffff8801335bf9b1 RSI: 00000000000d0fb8 RDI: ffff880ab81865a0
> [510342.630675] RBP: ffff88015fb6ba40 R08: 0000000000000038 R09: 0000000000000003
> [510342.637874] R10: 0000000000000008 R11: ffff8804658c9e40 R12: ffff88015fb6a000
> [510342.645069] R13: ffff880ab81865a0 R14: 000000000000000e R15: ffff88015fb6bc10
> [510342.652268] FS: 0000000000000000(0000) GS:ffff880627c80000(0000) knlGS:0000000000000000
> [510342.660418] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [510342.666234] CR2: ffffffffff600400 CR3: 0000000001a05000 CR4: 00000000000006e0
> [510342.673427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [510342.680627] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [510342.687827] Process kworker/4:2 (pid: 16609, threadinfo ffff88015fb6a000, task ffff880102ca4410)
> [510342.696669] Stack:
> [510342.698769]  ffff880100000000 ffff880ab81865a0 ffff88015fb6a000 ffff8806057d2eb0
> [510342.706297]  000000000000000e ffff88015fb6bc10 ffff88015fb6ba70 ffffffffa05793f2
> [510342.713825]  ffff88015fb6bb80 ffff880ab81865a0 ffff88015fb6bb50 0000000000000008
> [510342.721362] Call Trace:
> [510342.723912]  [<ffffffffa05793f2>] release_extent_buffer+0xa2/0xe0 [btrfs]
> [510342.730790]  [<ffffffffa05795b4>] free_extent_buffer+0x34/0x80 [btrfs]
> [510342.737407]  [<ffffffffa057a126>] btree_write_cache_pages+0x246/0x410 [btrfs]
> [510342.744637]  [<ffffffffa054e96a>] btree_writepages+0x3a/0x50 [btrfs]
> [510342.751060]  [<ffffffff810fc421>] do_writepages+0x21/0x40
> [510342.756537]  [<ffffffff810f0b0b>] __filemap_fdatawrite_range+0x5b/0x60
> [510342.763136]  [<ffffffff810f0de3>] filemap_fdatawrite_range+0x13/0x20
> [510342.769568]  [<ffffffffa0554ecf>] btrfs_write_marked_extents+0x7f/0xe0 [btrfs]
> [510342.776867]  [<ffffffffa0554f5e>] btrfs_write_and_wait_marked_extents+0x2e/0x60 [btrfs]
> [510342.784951]  [<ffffffffa0554fbb>] btrfs_write_and_wait_transaction+0x2b/0x50 [btrfs]
> [510342.792768]  [<ffffffffa055604c>] btrfs_commit_transaction+0x7ac/0xa10 [btrfs]
> [510342.800060]  [<ffffffff81079540>] ? set_next_entity+0x90/0xa0
> [510342.805875]  [<ffffffff8105f5d0>] ? wake_up_bit+0x40/0x40
> [510342.811365]  [<ffffffffa0556590>] ? btrfs_end_transaction+0x20/0x20 [btrfs]
> [510342.818403]  [<ffffffffa05565af>] do_async_commit+0x1f/0x30 [btrfs]
> [510342.824748]  [<ffffffffa0556590>] ? btrfs_end_transaction+0x20/0x20 [btrfs]
> [510342.831774]  [<ffffffff81058680>] process_one_work+0x140/0x490
> [510342.837673]  [<ffffffff8105a417>] worker_thread+0x187/0x3f0
> [510342.843319]  [<ffffffff8105a290>] ? manage_workers+0x120/0x120
> [510342.849225]  [<ffffffff8105f02e>] kthread+0x9e/0xb0
> [510342.854176]  [<ffffffff81486c64>] kernel_thread_helper+0x4/0x10
> [510342.860168]  [<ffffffff8147d84a>] ? retint_restore_args+0xe/0xe
> [510342.866161]  [<ffffffff8105ef90>] ? kthread_freezable_should_stop+0x80/0x80
> [510342.873198]  [<ffffffff81486c60>] ? gs_change+0xb/0xb
> [510342.878322] Code: 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 66 66 66 66 90 8b 47 38 49 89 fd 85 c0 75 0c 48 8b 47 20 4c 8d 7f 20 84 c0 79 04 <0f> 0b eb fe 48 8b 47 20 a8 04 75 f4 48 8b 07 49 89 c4 4c 03 67
> [510342.898331] RIP [<ffffffffa057924c>] btrfs_release_extent_buffer_page.clone.0+0x2c/0x130 [btrfs]
> [510342.907294] RSP <ffff88015fb6ba10>
> [510342.911241] ---[ end trace 62013c6b6e2e5135 ]---
>
> I'm running 3.3.1 + Chris Mason's btrfs for-linus branch (commit bc3f116fec194f,
> just before merge into 3.4-rc1). This kernel also has the ceph for-next branch
> (commit c666601a935b, just before merge into 3.4-rc1), but that shouldn't
> matter for this.
>
> > > The problem with using pg_bits = 5 is that the data distribution
> > > is not particularly even; after writing 20 TB to 288 OSDs I see
> > > (max OSD use)/(min OSD use) = ~2. Even with pg_bits = 6 after
> > > writing 20 TB I see (max OSD use)/(min OSD use) = ~1.5.
> > > I think I'd like that variability to be even smaller.
> >
> > There is some infrastructure in the monitor to correct for the statistical
> > imbalance, but it isn't triggered automatically yet. It's probably time
> > to look at that.
>
> FWIW I've noticed while learning about how Ceph hashes objects into
> PGs and onto OSDs that Bob Jenkins has a hash function newer than
> what Ceph is using; see http://burtleburtle.net/bob/c/lookup3.c.
>
> I haven't been able to find any comparisons of the older hash
> to this newer one, wrt. uniformity; I wonder if it would be
> more suitable, and thus reduce the imbalance without the need
> for correction?

It's possible. We plan to have a hard look at CRUSH and do some
improvements in a month or two. This will be part of that.

> > > AFAICS I'm getting 3 pools of (n_OSDs << pg_bits) PGs, one pool
> > > for each of CEPH_DATA_RULE, CEPH_METADATA_RULE, and
> > > CEPH_RBD_RULE. So, for 288 OSDs I get 3*(288<<6) = 55296 PGs,
> > > plus a few thousand more for the localized PGs.
> > >
> > > I can't seem to find any use of CEPH_RBD_RULE in the code, other
> > > than to create that pool. What am I missing? I'd like to just
> > > not create that pool to reduce my PG count - what problems might
> > > that cause?
> >
> > None. We create the rbd pool by default but it isn't used by the
> > filesystem; it's just the default pool used by the 'rbd' command line
> > tool.
>
> OK, thanks for verifying.
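For what it's worth, the max/min spreads quoted above are roughly what purely random placement would produce at these PG-per-OSD counts. The toy model below (plain Python, assigning PG replicas to OSDs uniformly at random rather than via CRUSH, and assuming 2 replicas and equal-sized PGs) shows the spread shrinking as pg_bits grows:

import random

def max_min_ratio(n_osds, pg_bits, replicas=2, seed=1):
    # Assign (n_osds << pg_bits) PGs, each with `replicas` copies, to OSDs
    # uniformly at random, then compare the fullest and emptiest OSDs.
    rng = random.Random(seed)
    per_osd = [0] * n_osds
    for _ in range((n_osds << pg_bits) * replicas):
        per_osd[rng.randrange(n_osds)] += 1
    return max(per_osd) / min(per_osd)

for bits in (5, 6, 7):
    print(f"pg_bits = {bits}: max/min ~= {max_min_ratio(288, bits):.2f}")

If the existing hash is already mixing well, a newer hash like lookup3 probably can't push the spread much below this statistical floor; only more PGs per OSD, or the corrective rebalancing mentioned above, will.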
> > > Also, what would be the downside if I tried to not create the
> > > CEPH_METADATA_RULE pool, and just put everything into the
> > > CEPH_DATA_RULE pool? That way I could run with just one pool.
> >
> > You could do that too. The idea was that people might want a different
> > replication level or placement for metadata (faster nodes, more replicas,
> > whatever).
>
> Again, thanks for verifying.
>
> > But.. try with master first, as the PG scaling issue needs fixing
> > regardless, is hopefully fixed now, and will probably make all of this
> > moot... :)
>
> I'll keep those tricks in mind if I ever get enough hardware
> that the number of PGs becomes an issue again.
>
> > > In the longer run, can anything be done to keep the monitor
> > > daemon responsive when running with thousands of OSDs under a
> > > heavy write load?
> >
> > Right now the monitor is being used to aggregate usage information, which
> > is probably not the best use of its time. I don't expect it will become a
> > real problem for a while, though (as long as we avoid bugs like this one).
>
> Hmm, if I did a few runs with bigger pg_bits values, that would give
> some idea of how far away that time is, right?

Yeah. Just watching ceph-mon CPU and memory utilization will give you a
good idea.. it should scale linearly with the size of the cluster. I
suspect we can get pretty far by throwing hardware at it, but I'm not
sure how close we are to it being problematic.

> Thanks for the timely bug fix! I really appreciate it.

And thanks for testing!
sage

> -- Jim
>
> > sage
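On gauging monitor headroom with bigger pg_bits runs: the usage information being aggregated is per-PG, so the total PG count is the relevant knob, and the arithmetic is quick to sketch (assuming the default three pools and ignoring the localized PGs mentioned earlier):

def total_pgs(n_osds, pg_bits, n_pools=3):
    # Each pool gets n_osds << pg_bits placement groups.
    return n_pools * (n_osds << pg_bits)

for bits in (5, 6, 7, 8):
    print(f"288 OSDs, pg_bits = {bits}: {total_pgs(288, bits)} PGs")

pg_bits = 6 reproduces the 3*(288<<6) = 55296 figure from earlier; every extra bit doubles the PG count, and dropping the rbd pool trims it by a third.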