On Thu, Jul 12, 2012 at 09:52:27AM +1000, Dave Chinner wrote:
> On Wed, Jul 11, 2012 at 05:48:22PM -0500, Ben Myers wrote:
> > On Thu, Jul 12, 2012 at 07:40:41AM +1000, Dave Chinner wrote:
> > > These two patches need to go to Linus before 3.5 is released. The
> > > first fixes the btree cursor leak properly, and the second fixes a
> > > significant performance regression reported by Mel Gorman. Can you
> > > review them and send them onwards to Linus?
> >
> > I have some test exposure on the first one, but not on the second. I'd rather
> > have a known performance regression in 3.5 than start popping stacks again this
> > late toward the end of the rc cycle. I suggest that #2 go in for the 3.6 merge
> > window and back to 3.5 via -stable. Any other opinions?
>
> I'd prefer not to have to go via a -stable kernel. It's a pain to
> have to keep track and maintain multiple -stable kernels...

I tend to agree.

> As it is, I've never once seen a popped stack through a metadata
> allocation triggered path. The worst I've seen in the past few days
> is this: a bmbt block allocation in the sync write path:
>
>         Depth    Size   Location    (45 entries)
>         -----    ----   --------
>   0)     5880      40   zone_statistics+0xbd/0xc0
>   1)     5840     272   get_page_from_freelist+0x3db/0x870
>   2)     5568     304   __alloc_pages_nodemask+0x192/0x840
>   3)     5264      80   alloc_pages_current+0xb0/0x120
>   4)     5184      96   xfs_buf_allocate_memory+0xfc/0x270
>   5)     5088      80   xfs_buf_get_map+0x15d/0x1b0
>   6)     5008      64   xfs_buf_read_map+0x33/0x130
>   7)     4944      48   xfs_buf_readahead_map+0x51/0x70
>   8)     4896      64   xfs_btree_reada_bufs+0xa3/0xc0
>   9)     4832      48   xfs_btree_readahead_sblock+0x7e/0xa0
>  10)     4784      16   xfs_btree_readahead+0x5f/0x80
>  11)     4768      96   xfs_btree_increment+0x52/0x360
>  12)     4672     208   xfs_btree_rshift+0x6bf/0x730
>  13)     4464     288   xfs_btree_delrec+0x1308/0x14d0
>  14)     4176      64   xfs_btree_delete+0x41/0xe0
>  15)     4112     112   xfs_alloc_fixup_trees+0x21b/0x580
>  16)     4000     192   xfs_alloc_ag_vextent_near+0xb05/0xcd0
>  17)     3808      32   xfs_alloc_ag_vextent+0x228/0x2a0
>  18)     3776     112   __xfs_alloc_vextent+0x4b7/0x910
>  19)     3664      80   xfs_alloc_vextent+0xa5/0xb0
>  20)     3584     224   xfs_bmbt_alloc_block+0xd1/0x240
>  21)     3360     240   xfs_btree_split+0xe2/0x7e0
>  22)     3120      96   xfs_btree_make_block_unfull+0xde/0x1d0
>  23)     3024     240   xfs_btree_insrec+0x587/0x8e0
>  24)     2784     112   xfs_btree_insert+0x6b/0x1d0
>  25)     2672     256   xfs_bmap_add_extent_delay_real+0x1fd4/0x2210
>  26)     2416      80   xfs_bmapi_allocate+0x290/0x350
>  27)     2336     368   xfs_bmapi_write+0x66e/0xa60
>  28)     1968     176   xfs_iomap_write_allocate+0x131/0x350
>  29)     1792     112   xfs_map_blocks+0x302/0x390
>  30)     1680     192   xfs_vm_writepage+0x19b/0x5a0
>  31)     1488      32   __writepage+0x1a/0x50
>  32)     1456     304   write_cache_pages+0x258/0x4d0
>  33)     1152      96   generic_writepages+0x4d/0x70
>  34)     1056      48   xfs_vm_writepages+0x4d/0x60
>  35)     1008      16   do_writepages+0x21/0x50
>  36)      992      64   __filemap_fdatawrite_range+0x59/0x60
>  37)      928      16   filemap_fdatawrite_range+0x13/0x20
>  38)      912      80   xfs_flush_pages+0x75/0xb0
>  39)      832     176   xfs_file_buffered_aio_write+0xdf/0x1e0
>  40)      656     128   xfs_file_aio_write+0x157/0x1d0
>  41)      528     272   do_sync_write+0xde/0x120
>  42)      256      48   vfs_write+0xa8/0x160
>  43)      208      80   sys_write+0x4a/0x90
>  44)      128     128   system_call_fastpath+0x16/0x1b

Nice stack!

> This is not quite the insanely complex, worst case double tree split
> path (i.e. a bmbt split followed by an allocbt split), but it's still
> very close to the worst case stack usage I've seen above
> xfs_bmapi_write() (i.e. >3.5k of stack from xfs_vm_writepage() to
> the last XFS function in the stack).
>
> In theory, the directory code could trigger such a stack, and it
> calls xfs_bmapi_write() from roughly the same stack depth. It would
> take quite a large directory to be causing splits in the bmbt tree,
> so it's a relatively rare occurrence....
>
> Also, while this is arguably a metadata allocation here, this is in
> the data writeback path so perhaps the xfs_bmbt_alloc_block() call
> needs to trigger allocation via a workqueue here. That would make
> the worst case bmbt split usage (even for directory allocation)
> switch stacks. It's rare enough that it shouldn't introduce
> performance issues....
>
> What is notable here, however, is that none of the directory or
> inode allocation paths consumed more stack than this sync write
> case. That indicates that my stated assumption about inode and
> directory allocation does hold some water....
>
> > I'll feel better about it after some testing, so I'll get tests running asap.
> > What testing have you and Mel done?
>
> Mel reran his performance tests that were showing regressions, and
> those regressions went away, and I've rerun all my perf tests and
> xfstests for the past couple of days and not seen any changes in
> performance or stack usage.

Sounds ok. Wednesday night's testing didn't go well due to a hardware issue
and a PEBKAC, but last night's was fine. Based on that smoke test and what
you've described, I am more comfortable with this. I really don't want to
break something at the last minute.

Reviewed-by: Ben Myers <bpm@xxxxxxx>
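
For reference, the "trigger allocation via a workqueue" idea quoted above is
the usual kernel pattern for switching stacks: package the arguments for the
deep call, queue a work item to a workqueue, and wait on a completion while a
worker thread (which starts with a nearly empty stack) does the actual
allocation. The sketch below illustrates that general pattern only; all names
(deep_alloc_args, deep_alloc_worker, deep_alloc_deferred) are hypothetical and
this is not the actual XFS change under discussion.

/*
 * Minimal sketch of deferring a deep allocation path to a workqueue so it
 * runs on a worker's fresh stack.  Hypothetical names throughout.
 */
#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

struct deep_alloc_args {
	struct work_struct	work;		/* lives on the caller's stack */
	struct completion	done;		/* signalled by the worker */
	void			*private;	/* stand-in for cursor/alloc args */
	int			result;		/* return value from the deep path */
};

/* Runs on a kworker thread, which starts with an almost-empty stack. */
static void deep_alloc_worker(struct work_struct *work)
{
	struct deep_alloc_args *args =
		container_of(work, struct deep_alloc_args, work);

	/* ... do the deep allocation / btree split work here ... */
	args->result = 0;

	complete(&args->done);
}

/*
 * Called from the deep context (e.g. writeback).  Only the small args
 * structure is consumed on this stack; the heavy call chain runs on the
 * worker's stack, and the completion keeps the call synchronous.
 */
static int deep_alloc_deferred(struct workqueue_struct *wq, void *private)
{
	struct deep_alloc_args args = {
		.private = private,
	};

	init_completion(&args.done);
	INIT_WORK_ONSTACK(&args.work, deep_alloc_worker);
	queue_work(wq, &args.work);
	wait_for_completion(&args.done);

	return args.result;
}

Since the caller blocks until the worker signals the completion, a dedicated
workqueue (rather than a shared system workqueue) would likely be needed so
the wait cannot be held up behind unrelated work queued from the same
writeback context.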