On 09/07/2013 08:31 AM, Brian Foster wrote: > On 09/06/2013 05:35 PM, Dave Chinner wrote: >> On Thu, Sep 05, 2013 at 05:17:10PM -0400, Michael L. Semon wrote: >> .... >>> [ 814.376620] XFS (sdb4): Mounting Filesystem >>> [ 815.050778] XFS (sdb4): Ending clean mount >>> [ 823.169368] >>> [ 823.170932] ====================================================== >>> [ 823.172146] [ INFO: possible circular locking dependency detected ] >>> [ 823.172146] 3.11.0+ #5 Not tainted >>> [ 823.172146] ------------------------------------------------------- >>> [ 823.172146] dirstress/5276 is trying to acquire lock: >>> [ 823.172146] (sb_internal){.+.+.+}, at: [<c11c5e60>] xfs_trans_alloc+0x1f/0x35 >>> [ 823.172146] >>> [ 823.172146] but task is already holding lock: >>> [ 823.172146] (&(&ip->i_lock)->mr_lock){+++++.}, at: [<c1206cfb>] xfs_ilock+0x100/0x1f1 >>> [ 823.172146] >>> [ 823.172146] which lock already depends on the new lock. >>> [ 823.172146] >>> [ 823.172146] >>> [ 823.172146] the existing dependency chain (in reverse order) is: >>> [ 823.172146] >>> [ 823.172146] -> #1 (&(&ip->i_lock)->mr_lock){+++++.}: >>> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11 >>> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e >>> [ 823.172146] [<c14bff98>] _raw_spin_lock+0x47/0x74 >>> [ 823.172146] [<c1116247>] __mark_inode_dirty+0x171/0x38c >>> [ 823.172146] [<c111acab>] __set_page_dirty+0x5f/0x95 >>> [ 823.172146] [<c111b93e>] mark_buffer_dirty+0x58/0x12b >>> [ 823.172146] [<c111baff>] __block_commit_write.isra.17+0x64/0x82 >>> [ 823.172146] [<c111c197>] block_write_end+0x2b/0x53 >>> [ 823.172146] [<c111c201>] generic_write_end+0x42/0x9a >>> [ 823.172146] [<c11a42d5>] xfs_vm_write_end+0x60/0xbe >>> [ 823.172146] [<c10bd47a>] generic_file_buffered_write+0x140/0x20f >>> [ 823.172146] [<c11b2347>] xfs_file_buffered_aio_write+0x10b/0x205 >>> [ 823.172146] [<c11b24ee>] xfs_file_aio_write+0xad/0xec >>> [ 823.172146] [<c10f0c5f>] do_sync_write+0x60/0x87 >>> [ 823.172146] [<c10f0e0f>] vfs_write+0x9c/0x189 >>> [ 823.172146] [<c10f0fc6>] SyS_write+0x49/0x81 >>> [ 823.172146] [<c14c14bb>] sysenter_do_call+0x12/0x32 >>> [ 823.172146] >>> [ 823.172146] -> #0 (sb_internal){.+.+.+}: >>> [ 823.172146] [<c106e972>] validate_chain.isra.35+0xfc7/0xff4 >>> [ 823.172146] [<c1070a11>] __lock_acquire+0x345/0xa11 >>> [ 823.172146] [<c1071c45>] lock_acquire+0x88/0x17e >>> [ 823.172146] [<c10f36eb>] __sb_start_write+0xad/0x177 >>> [ 823.172146] [<c11c5e60>] xfs_trans_alloc+0x1f/0x35 >>> [ 823.172146] [<c120a823>] xfs_inactive+0x129/0x4a3 >>> [ 823.172146] [<c11c280d>] xfs_fs_evict_inode+0x6c/0x114 >>> [ 823.172146] [<c1106678>] evict+0x8e/0x15d >>> [ 823.172146] [<c1107126>] iput+0xc4/0x138 >>> [ 823.172146] [<c1103504>] dput+0x1b2/0x257 >>> [ 823.172146] [<c10f1a30>] __fput+0x140/0x1eb >>> [ 823.172146] [<c10f1b0f>] ____fput+0xd/0xf >>> [ 823.172146] [<c1048477>] task_work_run+0x67/0x90 >>> [ 823.172146] [<c1001ea5>] do_notify_resume+0x61/0x63 >>> [ 823.172146] [<c14c0cfa>] work_notifysig+0x1f/0x25 >>> [ 823.172146] >>> [ 823.172146] other info that might help us debug this: >>> [ 823.172146] >>> [ 823.172146] Possible unsafe locking scenario: >>> [ 823.172146] >>> [ 823.172146] CPU0 CPU1 >>> [ 823.172146] ---- ---- >>> [ 823.172146] lock(&(&ip->i_lock)->mr_lock); >>> [ 823.172146] lock(sb_internal); >>> [ 823.172146] lock(&(&ip->i_lock)->mr_lock); >>> [ 823.172146] lock(sb_internal); >> >> Ah, now there's something I missed in all the xfs_inactive >> transaction rework - you can't call >> xfs_trans_alloc()/xfs-trans_reserve with the XFS_ILOCK_??? held. >> It's not the freeze locks you really have to worry about deadlocking >> if you do, it's deadlocking against log space that is much more >> likely. >> >> i.e. if you hold the ILOCK, the AIL can't get it to flush the inode >> to disk. That means if the inode you hold locked is pinning the tail >> of the log and there is no logspace for the transaction you are >> about to run, xfs_trans_reserve() will block forever waiting for the >> inode to be flushed and the tail of the log to move forward. This >> will end up blocking all further reservations and hence deadlock the >> filesystem... >> >> Brian, if you rewrite xfs_inactive in the way that I suggested, this >> problem goes away ;) >> >> Thanks for reporting this, Michael. >> > > Oh, very interesting scenario. Thanks again for catching this, Michael, > and for the analysis, Dave. I'll get this cleaned up in the next > revision as well. > > Brian > >> Cheers, >> >> Dave. >> > No problem, Brian. I'll try out your userspace as well. I had worked a bit on getting some sane numbers that are better than "results for copying 3 kernel gits to a 1k-blocksize FS with write cache turned off." Here's my attempt, as a more formal report. Thanks! Michael [REPORT FOLLOWS] Lockdep threw off the debug numbers for your patchset--a new lockdep is at the very end--so these tests were done with a fairly non-debug setup. The write cache is on for these tests as well. Casual "user command" benchmark using built 3.11.0+ kernel git tarball. The idea behind it: 1) Unpack a tarball, and 2) do something with its contents. The total files are among these: $TEST_DIR/a/linux/ ... $TEST_DIR/b/linux/ ... $TEST_DIR/kernel-git-built-2013-08-05.tar.gz The file systems are as follows: v4: mkfs.xfs -l logdev=$TEST_LOGDEV $TEST_DEV v4dirent: mkfs.xfs -n ftype=1 -l logdev=$TEST_LOGDEV $TEST_DEV v4d512bi: mkfs.xfs -n ftype=1 -i log=9 -l logdev=$TEST_LOGDEV $TEST_DEV v5: mkfs.xfs -m crc=1 -l logdev=$TEST_LOGDEV $TEST_DEV Dave had a benchmark set to break down v5 performance changes into a 512-byte-inode component and a CRC component. This is my edition of the benchmark, done with old spinning rust on x86 hardware, and updated to reflect your patchset. Patchset notation: "normal" is the normal xfs-oss/master with Dave's latest patches on top; "itree" adds the inode btree code. This is non-debug XFS. Tracers, lockdep, and almost all other "Kernel Hacking" kernel config items are not enabled. It's still a CONFIG_KERNEL_DEBUG=y kernel, though. ======================= REAL ======================== command patch v4 v4dirent v4d512bi v5 ==========|======|========|========|========|======== tar -xf normal 103.202 104.951 101.771 104.486 tar -xf itree 104.610 101.705 98.784 101.919 ----------+------+--------+--------+--------+-------- sha256sum normal 227.456 228.321 231.947 234.127 sha256sum itree 230.233 229.375 231.509 233.253 ----------+------+--------+--------+--------+-------- cp -r a b normal 239.714 242.754 248.994 249.584 cp -r a b itree 241.273 243.216 248.531 254.501 ----------+------+--------+--------+--------+-------- find . normal 11.894 12.370 12.324 12.397 find . itree 12.043 12.310 12.736 13.216 ----------+------+--------+--------+--------+-------- rm -r b normal 8.556 8.744 11.298 11.774 rm -r b itree 8.904 8.981 10.590 12.057 ----------+------+--------+--------+--------+-------- cp -r b a normal 262.065 256.448 272.290 272.221 cp -r b a itree 264.116 265.875 267.346 270.811 ----------+------+--------+--------+--------+-------- rm -r b normal 8.585 9.258 8.791 10.058 rm -r b itree 9.061 8.345 9.909 9.273 ----------+------+--------+--------+--------+-------- stat normal 161.853 162.772 163.555 165.046 stat itree 162.641 163.148 163.698 164.015 ----------+------+--------+--------+--------+-------- sha check normal 133.938 133.016 141.352 142.921 sha check itree 133.885 133.399 138.013 143.315 ----------+------+--------+--------+--------+-------- cp tarball normal 44.102 42.812 43.603 43.722 cp tarball itree 43.724 44.187 44.339 42.761 ==========|======|========|========|========|======== TOTAL normal 1201.365 1201.446 1235.925 1246.336 TOTAL itree 1210.490 1210.541 1225.455 1245.121 ======================= USER ======================== command patch v4 v4dirent v4d512bi v5 ==========|======|========|========|========|======== tar -xf normal 59.223 59.473 58.817 59.640 tar -xf itree 59.420 59.473 58.953 59.893 ----------+------+--------+--------+--------+-------- sha256sum normal 49.877 49.877 49.787 49.730 sha256sum itree 49.437 49.863 49.583 49.673 ----------+------+--------+--------+--------+-------- cp -r a b normal 0.697 0.707 0.743 0.800 cp -r a b itree 0.657 0.710 0.677 0.703 ----------+------+--------+--------+--------+-------- find . normal 0.257 0.237 0.233 0.223 find . itree 0.283 0.223 0.223 0.203 ----------+------+--------+--------+--------+-------- rm -r b normal 0.170 0.120 0.147 0.160 rm -r b itree 0.160 0.163 0.130 0.137 ----------+------+--------+--------+--------+-------- cp -r b a normal 0.817 0.763 0.817 0.763 cp -r b a itree 0.737 0.740 0.787 0.670 ----------+------+--------+--------+--------+-------- rm -r b normal 0.170 0.153 0.140 0.133 rm -r b itree 0.140 0.157 0.143 0.163 ----------+------+--------+--------+--------+-------- stat normal 1.660 1.653 1.570 1.720 stat itree 1.737 1.727 1.700 1.630 ----------+------+--------+--------+--------+-------- sha check normal 58.467 58.603 58.550 58.370 sha check itree 58.157 58.183 58.620 58.343 ----------+------+--------+--------+--------+-------- cp tarball normal 0.023 0.027 0.033 0.037 cp tarball itree 0.017 0.020 0.020 0.020 ==========|======|========|========|========|======== TOTAL normal 171.361 171.613 170.837 171.576 TOTAL itree 170.745 171.259 170.836 171.435 ======================== SYS ======================== command patch v4 v4dirent v4d512bi v5 ==========|======|========|========|========|======== tar -xf normal 19.770 19.800 19.960 20.770 tar -xf itree 19.550 19.930 20.067 20.963 ----------+------+--------+--------+--------+-------- sha256sum normal 17.157 14.607 14.393 16.053 sha256sum itree 17.277 14.813 14.550 15.007 ----------+------+--------+--------+--------+-------- cp -r a b normal 18.697 18.973 18.687 19.253 cp -r a b itree 19.033 18.993 18.783 19.703 ----------+------+--------+--------+--------+-------- find . normal 0.820 0.573 0.537 0.597 find . itree 0.793 0.593 0.547 0.610 ----------+------+--------+--------+--------+-------- rm -r b normal 3.883 3.827 3.800 3.967 rm -r b itree 4.053 3.937 4.003 4.143 ----------+------+--------+--------+--------+-------- cp -r b a normal 19.043 19.083 18.753 19.503 cp -r b a itree 19.203 19.100 19.040 19.680 ----------+------+--------+--------+--------+-------- rm -r b normal 4.097 3.947 3.900 4.123 rm -r b itree 4.287 4.067 4.093 4.227 ----------+------+--------+--------+--------+-------- stat normal 11.337 10.730 10.727 10.680 stat itree 11.080 10.827 10.800 10.457 ----------+------+--------+--------+--------+-------- sha check normal 8.970 8.920 8.980 9.507 sha check itree 9.053 9.143 8.540 9.420 ----------+------+--------+--------+--------+-------- cp tarball normal 5.537 5.397 5.470 5.373 cp tarball itree 5.390 5.313 5.460 5.343 ==========|======|========|========|========|======== TOTAL normal 109.311 105.857 105.207 109.826 TOTAL itree 109.719 106.716 105.883 109.553 The rest of this report is supplementary noise and the lockdep. # XFS kernel configuration: CONFIG_XFS_FS=y CONFIG_XFS_QUOTA=y CONFIG_XFS_POSIX_ACL=y CONFIG_XFS_RT=y # CONFIG_XFS_WARN is not set # CONFIG_XFS_DEBUG is not set # `uname -a` output: Linux plbearer 3.11.0+ #4 Sat Sep 7 13:04:53 EDT 2013 i686 Intel(R) Pentium(R) 4 CPU 1.80GHz GenuineIntel GNU/Linux RAM: 512 MB, less 9 MB for capture kernel # Hard drive used for $TEST_DEV: Model Family: Western Digital Caviar Device Model: WDC WD600BB-75CAA0 User Capacity: 60,022,480,896 bytes [60.0 GB] # Hard drive used for $TEST_LOGDEV and kernel-git tarball: Model Family: Seagate Barracuda 7200.9 Device Model: ST3120814A User Capacity: 120,034,123,776 bytes [120 GB] root@plbearer:~/results# uname -a # Sample xfs_info output for $TEST_DEV, to show how XFS is using the partition: meta-data=/dev/sdb4 isize=512 agcount=4, agsize=720896 blks = sectsz=512 attr=2, projid32bit=1 = crc=0 data = bsize=4096 blocks=2883584, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =external bsize=4096 blocks=32768, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 # cat /proc/partitions major minor #blocks name 8 0 117220824 sda 8 1 98304 sda1 # shared /boot 8 2 1 sda2 8 5 65536 sda5 8 6 131072 sda6 # $TEST_LOGDEV 8 7 131072 sda7 8 8 616448 sda8 8 9 11275264 sda9 # inactive root (v5 XFS), holds tarball 8 10 104895960 sda10 11 0 1048575 sr0 8 16 58615704 sdb 8 17 3406848 sdb1 # active root partition (JFS) 8 18 786432 sdb2 8 19 20971520 sdb3 8 20 11534336 sdb4 # $TEST_DEV 8 21 4128768 sdb5 8 22 786432 sdb6 8 23 524288 sdb7 8 24 524288 sdb8 8 25 1048576 sdb9 8 26 10708871 sdb10 8 27 4194304 sdb11 Lockdep that kept tar jobs from completing. It was found during several other tests before I gave up on the debug benchmark idea. ================================= [ INFO: inconsistent lock state ] 3.11.0+ #2 Not tainted --------------------------------- inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. tar/287 [HC0[0]:SC0[0]:HE1:SE1] takes: (&(&ip->i_lock)->mr_lock){++++?-}, at: [<c120cc1b>] xfs_ilock+0x100/0x1f1 {IN-RECLAIM_FS-W} state was registered at: [<c1070697>] __lock_acquire+0x63b/0x1953 [<c1072515>] lock_acquire+0x88/0x17e [<c104fc0f>] down_write_nested+0x4f/0x9d [<c120cc1b>] xfs_ilock+0x100/0x1f1 [<c11bd00d>] xfs_reclaim_inode+0xf4/0x30a [<c11bd4d4>] xfs_reclaim_inodes_ag+0x2b1/0x488 [<c11bd729>] xfs_reclaim_inodes_nr+0x2d/0x33 [<c11c7e1e>] xfs_fs_free_cached_objects+0x13/0x15 [<c10f3778>] prune_super+0xd1/0x15c [<c10c99da>] shrink_slab+0x143/0x3d8 [<c10cc768>] kswapd+0x45d/0x835 [<c104b617>] kthread+0xa7/0xa9 [<c14e26b7>] ret_from_kernel_thread+0x1b/0x28 irq event stamp: 23333689 hardirqs last enabled at (23333689): [<c14e1375>] _raw_spin_unlock_irq+0x27/0x36 hardirqs last disabled at (23333688): [<c14e11ea>] _raw_spin_lock_irq+0x15/0x7a softirqs last enabled at (23333678): [<c1031d2d>] __do_softirq+0x142/0x2ce softirqs last disabled at (23333649): [<c1031fec>] irq_exit+0x6d/0x73 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&ip->i_lock)->mr_lock); <Interrupt> lock(&(&ip->i_lock)->mr_lock); *** DEADLOCK *** 3 locks held by tar/287: #0: (sb_writers#8){.+.+.+}, at: [<c110d858>] mnt_want_write+0x1e/0x3e #1: (&(&ip->i_lock)->mr_lock){++++?-}, at: [<c120cc1b>] xfs_ilock+0x100/0x1f1 #2: (sb_internal){.+.+.+}, at: [<c11cb6d0>] xfs_trans_alloc+0x1f/0x35 stack backtrace: CPU: 0 PID: 287 Comm: tar Not tainted 3.11.0+ #2 Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002 de296540 de296540 de3e7d5c c14db319 de3e7d98 c14d8708 c1618f4e c1619343 0000011f 00000000 00000000 00000000 00000000 00000001 00000001 c1619343 0000000a de2969b0 00000400 de3e7dcc c106e6c4 0000000a de3e7e1c c10703d1 Call Trace: [<c14db319>] dump_stack+0x16/0x18 [<c14d8708>] print_usage_bug+0x1dc/0x1e6 [<c106e6c4>] mark_lock+0x28c/0x2c1 [<c10703d1>] ? __lock_acquire+0x375/0x1953 [<c106dc88>] ? print_shortest_lock_dependencies+0x18f/0x18f [<c106e77a>] mark_held_locks+0x81/0xe7 [<c106f136>] lockdep_trace_alloc+0xa1/0xe3 [<c10ec3ed>] kmem_cache_alloc+0x28/0x1f2 [<c11cd53f>] ? kmem_zone_alloc+0x55/0xd0 [<c11cd53f>] kmem_zone_alloc+0x55/0xd0 [<c11cb6d0>] ? xfs_trans_alloc+0x1f/0x35 [<c11cd5cb>] kmem_zone_zalloc+0x11/0x36 [<c11cb663>] _xfs_trans_alloc+0x2e/0x7c [<c11cb6de>] xfs_trans_alloc+0x2d/0x35 [<c1210743>] xfs_inactive+0x129/0x4a3 [<c106ebaf>] ? trace_hardirqs_on+0xb/0xd [<c14e1375>] ? _raw_spin_unlock_irq+0x27/0x36 [<c11c807d>] xfs_fs_evict_inode+0x6c/0x114 [<c1108268>] evict+0x8e/0x15d [<c1108d16>] iput+0xc4/0x138 [<c10fec3c>] do_unlinkat+0x127/0x17f [<c102547e>] ? vmalloc_sync_all+0x133/0x133 [<c10fecb7>] SyS_unlinkat+0x23/0x3a [<c14e273b>] sysenter_do_call+0x12/0x32 _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs