> Bufmap rewrite is really completely untested -
> it's done pretty much blindly and I'd be surprised as hell if it has no
> brainos at the first try.

You did pretty good, it takes me two tries to get hello world right...

Right off the bat, the kernel crashed, because:

static struct slot_map rw_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
};
static struct slot_map readdir_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
};
                                           ^
                                           |
                                         D'OH!

But after that, stuff almost worked... It can still "sort of" wedge up.
We think that when dbench is running and the client-core is killed, you
can hit orangefs_bufmap_finalize -> mark_killed -> run_down/schedule()
while those wait_for_completion_* schedules of extant ops in
wait_for_matching_downcall have also given up the processor... Then...
when you interrupt dbench, stuff starts flowing again...

I added a couple of gossip statements inside of mark_killed and run_down...

Feb 15 16:40:15 be1 kernel: [ 349.981597] orangefs_bufmap_finalize: called
Feb 15 16:40:15 be1 kernel: [ 349.981600] mark_killed enter
Feb 15 16:40:15 be1 kernel: [ 349.981602] mark_killed: leave
Feb 15 16:40:15 be1 kernel: [ 349.981603] mark_killed enter
Feb 15 16:40:15 be1 kernel: [ 349.981605] mark_killed: leave
Feb 15 16:40:15 be1 kernel: [ 349.981606] run_down: enter:-1:
Feb 15 16:40:15 be1 kernel: [ 349.981608] run_down: leave
Feb 15 16:40:15 be1 kernel: [ 349.981609] run_down: enter:-2:
Feb 15 16:40:15 be1 kernel: [ 349.981610] run_down: before schedule:-2:

Stuff just sits here while dbench is still running. Then Ctrl-C on dbench
and off to the races again.

Feb 15 16:42:28 be1 kernel: [ 483.049927] *** wait_for_matching_downcall: operation interrupted by a signal (tag 16523, op ffff880013418000)
Feb 15 16:42:28 be1 kernel: [ 483.049930] Interrupted: Removed op ffff880013418000 from htable_ops_in_progress
Feb 15 16:42:28 be1 kernel: [ 483.049932] orangefs: service_operation orangefs_inode_getattr returning: -4 for ffff880013418000.
Feb 15 16:42:28 be1 kernel: [ 483.050116] *** wait_for_matching_downcall: operation interrupted by a signal (tag 16518, op ffff8800001a8000)
Feb 15 16:42:28 be1 kernel: [ 483.050118] Interrupted: Removed op ffff8800001a8000 from htable_ops_in_progress
Feb 15 16:42:28 be1 kernel: [ 483.050120] orangefs: service_operation orangefs_inode_getattr returning: -4 for ffff8800001a8000.

Martin already has a patch... What do you think?

I'm headed home for supper...

-Mike

On Mon, Feb 15, 2016 at 1:45 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Feb 15, 2016 at 12:46:51PM -0500, Mike Marshall wrote:
>> I pushed the list_del up to the kernel.org for-next branch...
>>
>> And I've been running tests with the CRUDE bandaid... weird
>> results...
>>
>> No oopses, no WARN_ONs... I was running dbench and ls -R
>> or find and kill-minus-nining different ones of them with no
>> perceived resulting problems, so I moved on to signalling
>> the client-core to abort... it restarted numerous times,
>> and then stuff wedged up differently than I've seen before.
>
> There are other problems with that thing (starting with the fact that
> retrying readdir/wait_for_direct_io can try to grab a slot despite the
> bufmap winding down). OK, at that point I think we should try to see
> if bufmap rewrite works - I've rebased on top of your branch and pushed
> (head at 8c3bc9a). Bufmap rewrite is really completely untested -
> it's done pretty much blindly and I'd be surprised as hell if it has no
> brainos at the first try.
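
P.S. In case it helps anyone following along, the fix for that initializer
braino is just to point readdir_map's wait queue head at itself instead of
at rw_map's. A minimal sketch of the obvious change (my guess at it, not
necessarily the patch Martin has queued):

static struct slot_map rw_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
};
static struct slot_map readdir_map = {
        .c = -1,
        /* was rw_map.q via copy-paste; each map needs its own wait queue head */
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(readdir_map.q)
};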