Al... I read your description of waiting in orangefs_devreq_write_iter() and it seems to make sense, so I built a version with it removed, and I also still have the bit in file.c with handle_io_error removed. I put a git diff at the end of this message in case it is not clear what I did... that I GOT HERE I put in there was to help me see the results of running the mmap test you showed me... I use dbench as a tester, maybe too much, but when I break the code I can usually see that it is broken when I run dbench. So... without the waiting in orangefs_devreq_write_iter, I ran three separate dbenches like this at the same time: /home/hubcap/dbench -c client.txt 10 -t 300 That's 30 threads and tons of concurrent IO, both reading and writing, and all three dbenches complete properly... so... looking at the code makes it seem reasonble to get rid of the wait, and tests make it seem reasonable too. On to the restarting of client-core... they tell me it works in production with our frankensteined works-on-everything out-of-tree kernel module, I'm not sure. But in my tests, if I kill the client-core bad things happen... sometimes the client-core doesn't restart, and the kernel gets sick (hangs or slows way down but no oops). When the client-core does restart, the activity I had going on (dbench again) fizzles out, and the filesystem is corrupted... [root@be1 d.dbench1]# pwd /pvfsmnt/d.dbench1 [root@be1 d.dbench1]# ls clients client.txt [root@be1 d.dbench1]# rm -rf clients rm: cannot remove ‘clients/client5/~dmtmp/PWRPNT’: Directory not empty Once when I stopped the client-core, the kernel did oops: Jan 24 11:17:39 be1 kernel: [ 4284.798158] Call Trace: Jan 24 11:17:39 be1 kernel: [ 4284.799274] [<ffffffff8110547c>] ? print_time.part.9+0x6c/0x90 Jan 24 11:17:39 be1 kernel: [ 4284.801904] [<ffffffff811a265d>] ? irq_work_queue+0xd/0x80 Jan 24 11:17:39 be1 kernel: [ 4284.804376] [<ffffffff81106c32>] ? wake_up_klogd+0x32/0x40 Jan 24 11:17:39 be1 kernel: [ 4284.806850] [<ffffffff810f63f4>] lock_acquire+0xc4/0x150 Jan 24 11:17:39 be1 kernel: [ 4284.809274] [<ffffffff813471ab>] ? put_back_slot+0x1b/0x70 Jan 24 11:17:39 be1 kernel: [ 4284.811790] [<ffffffff817db871>] _raw_spin_lock+0x31/0x40 Jan 24 11:17:39 be1 kernel: [ 4284.814230] [<ffffffff813471ab>] ? put_back_slot+0x1b/0x70 Jan 24 11:17:39 be1 kernel: [ 4284.816750] [<ffffffff813471ab>] put_back_slot+0x1b/0x70 Jan 24 11:17:39 be1 kernel: [ 4284.819180] [<ffffffff813479ab>] orangefs_readdir_index_put+0x4b/0x70 Jan 24 11:17:39 be1 kernel: [ 4284.822081] [<ffffffff81346f22>] orangefs_readdir+0xd42/0xd50 Jan 24 11:17:39 be1 kernel: [ 4284.824672] [<ffffffff810f344d>] ? trace_hardirqs_on+0xd/0x10 Jan 24 11:17:39 be1 kernel: [ 4284.827214] [<ffffffff81256f7f>] iterate_dir+0x9f/0x130 Jan 24 11:17:39 be1 kernel: [ 4284.829567] [<ffffffff81257491>] SyS_getdents+0xa1/0x140 Jan 24 11:17:39 be1 kernel: [ 4284.832022] [<ffffffff81257010>] ? iterate_dir+0x130/0x130 Jan 24 11:17:39 be1 kernel: [ 4284.834456] [<ffffffff817dc332>] entry_SYSCALL_64_fastpath+0x12/0x76 I turned on the kernel crash dump stuff so that I could see more, but it hasn't crashed again... Anyhow, I don't think the "restart the client-core" code is up to snuff <g>. I'll look closer at how the out-of-tree module works, maybe it really does work and we've broken it with our massive changes to the upstream version over the last few years. I see that the client (whose job it is to restart the client-core) and the client-core implement signal handling with signal(2), whose man page says to use sigaction(2) instead... # pwd /home/hubcap/linux [root@be1 linux]# git diff diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c index 812844f..d8e138f 100644 --- a/fs/orangefs/devorangefs-req.c +++ b/fs/orangefs/devorangefs-req.c @@ -421,6 +421,7 @@ wakeup: * application reading/writing this device to return until * the buffers are done being used. */ +/* if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) { long n = wait_for_completion_interruptible_timeout(&op->done, op_timeout_secs * HZ); @@ -434,6 +435,7 @@ wakeup: __func__); } } +*/ out: op_release(op); return ret; diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c index c585063..208f0ee 100644 --- a/fs/orangefs/file.c +++ b/fs/orangefs/file.c @@ -245,15 +245,9 @@ populate_shared_memory: buffer_index, iter, new_op->downcall.resp.io.amt_complete); - if (ret < 0) { - /* - * put error codes in downcall so that handle_io_error() - * preserves it properly - */ - new_op->downcall.status = ret; - handle_io_error(); - goto out; - } +gossip_err("%s: I GOT HERE, ret:%zd:\n", __func__, ret); + if (ret < 0) + goto done_copying; } gossip_debug(GOSSIP_FILE_DEBUG, "%s(%pU): Amount written as returned by the sys-io call:%d\n", @@ -263,6 +257,8 @@ populate_shared_memory: ret = new_op->downcall.resp.io.amt_complete; +done_copying: + /* * tell the device file owner waiting on I/O that this read has * completed and it can return now. On Sat, Jan 23, 2016 at 11:05 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > On Sun, Jan 24, 2016 at 12:16:15AM +0000, Al Viro wrote: >> On Sat, Jan 23, 2016 at 05:36:41PM -0500, Mike Marshall wrote: >> > AV> IOW, would the following do the right thing? >> > AV> That would've left us with only one caller of >> > AV> handle_io_error()... >> > >> > It works. With your simplified code all >> > the needed things still happen: complete and >> > bufmap_put... >> > >> > I've never had an error there unless I forgot >> > to turn on the client-core... >> > >> > You must be looking for a way to get rid of >> > another macro <g>... >> >> That as well, but mostly I want to sort the situation with cancels out and >> get a better grasp on when can that code be reached. BTW, an error at that >> spot is trivial to arrange - just pass read() a destination with munmapped >> page in the middle and it'll trigger just fine. IOW, >> p = mmap(NULL, 65536, PROT_READ|PROT_WRITE, >> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); >> munmap(p + 16384, 16384); >> read(fd, p, 65536); >> with fd being a file on orangefs should step into that. > > Hmm... I've just realized that I don't understand why you are waiting in > the orangefs_devreq_write_iter() at all. After all, you've reserved the > slot in wait_for_direct_io() and do not release it until right after you've > done complete(&op->done). Why should server block while your read(2) copies > the data from bufmap to its real destination? After all, any further > request hitting the same slot won't come until the slot is released anyway, > right? What (and why) would the server be doing to that slot in the > meanwhile? It's already copied whatever data it was going to copy as part of > that file_read... > > Speaking of your slot allocator - surely it would be better to maintain the > count of unused slots there? For example, replace that waitq with semaphore, > initialize it to number of slots, and have wait_for_a_slot() do > if (down_interruptible(slargs->slots_sem)) { > whine "interrupted" > return -EINTR; > } > /* now we know that there's an empty slot for us */ > spin_lock(slargs->slots_lock); > n = find_first_zero_bit(slargs->slots_bitmap, slargs->slot_count); > set_bit(slargs->slots_bitmap, n); > spin_unlock(slargs->slots_lock); > // or a lockless variant thereof, for that matter - just need to be > // carefull with barriers > return n; > with put_back_slot() being > spin_lock > clear_bit > spin_unlock > up > > Wait a minute... What happens if you have a daemon disconnect while somebody > is holding a bufmap slot? Each of those (as well as whoever's waiting for > a slot to come free) is holding a reference to bufmap. devreq_mutex won't > do a damn thing, contrary to the comment in ->release() - it's _not_ held > across the duration of wait_for_direct_io(). > > We'll just decrement the refcount of bufmap, do nothing since it hasn't > reached zero, proceed to mark all ops as purged, wake each service_operation() > up and sod off. Now, the holders of those slots will call > orangefs_get_bufmap_init(), get 1 (since we hadn't dropped the last reference > yet - can it *ever* see 0 there, actually?) and return -EAGAIN. With > wait_for_direct_io() noticing that, freeing the slot and going into restart. > And if there was the only one, we are fine, but what if there were several? > > Suppose there had been two read() going on at the time of disconnect. > The first one drops the slot. Good, refcount of bufmap is down to 1 now. > And we go back to orangefs_bufmap_get(). Which calls orangefs_bufmap_ref(). > Which sees that __orangefs_bufmap is still non-NULL. And promptly regains > the reference and allocates the slot in that sucker. Then it's back to > service_operations()... Which will check is_daemon_in_service(), possibly > getting "yes, it is" (if the new one got already started). Ouch. > > The fun part in all of that is that new daemon won't be able to get new > bufmap in place until all users of the old one run down. What a mess... > > Ho-hum... So we need to > a) have ->release() wait for all slots to get freed > b) have orangefs_bufmap_get() recognize that situation and > _not_ get new reference to bufmap that is already going down. > c) move the "wait for new daemon to come and install a new bufmap" > to some point past the release of the slot - service_operation() is too > early for that to work. > > Or am I missing something simple that makes the scenario above not go that > way? Confused... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html