On Tue, Feb 09, 2016 at 05:40:49PM +0000, Al Viro wrote:

> Could you try, on top of those fixes, comment the entire
> 	if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) {
> 		long n = wait_for_completion_interruptible_timeout(&op->done,
> 							op_timeout_secs * HZ);
> 		if (unlikely(n < 0)) {
> 			gossip_debug(GOSSIP_DEV_DEBUG,
> 				"%s: signal on I/O wait, aborting\n",
> 				__func__);
> 		} else if (unlikely(n == 0)) {
> 			gossip_debug(GOSSIP_DEV_DEBUG,
> 				"%s: timed out.\n",
> 				__func__);
> 		}
> 	}
> in orangefs_devreq_write_iter() out and see if the corruption happens?

Another thing: what are the protocol rules regarding cancels?  The current
code looks very odd - if we get hit by a signal after the daemon has picked
up e.g. a read request but before it has replied, we will call
orangefs_cancel_op_in_progress(), which will call service_operation() with
ORANGEFS_OP_CANCELLATION.  And that'll insert the cancel request into the
list, practically immediately notice that we have a pending signal, remove
the cancel request from the list and bugger off.  With the daemon almost
certainly *not* getting to see it at all.

I've asked that before; if anybody has explained it, I've missed that reply.
How the fuck is that supposed to work?  Forget the kernel-side implementation
details, what should the daemon see in such a situation?

I would expect something like "you can't reuse a slot until the operation has
either been completed or purged, or a cancel has been sent and ACKed by the
daemon".  Is that what is intended?  If so, the handling of cancels might be
better off asynchronous - let the slot freeing be done after the cancel has
been ACKed and _not_ in the context of the original syscall...

There are some traces of AIO support in that thing; could this be a victim of
trimming async parts for submission into the mainline?
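
To make the slot rule I'd expect concrete, here is a minimal userspace sketch of the slot lifecycle. It is not the kernel or daemon code, and all of the names in it (slot_state, slot_event, slot_handle(), slot_can_reuse(), demo_signal_after_pickup()) are made up for illustration.  What happens to a reply that shows up after the cancel has been sent is an assumption on my part, since that is exactly the part of the protocol that is unclear; the sketch simply refuses to free the slot on it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical slot states; none of these names exist in the OrangeFS code. */
enum slot_state {
	SLOT_FREE,		/* slot may be handed to a new operation */
	SLOT_IN_PROGRESS,	/* daemon has picked the request up */
	SLOT_CANCEL_SENT,	/* cancel queued for the daemon, no ACK yet */
};

/* Hypothetical events seen by the slot owner. */
enum slot_event {
	EV_SUBMIT,		/* new operation wants the slot */
	EV_COMPLETED,		/* daemon replied to the operation */
	EV_PURGED,		/* operation purged (e.g. daemon went away) */
	EV_SIGNAL,		/* submitter got hit by a signal */
	EV_CANCEL_ACKED,	/* daemon acknowledged the cancel */
};

/* A slot may be reused only once it is back in SLOT_FREE. */
bool slot_can_reuse(enum slot_state s)
{
	return s == SLOT_FREE;
}

/*
 * Apply one event to the slot.  Returns true if the transition is allowed,
 * false (leaving the state untouched) otherwise.
 */
bool slot_handle(enum slot_state *s, enum slot_event ev)
{
	switch (*s) {
	case SLOT_FREE:
		if (ev == EV_SUBMIT) {
			*s = SLOT_IN_PROGRESS;
			return true;
		}
		break;
	case SLOT_IN_PROGRESS:
		if (ev == EV_COMPLETED || ev == EV_PURGED) {
			*s = SLOT_FREE;
			return true;
		}
		if (ev == EV_SIGNAL) {
			/* Do not free here; wait for the daemon's ACK. */
			*s = SLOT_CANCEL_SENT;
			return true;
		}
		break;
	case SLOT_CANCEL_SENT:
		/*
		 * Assumption: only the ACK (or a purge) releases the slot;
		 * a late EV_COMPLETED is rejected in this sketch.
		 */
		if (ev == EV_CANCEL_ACKED || ev == EV_PURGED) {
			*s = SLOT_FREE;
			return true;
		}
		break;
	}
	return false;
}

/* Walk the "signal after the daemon picked the request up" scenario. */
enum slot_state demo_signal_after_pickup(void)
{
	enum slot_state s = SLOT_FREE;

	assert(slot_handle(&s, EV_SUBMIT));
	assert(slot_handle(&s, EV_SIGNAL));
	assert(!slot_can_reuse(s));		/* cancel not ACKed yet */
	assert(!slot_handle(&s, EV_SUBMIT));	/* reuse attempt is refused */
	assert(slot_handle(&s, EV_CANCEL_ACKED));
	return s;
}
```

The point of the sketch is that the transition back to SLOT_FREE is driven by the daemon's ACK arriving, not by the interrupted syscall returning, which is what an asynchronous cancel would amount to.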