On Mon, 8 Dec 2014 14:58:55 -0500
"J. Bruce Fields" <bfields@xxxxxxxxxxxx> wrote:

> On Mon, Dec 08, 2014 at 02:54:29PM -0500, Jeff Layton wrote:
> > On Mon, 8 Dec 2014 13:57:31 -0500
> > "J. Bruce Fields" <bfields@xxxxxxxxxxxx> wrote:
> >
> > > On Tue, Dec 02, 2014 at 11:50:24AM -0500, J. Bruce Fields wrote:
> > > > On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> > > > > On Tue, 2 Dec 2014 06:57:50 -0500
> > > > > Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > > On Mon, 1 Dec 2014 19:38:19 -0500
> > > > > > Trond Myklebust <trondmy@xxxxxxxxx> wrote:
> > > > > >
> > > > > > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> > > > > > > > I find it hard to think about how we expect this to affect performance.
> > > > > > > > So it comes down to the observed results, I guess, but just trying to
> > > > > > > > get an idea:
> > > > > > > >
> > > > > > > >     - this eliminates sp_lock. I think the original idea here was
> > > > > > > >       that if interrupts could be routed correctly then there
> > > > > > > >       shouldn't normally be cross-cpu contention on this lock. Do
> > > > > > > >       we understand why that didn't pan out? Is hardware capable of
> > > > > > > >       doing this really rare, or is it just too hard to configure it
> > > > > > > >       correctly?
> > > > > > > >
> > > > > > > One problem is that a 1MB incoming write will generate a lot of
> > > > > > > interrupts. While that is not so noticeable on a 1GigE network, it is
> > > > > > > on a 40GigE network. The other thing you should note is that this
> > > > > > > workload was generated with ~100 clients pounding on that server, so
> > > > > > > there are a fair amount of TCP connections to service in parallel.
> > > > > > > Playing with the interrupt routing doesn't necessarily help you so
> > > > > > > much when all those connections are hot.
> > > > > > >
> > > > >
> > > > > In principle though, the percpu pool_mode should have alleviated the
> > > > > contention on the sp_lock. When an interrupt comes in, the xprt gets
> > > > > queued to its pool. If there is a pool for each cpu then there should
> > > > > be no sp_lock contention. The pernode pool mode might also have
> > > > > alleviated the lock contention to a lesser degree in a NUMA
> > > > > configuration.
> > > > >
> > > > > Do we understand why that didn't help?
> > > >
> > > > Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
> > > > not entirely orthogonal problem.
> > > >
> > > > (And I thought it should be addressable separately; Trond and I talked
> > > > about this in Westford. I think it currently wakes a thread to handle
> > > > each individual tcp segment--but shouldn't it be able to do all the data
> > > > copying in the interrupt and wait to wake up a thread until it's got the
> > > > entire rpc?)
> > >
> > > By the way, Jeff, isn't this part of what's complicating the workqueue
> > > change? That would seem simpler if we didn't need to queue work until
> > > we had the full rpc.
> > >
> > No, I don't think that really adds much in the way of complexity there.
> >
> > I have that set working. Most of what's holding me up from posting the
> > next iteration of that set is performance. So far, my testing shows
> > that the workqueue-based code is slightly slower. I've been trying to
> > figure out why that is and whether I can do anything about it. Maybe
> > I'll go ahead and post it as a second RFC set, until I can get to the
> > bottom of the perf delta.
> >
> > I have pondered doing what you're suggesting above though and it's not a
> > trivial change.
> >
> > The problem is that all of the buffers into which we do receives are
> > associated with the svc_rqst (which we don't really have when the
> > interrupt comes in), and not the svc_xprt (which we do have at that
> > point).
> >
> > So, you'd need to restructure the code to hang a receive buffer off
> > of the svc_xprt.
>
> Have you looked at svsk->sk_pages and svc_tcp_{save,restore}_pages?
>
> --b.
>

Ahh, no I hadn't...interesting. So, basically do the receive into the
rqstp's buffer, and if you don't get everything you need you stuff the
pages into the sk_pages array to await the next pass. Weird design...

Ok, so you could potentially flip that around. Do the receive into the
sk_pages buffer in softirq context, and then hand those off to the rqst
(in some fashion) once you've received a full RPC. You'd have to work
out how to replenish the sk_pages after each receive, and what to do
about RDMA, but it's probably doable.

> > Once you receive an entire RPC, you'd then have to
> > flip that buffer over to a svc_rqst, queue up the job and grab a new
> > buffer for the xprt (maybe you could swap them?).
> >
> > The problem is what to do if you don't have a buffer (or svc_rqst)
> > available when an IRQ comes in. You can't allocate one from softirq
> > context, so you'd need to offload that case to a workqueue or something
> > anyway (which adds a bit of complexity as you'd then have to deal with
> > two different receive paths).
> >
> > I'm also not sure about RDMA. When you get an RPC, the server usually
> > has to do an RDMA READ from the client to pull all of the data in. I
> > don't think you want to do that from softirq context, so that would
> > also need to be queued up somehow.
> >
> > All of that said, it would probably reduce some context switching if
> > we can make that work. Also, I suspect that doing that in the context
> > of the workqueue-based code would probably be at least a little simpler.
> >
> > --
> > Jeff Layton <jlayton@xxxxxxxxxxxxxxx>

--
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
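
For reference, the idea discussed in this thread (receive into a buffer
owned by the transport, then swap it with the request's buffer once the
whole RPC has arrived) can be modelled in plain user-space C. This is only
a rough sketch under loud assumptions: rx_buf, toy_xprt, toy_rqst and both
functions are hypothetical names, not the kernel's svc_xprt/svc_rqst code.
It assumes a single-fragment RPC record (the standard 4-byte record marker
on TCP: top bit = last fragment, low 31 bits = length) and one record in
flight per connection, and it ignores locking, RDMA, and the case where no
spare buffer is available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 4096

/* Hypothetical stand-ins; NOT the kernel's svc_xprt / svc_rqst. */
struct rx_buf {
    uint8_t data[BUF_SIZE];
    size_t  len;   /* bytes received so far, record marker included */
    size_t  want;  /* total bytes expected; 0 until the marker is parsed */
};

struct toy_xprt {
    struct rx_buf *rx;   /* filled from the "softirq" receive path */
};

struct toy_rqst {
    struct rx_buf *buf;  /* owned by the worker that processes the RPC */
};

/*
 * Append one TCP segment to the transport's buffer.
 * Returns 1 once a whole record is buffered, 0 if more data is needed,
 * and -1 if the segment would overflow the buffer.
 */
static int toy_xprt_receive(struct toy_xprt *x, const uint8_t *seg, size_t n)
{
    struct rx_buf *b = x->rx;

    if (n > BUF_SIZE - b->len)
        return -1;
    memcpy(b->data + b->len, seg, n);
    b->len += n;

    /* Parse the big-endian record marker once all 4 bytes have arrived. */
    if (b->want == 0 && b->len >= 4) {
        uint32_t marker = ((uint32_t)b->data[0] << 24) |
                          ((uint32_t)b->data[1] << 16) |
                          ((uint32_t)b->data[2] << 8)  |
                           (uint32_t)b->data[3];
        b->want = 4 + (size_t)(marker & 0x7fffffffu);
    }
    return b->want != 0 && b->len >= b->want;
}

/*
 * Hand the completed buffer to the request and give the transport the
 * request's idle buffer, so nothing is allocated or copied on this path.
 */
static void toy_xprt_handoff(struct toy_xprt *x, struct toy_rqst *r)
{
    struct rx_buf *full = x->rx;

    assert(r->buf != NULL);  /* assumption: a spare buffer always exists */
    x->rx = r->buf;
    x->rx->len = 0;
    x->rx->want = 0;
    r->buf = full;
}
```

A caller would feed each incoming segment to toy_xprt_receive() and only
call toy_xprt_handoff() (and wake a worker) when it returns 1. Swapping
pointers instead of copying is what keeps the receive path free of
allocations; the "no spare buffer" case raised above is exactly what the
assert papers over here.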