On Fri, Nov 14, 2014 at 4:40 PM, Tom Herbert <therbert@xxxxxxxxxx> wrote:
> On Fri, Nov 14, 2014 at 4:24 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Fri, Nov 14, 2014 at 4:06 PM, Tom Herbert <therbert@xxxxxxxxxx> wrote:
>>> On Fri, Nov 14, 2014 at 2:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>> On Fri, Nov 14, 2014 at 1:36 PM, Tom Herbert <therbert@xxxxxxxxxx> wrote:
>>>>> On Fri, Nov 14, 2014 at 12:34 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>> On Fri, Nov 14, 2014 at 12:25 PM, Tom Herbert <therbert@xxxxxxxxxx> wrote:
>>>>>>> On Fri, Nov 14, 2014 at 12:16 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>>> On Fri, Nov 14, 2014 at 11:52 AM, Tom Herbert <therbert@xxxxxxxxxx> wrote:
>>>>>>>>> On Fri, Nov 14, 2014 at 11:33 AM, Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
>>>>>>>>>> On Fri, 2014-11-14 at 09:17 -0800, Andy Lutomirski wrote:
>>>>>>>>>>
>>>>>>>>>>> As a heavy user of RFS (and finder of bugs in it, too), here's my
>>>>>>>>>>> question about this API:
>>>>>>>>>>>
>>>>>>>>>>> How does an application tell whether the socket represents a
>>>>>>>>>>> non-actively-steered flow? If the flow is subject to RFS, then moving
>>>>>>>>>>> the application handling to the socket's CPU seems problematic, as the
>>>>>>>>>>> socket's CPU might move as well. The current implementation in this
>>>>>>>>>>> patch seems to tell me which CPU the most recent packet came in on,
>>>>>>>>>>> which is not necessarily very useful.
>>>>>>>>>>
>>>>>>>>>> It's the CPU that hit the TCP stack, bringing dozens of cache lines into
>>>>>>>>>> its cache. That is all that matters.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Some possibilities:
>>>>>>>>>>>
>>>>>>>>>>> 1. Let SO_INCOMING_CPU fail if RFS or RPS are in play.
>>>>>>>>>>
>>>>>>>>>> Well, the idea is to not use RFS at all. Otherwise, it is useless.
>>>>>>>>
>>>>>>>> Sure, but how do I know that it'll be the same CPU next time?
>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Bear in mind this is only an interface to report the RX CPU; in itself it
>>>>>>>>> doesn't provide any functionality for changing scheduling. There is
>>>>>>>>> obviously logic needed in user space that would need to do something.
>>>>>>>>>
>>>>>>>>> If we track the interrupting CPU in the skb, the interface could easily be
>>>>>>>>> extended to provide the interrupting CPU, the RPS CPU (calculated at the
>>>>>>>>> time it is reported), and the CPU processing transport (post-steering,
>>>>>>>>> which is what is currently returned). That would provide the complete
>>>>>>>>> picture needed to control scheduling of a flow from userspace, and an
>>>>>>>>> interface to selectively turn off RFS for a socket would make sense then.
>>>>>>>>
>>>>>>>> I think that a turn-off-RFS interface would also want a way to figure
>>>>>>>> out where the flow would go without RFS. Can the network stack do
>>>>>>>> that (e.g. evaluate the rx indirection hash or whatever happens these
>>>>>>>> days)?
>>>>>>>>
>>>>>>> Yes. We need the rxhash and the CPU that packets are received on from
>>>>>>> the device for the socket. The former we already have; the latter
>>>>>>> might be done by adding a field to the skbuff to record the receiving
>>>>>>> CPU. Given the L4 hash and the interrupting CPU we can calculate the
>>>>>>> RPS CPU, which is where the packet would have landed with RFS off.
>>>>>>
>>>>>> Hmm. I think this would be useful for me. It would *definitely* be
>>>>>> useful for me if I could pin an RFS flow to a CPU of my choice.
>>>>>>
>>>>> Andy, can you elaborate a little more on your use case?
>>>>> I've thought several times about an interface to program the flow table
>>>>> from userspace, but never quite came up with a compelling use case, and
>>>>> there is the security concern that a user could "steal" cycles from
>>>>> arbitrary CPUs.
>>>>
>>>> I have a bunch of threads that are pinned to various CPUs or groups of
>>>> CPUs. Each thread is responsible for a fixed set of flows. I'd like
>>>> those flows to go to those CPUs.
>>>>
>>>> RFS will eventually do it, but it would be nice if I could
>>>> deterministically ask for a flow to be routed to the right CPU. Also,
>>>> if my thread bounces temporarily to another CPU, I don't really need
>>>> the flow to follow it -- I'd like it to stay put.
>>>>
>>> Okay, how about we have an SO_RFS_LOCK_FLOW sockopt. When this is
>>> called on a socket we can lock the socket-to-CPU binding to the
>>> current CPU it is called from. It could be unlocked at a later point.
>>> Would this satisfy your requirements?
>>
>> Yes, I think. Especially if it bypassed the hash table.
>
> Unfortunately we can't easily bypass the hash table. The only way I
> know of to do that is to perform the socket lookup to do steering
> (I tried that early on, but it was pretty costly).

What happens if you just call ndo_rx_flow_steer and do something to
keep the result from expiring?

>>
>>> As I mentioned, there is no material functionality in this patch and
>>> it should be independent of RFS. It simply returns the CPU where the
>>> stack processed the packet. Whether or not this is meaningful
>>> information to the algorithm being implemented in userspace is
>>> completely up to the caller to decide.
>>
>> Agreed.
>>
>> My only concern is that writing that userspace algorithm might result
>> in surprises if RFS is on. Having the user program notice the problem
>> early and alert the admin might help keep Murphy's Law at bay here.
>>
> By Murphy's law we'd also have to consider that the flow hash could
> change after reading the results, so that the scheduling done in
> userspace is completely wrong until the CPU is read again.
> Synchronizing kernel and device state with userspace state is not
> always so easy. One way to mitigate this is to use ancillary data,
> which would provide real-time information and obviate the need for
> another system call.

Hmm. That would work, too. I don't know how annoyed user code would
be at having to read ancillary data, though. The flow hash really
shouldn't change much, though, right?

--Andy
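
To make the "userspace algorithm" side of the discussion concrete, here is a
minimal sketch of how a pinned worker thread might poll the reported RX CPU
and compare it against its own. It assumes the SO_INCOMING_CPU sockopt from
the patch under discussion is available; the fallback define uses the
asm-generic value, and check_flow_placement() and its logging policy are
purely illustrative, not part of the patch.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49	/* asm-generic value; check your arch headers */
#endif

/* Ask the stack which CPU last processed a packet for this socket. */
static int incoming_cpu(int fd)
{
	int cpu = -1;
	socklen_t len = sizeof(cpu);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
		return -1;
	return cpu;
}

/*
 * Called from the worker thread that owns the socket.  If the stack is
 * handling the flow on a different CPU than the one this thread runs on,
 * report the mismatch; what to do about it (migrate the socket to another
 * worker, re-pin, or ignore) is application policy.
 */
void check_flow_placement(int fd)
{
	int rx_cpu = incoming_cpu(fd);
	int my_cpu = sched_getcpu();

	if (rx_cpu >= 0 && my_cpu >= 0 && rx_cpu != my_cpu)
		fprintf(stderr, "fd %d: stack on CPU %d, handler on CPU %d\n",
			fd, rx_cpu, my_cpu);
}

A worker might call check_flow_placement() after each batch of reads and only
act when the mismatch persists, since, as noted in the thread, a single
reading can already be stale by the time userspace acts on it.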
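
Tom's remark that the L4 hash plus the interrupting CPU is enough to know
where a packet would have landed with RFS off boils down to scaling the hash
onto the receiving queue's rps_cpus map. A schematic version of that mapping
follows; the struct is a simplified stand-in rather than the kernel's real
rps_map, and the per-queue map lookup itself is omitted.

#include <stdint.h>

/* Simplified stand-in for the kernel's per-queue RPS CPU map. */
struct rps_map_sketch {
	unsigned int len;	/* number of CPUs configured in rps_cpus */
	uint16_t cpus[];	/* the configured CPU ids */
};

/*
 * Pick the RPS target CPU for a flow: scale the 32-bit L4 hash onto the
 * map without a modulo (hash * len / 2^32), then index the CPU list.
 */
static inline unsigned int rps_target_cpu(const struct rps_map_sketch *map,
					  uint32_t l4_hash)
{
	return map->cpus[(uint32_t)(((uint64_t)l4_hash * map->len) >> 32)];
}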