Hi Tejun,

I missed acknowledging your point that we need both a hard limit and a
soft limit/weight. The current patchset is based only on hard limits. I
see weight as another helpful layer in the chain that we can implement
after this one as an incremental step, which keeps review and debugging
manageable.
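To spell out what I mean by the hard-limit model, below is a rough
sketch (illustrative only - not the actual patch code, and all names in
it are made up): the controller charges a per-device, per-resource pool
and fails the allocation once the configured absolute limit is crossed.
A weight/soft-limit layer added later would instead compare usage
against a share derived from sibling weights before failing.

/* Illustrative sketch only; not the patch code, names are made up. */
#include <errno.h>

enum rdmacg_res { RDMACG_QP, RDMACG_MR, RDMACG_CQ, RDMACG_RES_MAX };

struct rdmacg_pool {
        int limit[RDMACG_RES_MAX];      /* absolute limits set by admin */
        int usage[RDMACG_RES_MAX];      /* resources currently charged */
};

/* Called when a task in the cgroup allocates an RDMA resource. */
static int rdmacg_try_charge(struct rdmacg_pool *pool,
                             enum rdmacg_res res, int num)
{
        if (pool->usage[res] + num > pool->limit[res])
                return -EAGAIN;         /* hard limit: fail, no borrowing */
        pool->usage[res] += num;
        return 0;
}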
Parav

On Mon, Sep 14, 2015 at 4:39 PM, Parav Pandit <pandit.parav@xxxxxxxxx> wrote:
> On Sat, Sep 12, 2015 at 1:36 AM, Hefty, Sean <sean.hefty@xxxxxxxxx> wrote:
>>> > Trying to limit the number of QPs that an app can allocate,
>>> > therefore, just limits how much of the address space an app can use.
>>> > There's no clear link between QP limits and HW resource limits,
>>> > unless you assume a very specific underlying implementation.
>>>
>>> Isn't that the point though? We have several vendors with hardware
>>> that does impose hard limits on specific resources. There is no way to
>>> avoid that, and ultimately, those exact HW resources need to be
>>> limited.
>>
>> My point is that limiting the number of QPs that an app can allocate
>> doesn't necessarily mean anything. Is allocating 1000 QPs with 1 entry
>> each better or worse than 1 QP with 10,000 entries? Who knows?
>
> I think it means that, for an RDMA RC QP, it decides whether you can
> talk to 1000 nodes or to 1 node in the network.
> When we deploy an MPI application, it knows its rank, we know the
> cluster size of the deployment, and based on that the resource
> allocation can be done.
> If you meant it from a performance point of view, then resource count
> is possibly not the right measure.
>
> Just because we have not defined those interfaces for performance today
> in this patch set doesn't mean that we won't do it.
> I could easily see number_of_messages/sec as one interface to be added
> in the future.
> But that alone won't stop resource hoarders from taking away all the
> QPs, just as we needed the PID controller.
>
> Now, when it comes to the Intel implementation: the driver layer could
> know (via new APIs in the future) whether 10 or 100 user QPs should map
> to a few hw QPs or to more hw QPs (uSNIC), so that the hw QPs exposed
> to one cgroup are isolated from the hw QPs exposed to another cgroup.
> If the hw implementation doesn't require isolation, it can just keep
> allocating from a single pool; how to use this information is left to
> the vendor implementation (this API is not present in the patch).
>
> So the cgroup also provides a control point for the vendor layer to
> tune internal resource allocation based on the provided metrics, which
> cannot be done by only providing "memory usage by RDMA structures".
>
> If I compare it with other cgroup knobs, the low-level individual knobs
> by themselves don't serve any meaningful purpose either.
> Just by defining how much CPU to use or how much memory to use, you
> cannot define the application performance either.
> I am not sure whether the io controller can achieve 10 million IOPS
> with only a single CPU and 64KB of memory defined.
> All the knobs need to be set the right way to reach the desired number.
>
> In the same way, individual RDMA resource knobs are not a definition of
> performance; they are just another set of knobs.
>
>>
>>> If we want to talk about abstraction, then I'd suggest something very
>>> general and simple - two limits:
>>> '% of the RDMA hardware resource pool' (per device or per ep?)
>>> 'bytes of kernel memory for RDMA structures' (all devices)
>>
>> Yes - this makes more sense to me.
>>
>
> Sean, Jason,
> Help me to understand this scheme.
>
> 1. How is a percentage of a resource different from an absolute number?
> With the rest of the cgroup subsystems we define absolute numbers in
> most places, to my knowledge - such as (a) number_of_tcp_bytes,
> (b) IOPS of a block device, (c) CPU cycles, etc.
> 20% of QPs = 20 QPs when the hw has 100 QPs.
> I prefer to keep the scheme consistent with the other resource control
> points, i.e. absolute numbers.
>
> 2. Bytes of kernel memory for RDMA structures:
> One QP of one vendor might consume X bytes and another's Y bytes. How
> does the application know how much memory to ask for?
> An application can allocate 100 QPs of 1 entry each or 1 QP of 100
> entries, as in Sean's example; both might consume almost the same
> memory.
> An application allocating 100 QPs, while still within the cgroup's
> memory limit, leaves the other applications without any QPs.
> I don't see the point of a memory-footprint-based scheme, as memory
> limits are already well addressed by the much smarter memory controller
> anyway.
>
> I do agree with Tejun and Sean on the point that the abstraction level
> for using RDMA has to be different, and that is why libfabrics and
> other interfaces are emerging, which will take their own time to
> stabilize and get integrated.
>
> As long as the pure IB-style, resource-based RDMA programming model
> exists, I think the control point also has to be on those resources.
> Once a stable abstraction level is on the table (possibly across
> fabrics, not just RDMA), then the right resource controller for it can
> be implemented.
> Even when an RDMA abstraction layer arrives, as Jason mentioned, at the
> end it would consume some hw resources anyway, and those need to be
> controlled too.
>
> Jason,
> If the hardware vendor defines the resource pool without saying whether
> the resource is a QP or an MR, how can the management/control point
> decide what should be limited, and to what value?
> We would then need an additional user-space library component to decode
> it, after which it would need to be abstracted as QPs or MRs anyway so
> that the application layer can deal with it in a vendor-agnostic way -
> and then it would look similar to what is being proposed here?
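Coming back to Sean's 1000 x 1-entry QPs vs. one 10,000-entry QP
example, and the 100 QPs of 1 entry each in point 2 above: the sketch
below (plain libibverbs, assuming a 'pd' and 'cq' created elsewhere,
error handling omitted) shows why the two shapes look roughly the same
to a bytes-of-memory limit but very different to a QP-count limit - the
count decides how many peers an RC application can reach, while the
depth mostly decides memory.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Illustration only: the same verbs call, different resource 'shape'. */
static struct ibv_qp *create_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                                   uint32_t depth)
{
        struct ibv_qp_init_attr attr = {
                .send_cq = cq,
                .recv_cq = cq,
                .qp_type = IBV_QPT_RC,
                .cap = {
                        .max_send_wr  = depth,  /* work queue depth */
                        .max_recv_wr  = depth,
                        .max_send_sge = 1,
                        .max_recv_sge = 1,
                },
        };

        return ibv_create_qp(pd, &attr);
}

/* 1000 shallow QPs:  for (i = 0; i < 1000; i++) create_rc_qp(pd, cq, 1);
 * one deep QP:       create_rc_qp(pd, cq, 10000);                       */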