On Wed, Mar 2, 2016 at 11:09 PM, Tejun Heo <tj@xxxxxxxxxx> wrote: > Hello, > >> +struct rdma_cgroup { >> + struct cgroup_subsys_state css; >> + >> + spinlock_t rpool_list_lock; /* protects resource pool list */ >> + struct list_head rpool_head; /* head to keep track of all resource >> + * pools that belongs to this cgroup. >> + */ > > I think we usually don't tail wing these comments. ok. I will put comments in separate line. > >> +}; >> + >> +struct rdmacg_pool_info { >> + const char **resource_name_table; >> + int table_len; > > Align fields consistently? I've said this multiple times now but > please make the available resources constant and document them in > Documentation/cgroup-v2.txt. o.k. I will align them. o.k. I will document the resource constants defined by IB stack in cgroup-v2.txt. >> + >> +/* APIs for RDMA/IB stack to publish when a device wants to >> + * participate in resource accounting >> + */ > > Please follow CodingStyle. Wasn't this pointed out a couple times > already? Yes. You did. I fixed at few places. I missed this out. Sorry. I am doing now. > >> +enum rdmacg_file_type { >> + RDMACG_RESOURCE_MAX, >> + RDMACG_RESOURCE_STAT, >> +}; > > Heh, usually not a good sign. If you're using this to call into a > common function and then switch out on the file type, just switch out > at the method level and factor out common part into helpers. > Methods for both the constants are same. Code changes between two file type is hardly 4 lines of code. So there is no need of additional helper functions. So in currently defined functions rdmacg_resource_read() and rdmacg_resource_set_max() works on file type. >> +/* resource tracker per resource for rdma cgroup */ >> +struct rdmacg_resource { >> + int max; >> + int usage; >> +}; > > Align fields? Above one seems to be aligned. Not sure what am I missing. I am aligning all the structures anyways. > >> +/** > > The above indicates kerneldoc comments, which this isn't. Fixed for this comment. > >> + * resource pool object which represents, per cgroup, per device >> + * resources. There are multiple instance >> + * of this object per cgroup, therefore it cannot be embedded within >> + * rdma_cgroup structure. It is maintained as list. > > Consistent paragraph fill? Fixed. > >> + */ >> +struct rdmacg_resource_pool { >> + struct list_head cg_list; >> + struct list_head dev_list; >> + >> + struct rdmacg_device *device; >> + struct rdmacg_resource *resources; >> + struct rdma_cgroup *cg; /* owner cg used during device cleanup */ >> + >> + int refcnt; /* count active user tasks of this pool */ >> + int num_max_cnt; /* total number counts which are set to max */ >> +}; > > Formatting. Aligning all the fields now in next patch. > >> +static inline void set_resource_limit(struct rdmacg_resource_pool *rpool, >> + int index, int new_max) > > Is inline necessary? Compiler can figure out these. Yes. Removed. > >> +static struct rdmacg_resource_pool* > ^ > space > >> +find_cg_rpool_locked(struct rdma_cgroup *cg, >> + struct rdmacg_device *device) > ... >> +static void uncharge_cg_resource(struct rdma_cgroup *cg, >> + struct rdmacg_device *device, >> + int index, int num) >> +{ > ... >> + rpool->refcnt--; >> + if (rpool->refcnt == 0 && rpool->num_max_cnt == pool_info->table_len) { > > If the caller charges 2 and then uncharges 1 two times, the refcnt > underflows? Why not just track how many usages are zero? > This is certainly must fix bug. Changed refcnt to usage_sum and changed to do usage_sum -= num during uncharging and usage_sum += num during charing. > ... >> +void rdmacg_uncharge(struct rdma_cgroup *cg, >> + struct rdmacg_device *device, >> + int index, int num) >> +{ >> + struct rdma_cgroup *p; >> + >> + for (p = cg; p; p = parent_rdmacg(p)) >> + uncharge_cg_resource(p, device, index, num); >> + >> + css_put(&cg->css); >> +} >> +EXPORT_SYMBOL(rdmacg_uncharge); > > So, I suppose the code is trying to save lock contention with > fine-grained locking; Yes. > however, as the code is currently structured, > it's just increasing the number of lock ops and it'd be highly likely > to cheaper to simply use a single lock. Single lock per subsystem? I understood that you were ok to have per cgroup fine grain lock. > If you're working up the tree > grabbing lock at each level, per-node locking doesn't save you > anything. Also, it introduces conditions where charges are spuriously > denied when there are racing requestors. If scalability becomes an > issue, the right way to address is adding percpu front cache. > >> +void rdmacg_query_limit(struct rdmacg_device *device, >> + int *limits) >> +{ >> + struct rdma_cgroup *cg, *p; >> + struct rdmacg_resource_pool *rpool; >> + struct rdmacg_pool_info *pool_info; >> + int i; >> + >> + cg = task_rdmacg(current); >> + pool_info = &device->pool_info; > > Nothing seems to prevent @cg from going away if this races with > @current being migrated to a different cgroup. Have you run this with > lockdep and rcu debugging enabled? This should have triggered a > warning. No. debugging was disabled. I will enable and run. Help me little to understand, Even if above function races with migration, do you mean cg can be freed before css_get is executed? If yes, than I guess I need subsystem level lock under which I need to perform css_get()? If not, whatever cg was returned, that should be ok as long as cg reference is hold. May be I am missing something here. > > ... >> + for (p = cg; p; p = parent_rdmacg(p)) { >> + spin_lock(&cg->rpool_list_lock); >> + rpool = find_cg_rpool_locked(cg, device); >> + if (rpool) { >> + for (i = 0; i < pool_info->table_len; i++) >> + limits[i] = min_t(int, limits[i], >> + rpool->resources[i].max); > > So, the O complexity wise, things like the above are pretty bad and > the above pattern is used in hot paths. No. Its hot path. Typically applications do query configuration once is their life cycle and allocate resources more often. > I suppose there can only be a handful of devices per system? Right. I have described that in the document as well. Typically there are 1 to 4 devices in system and since the lock is per cgroup, though this loop appears to be heavy on O complexity wise, its not that bad. Hierarchical limit honoring is anyway default behavior we are adhering to. > > Thanks. > > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html