On Thu, 3 Sep 2009, Lee Schermerhorn wrote: > > > @@ -1451,17 +1507,143 @@ static void __init hugetlb_sysfs_init(vo > > > return; > > > > > > for_each_hstate(h) { > > > - err = hugetlb_sysfs_add_hstate(h); > > > + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, > > > + hstate_kobjs, &hstate_attr_group); > > > if (err) > > > printk(KERN_ERR "Hugetlb: Unable to add hstate %s", > > > h->name); > > > } > > > } > > > > > > +#ifdef CONFIG_NUMA > > > + > > > +struct node_hstate { > > > + struct kobject *hugepages_kobj; > > > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; > > > +}; > > > +struct node_hstate node_hstates[MAX_NUMNODES]; > > > + > > > +static struct attribute *per_node_hstate_attrs[] = { > > > + &nr_hugepages_attr.attr, > > > + &free_hugepages_attr.attr, > > > + &surplus_hugepages_attr.attr, > > > + NULL, > > > +}; > > > + > > > +static struct attribute_group per_node_hstate_attr_group = { > > > + .attrs = per_node_hstate_attrs, > > > +}; > > > + > > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp) > > > +{ > > > + int nid; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > + struct node_hstate *nhs = &node_hstates[nid]; > > > + int i; > > > + for (i = 0; i < HUGE_MAX_HSTATE; i++) > > > + if (nhs->hstate_kobjs[i] == kobj) { > > > + if (nidp) > > > + *nidp = nid; > > > + return &hstates[i]; > > > + } > > > + } > > > + > > > + BUG(); > > > + return NULL; > > > +} > > > + > > > +void hugetlb_unregister_node(struct node *node) > > > +{ > > > + struct hstate *h; > > > + struct node_hstate *nhs = &node_hstates[node->sysdev.id]; > > > + > > > + if (!nhs->hugepages_kobj) > > > + return; > > > + > > > + for_each_hstate(h) > > > + if (nhs->hstate_kobjs[h - hstates]) { > > > + kobject_put(nhs->hstate_kobjs[h - hstates]); > > > + nhs->hstate_kobjs[h - hstates] = NULL; > > > + } > > > + > > > + kobject_put(nhs->hugepages_kobj); > > > + nhs->hugepages_kobj = NULL; > > > +} > > > + > > > +static void hugetlb_unregister_all_nodes(void) > > > +{ > > > + int nid; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) > > > + hugetlb_unregister_node(&node_devices[nid]); > > > + > > > + register_hugetlbfs_with_node(NULL, NULL); > > > +} > > > + > > > +void hugetlb_register_node(struct node *node) > > > +{ > > > + struct hstate *h; > > > + struct node_hstate *nhs = &node_hstates[node->sysdev.id]; > > > + int err; > > > + > > > + if (nhs->hugepages_kobj) > > > + return; /* already allocated */ > > > + > > > + nhs->hugepages_kobj = kobject_create_and_add("hugepages", > > > + &node->sysdev.kobj); > > > + if (!nhs->hugepages_kobj) > > > + return; > > > + > > > + for_each_hstate(h) { > > > + err = hugetlb_sysfs_add_hstate(h, nhs->hugepages_kobj, > > > + nhs->hstate_kobjs, > > > + &per_node_hstate_attr_group); > > > + if (err) { > > > + printk(KERN_ERR "Hugetlb: Unable to add hstate %s" > > > + " for node %d\n", > > > + h->name, node->sysdev.id); > > > > Maybe add `err' to the printk so we know whether it was an -ENOMEM > > condition or sysfs problem? > > Just the raw negative number? > Sure. I'm making the assumption that the printk is actually necessary in the first place, which is rather unorthodox for functions that can otherwise silently recover by unregistering the attribute. Using the printk implies you want to know about the failure, yet additional debugging would be necessary to even identify what the failure was without printing the errno. > > > > > + hugetlb_unregister_node(node); > > > + break; > > > + } > > > + } > > > +} > > > + > > > +static void hugetlb_register_all_nodes(void) > > > +{ > > > + int nid; > > > + > > > + for (nid = 0; nid < nr_node_ids; nid++) { > > > > Don't you want to do this for all nodes in N_HIGH_MEMORY? I don't think > > we should be adding attributes for memoryless nodes. > > > Well, I wondered about that. The persistent huge page allocation code > is careful to skip over nodes where it can't allow a huge page there, > whether or not the node has [any] memory. So, it's safe to leave it > this way. It's safe, but it seems inconsistent to allow hugepage attributes to appear in /sys/devices/node/node* for nodes that have no memory. > And, I was worried about the interaction with memory hotplug, > as separate from node hotplug. The current code handles node hot plug, > bug I wasn't sure about memory hot-plug within a node. If memory hotplug doesn't update nodes_state[N_HIGH_MEMORY] then there are plenty of other users that will currently fail as well. > I.e., would the > node driver get called to register the attributes in this case? Not without a MEM_ONLINE notifier. > Maybe > that case doesn't exist, so I don't have to worry about it. I think > this is somewhat similar to the top cpuset mems_allowed being set to all > possible to cover any subsequently added nodes/memory. > That's because the page allocator's zonelists won't try to allocate from a memoryless node and the only hook into the cpuset code in that path is to check whether a nid is set in cpuset_current_mems_allowed. It's quite different from providing per-node allocation and freeing mechanisms for pages on nodes without memory like this approach. > > > Index: linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h > > > =================================================================== > > > --- linux-2.6.31-rc7-mmotm-090827-0057.orig/include/linux/numa.h 2009-08-28 09:21:17.000000000 -0400 > > > +++ linux-2.6.31-rc7-mmotm-090827-0057/include/linux/numa.h 2009-08-28 09:21:31.000000000 -0400 > > > @@ -10,4 +10,6 @@ > > > > > > #define MAX_NUMNODES (1 << NODES_SHIFT) > > > > > > +#define NO_NODEID_SPECIFIED (-1) > > > + > > > #endif /* _LINUX_NUMA_H */ > > > > Hmm, so we already have NUMA_NO_NODE in the ia64 and x86_64 code and > > NID_INVAL in the ACPI code, both of which are defined to -1. Maybe rename > > your addition here in favor of NUMA_NO_NODE, remove it from the ia64 and > > x86 arch headers, and convert the ACPI code? > > OK, replacing 'NO_NODEID_SPECIFIED' with 'NUMA_NO_NODE,' works w/o > decending into header dependency hell. The symbol is already visible in > hugetlb.c I'll fix that. NUMA_NO_NODE may be visible in hugetlb.c for ia64 and x86, but probably not for other architectures so it should be moved to include/linux/numa.h. > But, ACPI? Not today, thanks :). Darn :) -- To unsubscribe from this list: send the line "unsubscribe linux-numa" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html