On Mon, 3 Jun 2013 23:29:40 +0400 Glauber Costa <glommer@xxxxxxxxxx> wrote:

> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Now that we have an LRU list API, we can start to enhance the
> implementation.  This splits the single LRU list into per-node lists
> and locks to enhance scalability.

Do we have any runtime measurements?  They're pretty important for
justifying inclusion of the code.

Measurements for non-NUMA and uniprocessor kernels would be useful in
making that decision as well.  In fact a lot of the patchset is likely
to be injurious to small machines.  We should quantify this and then
persuade ourselves that the large-machine gains are worth the
small-machine losses.

> Items are placed on lists
> according to the node the memory belongs to.  To make scanning the
> lists efficient, also track whether the per-node lists have entries
> in them in an active nodemask.
>
> Note:
> We use a fixed-size array for the node LRU, this struct can be very big
> if MAX_NUMNODES is big.  If this becomes a problem this is fixable by
> turning this into a pointer and dynamically allocating this to
> nr_node_ids.  This quantity is firmware-provided, and still would
> provide room for all nodes at the cost of a pointer lookup and an extra
> allocation.  Because that allocation will most likely come from a
> different slab cache than the main structure holding this structure, we
> may very well fail.

Surprised.  How big is MAX_NUMNODES likely to get?

lib/flex_array.c might be of use here.

>
> ...
>
> -struct list_lru {
> +struct list_lru_node {
>  	spinlock_t		lock;
>  	struct list_head	list;
>  	long			nr_items;
> +} ____cacheline_aligned_in_smp;
> +
> +struct list_lru {
> +	/*
> +	 * Because we use a fixed-size array, this struct can be very big if
> +	 * MAX_NUMNODES is big.  If this becomes a problem this is fixable by
> +	 * turning this into a pointer and dynamically allocating this to
> +	 * nr_node_ids.  This quantity is firmware-provided, and still would
> +	 * provide room for all nodes at the cost of a pointer lookup and an
> +	 * extra allocation.  Because that allocation will most likely come
> +	 * from a different slab cache than the main structure holding this
> +	 * structure, we may very well fail.
> +	 */
> +	struct list_lru_node	node[MAX_NUMNODES];
> +	nodemask_t		active_nodes;

Some documentation of the data structure would be helpful.

It appears that active_nodes tracks (ie: duplicates) node[x].nr_items != 0.

It's unclear that active_nodes is really needed - we could just iterate
across all items in list_lru.node[].  Are we sure that the correct
tradeoff decision was made here?

What's the story on NUMA node hotplug, btw?

> };
>
>
> ...
>
>  unsigned long
> -list_lru_walk(
> -	struct list_lru	*lru,
> -	list_lru_walk_cb isolate,
> -	void		*cb_arg,
> -	unsigned long	nr_to_walk)
> +list_lru_count(struct list_lru *lru)
>  {
> +	long count = 0;
> +	int nid;
> +
> +	for_each_node_mask(nid, lru->active_nodes) {
> +		struct list_lru_node *nlru = &lru->node[nid];
> +
> +		spin_lock(&nlru->lock);
> +		BUG_ON(nlru->nr_items < 0);

This is buggy.  The bit in lru->active_nodes could be cleared by now.
We can only make this assertion if we recheck lru->active_nodes[nid]
inside the spinlocked region.

> +		count += nlru->nr_items;
> +		spin_unlock(&nlru->lock);
> +	}
> +
> +	return count;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_count);
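Something like the below (untested sketch - it assumes a node's
active_nodes bit is only set and cleared while that node's nlru->lock is
held) would keep the assertion valid:

unsigned long
list_lru_count(struct list_lru *lru)
{
	long count = 0;
	int nid;

	for_each_node_mask(nid, lru->active_nodes) {
		struct list_lru_node *nlru = &lru->node[nid];

		spin_lock(&nlru->lock);
		/*
		 * The bit was tested outside the lock and may have been
		 * cleared in the meantime: recheck it under the lock
		 * before asserting anything about nr_items.
		 */
		if (node_isset(nid, lru->active_nodes))
			BUG_ON(nlru->nr_items < 0);
		count += nlru->nr_items;
		spin_unlock(&nlru->lock);
	}

	return count;
}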
list_lru_count()'s return value is of course approximate.  If callers
require that the returned value be exact, they will need to provide
their own locking on top of list_lru's internal locking (which would
then become redundant).

This is the sort of thing which should be discussed in the interface
documentation.

list_lru_count() can be very expensive.

>
> ...
>
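As a starting point, a kerneldoc comment along these lines in the header
(wording is only a suggestion) would cover both points:

/**
 * list_lru_count - approximate count of items on an LRU
 * @lru: the lru to count
 *
 * Takes every active per-node lock in turn, so this can be expensive
 * on large machines.  The result is approximate: items may be added
 * or removed while the walk is in progress.  Callers which need an
 * exact count must provide their own exclusion on top of list_lru's
 * internal locking.
 */
unsigned long list_lru_count(struct list_lru *lru);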