On 01/18/2013 12:08 AM, Dave Chinner wrote:
> On Thu, Jan 17, 2013 at 04:51:03PM -0800, Glauber Costa wrote:
>> On 01/17/2013 04:10 PM, Dave Chinner wrote:
>>> and we end up with:
>>>
>>> lru_add(struct lru_list *lru, struct lru_item *item)
>>> {
>>> 	node_id = min(object_to_nid(item), lru->numnodes);
>>>
>>> 	__lru_add(lru, node_id, &item->global_list);
>>> 	if (memcg) {
>>> 		memcg_lru = find_memcg_lru(lru->memcg_lists, memcg_id);
>>> 		__lru_add(memcg_lru, node_id, &item->memcg_list);
>>> 	}
>>> }
>>
>> A follow-up thought: if we have multiple memcgs, and global pressure
>> kicks in (meaning none of them is particularly under pressure),
>> shouldn't we try to maintain fairness among them and reclaim equal
>> proportions from them all, the same way we do with sb's these days,
>> for instance?
>
> I don't like the complexity. The global lists will be reclaimed in
> LRU order, so it's going to be as fair as can be. If there's a memcg
> that has older unused objects than the others, then from a global
> perspective they should be reclaimed first, because the memcg is not
> using them...

Disclaimer: I don't necessarily disagree with you, but let's explore
the playing field...

How do we know? The whole point of memcg is maintaining best-effort
isolation between logically separated entities. If we need to reclaim
1k dentries, and your particular memcg happens to hold the first 1k
dentries in the LRU, it is not obvious that you should be hurt more
than the others just because you have been idle for longer.

You could of course argue that under global reclaim we should lift our
fairness attempts, especially given the cost and complexity
trade-offs, but still.

>> I would argue that if your memcg is small, the list of dentries is
>> small: scanning it all for the nodes you want shouldn't hurt.
>
> on the contrary - the memcg might be small, but what happens if
> someone ran a find across all the filesystems on the system in it?
> Then the LRU will be huge, and scanning expensive...
The memcg being small means that the size of the list is limited, no?
It can never use more memory than X bytes, which translates to Y
entries. If it tries to go beyond that, it will trigger reclaim on
itself.

> We can't make static decisions about small and large, and we can't
> trust heuristics to get it right, either. If we have a single list,
> we don't/can't do node-aware reclaim efficiently and so shouldn't
> even try.
>
>> if the memcg is big, it will have per-node lists anyway.
>
> But may have no need for them due to the workload. ;)

Sure.

>> Given that, do we really want to pay the price of two list_heads
>> in the objects?
>
> I'm just looking at ways of making the infrastructure sane. If the
> cost is an extra 16 bytes per object on an LRU, then that is a small
> price to pay for having robust memory reclaim infrastructure...

We still need to weigh that against a solution that uses some form of
dynamic node-list allocation, which would let us address online nodes
instead of possible nodes. That way, the memory overhead may well be
bounded tightly enough that everybody gets to be per-node.