Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Tue, Jan 02, 2024 at 04:42:42PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>>
>> > On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry.memverge@xxxxxxxxx> writes:
>> >>
>> >> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> >> > +{
>> >> > +	nodemask_t nodemask = pol->nodes;
>> >> > +	unsigned int target, weight_total = 0;
>> >> > +	int nid;
>> >> > +	unsigned char weights[MAX_NUMNODES];
>> >>
>> >> MAX_NUMNODES could be as large as 1024.  1KB stack space may be too
>> >> large?
>> >>
>> >
>> > I've been struggling with a good solution to this.  We need a local
>> > copy of the weights to prevent them from changing out from under us
>> > during allocation (which may take quite some time), but it seemed
>> > unwise to allocate 1KB on the heap in this particular path.
>> >
>> > Is my concern unfounded?  If so, I can go ahead and add the
>> > allocation code.
>>
>> Please take a look at NODEMASK_ALLOC().
>>
>
> This is not my question.  NODEMASK_ALLOC() calls kmalloc()/kfree().
>
> Some of the allocations on the stack can be replaced with a scratch
> allocation, that's no big deal.
>
> I'm specifically concerned about:
>
>   weighted_interleave_nid
>   alloc_pages_bulk_array_weighted_interleave
>
> I'm unsure whether kmalloc()/kfree() is safe (and non-offensive) in
> those contexts.  If kmalloc()/kfree() is safe, fine, this problem is
> trivial.
>
> If not, there is no good solution to this without pre-allocating a
> scratch area per-task.

You need to audit whether it's safe for all callers.  I guess that you
need to allocate pages after calling these functions, so you can use
the same GFP flags here.
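Something like the untested sketch below.  The extra "gfp" parameter
and the first_node() fallback are only for illustration, not a
concrete proposal:

static unsigned int weighted_interleave_nid(struct mempolicy *pol,
					    pgoff_t ilx, gfp_t gfp)
{
	unsigned char *weights;
	unsigned int nid;

	/*
	 * Reuse the GFP flags of the upcoming page allocation for the
	 * scratch buffer, so this never requires a less restrictive
	 * allocation context than the caller already does.
	 */
	weights = kmalloc(MAX_NUMNODES, gfp);
	if (!weights)
		return first_node(pol->nodes);	/* degraded fallback */

	/* ... copy the weights, sum them, and pick nid as before ... */

	kfree(weights);
	return nid;
}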
>> >> I don't think barrier() is needed to wait for memory operations
>> >> for stack.  It's usually used for cross-processor memory order.
>> >>
>> >
>> > This is present in the old interleave code.  To the best of my
>> > understanding, the concern is for mempolicy->nodemask rebinding
>> > that can occur when cgroups.cpusets.mems_allowed changes.
>> >
>> > So we can't iterate over (mempolicy->nodemask), we have to take a
>> > local copy.
>> >
>> > My *best* understanding of the barrier here is to prevent the
>> > compiler from reordering operations such that it attempts to
>> > optimize out the local copy (or do a lazy fetch).
>> >
>> > It is present in the original interleave code, so I pulled it
>> > forward to this, but I have not tested whether this is a bit
>> > paranoid or not.
>> >
>> > From `interleave_nid`:
>> >
>> >	/*
>> >	 * The barrier will stabilize the nodemask in a register or on
>> >	 * the stack so that it will stop changing under the code.
>> >	 *
>> >	 * Between first_node() and next_node(), pol->nodes could be
>> >	 * changed by other threads. So we put pol->nodes in a local
>> >	 * stack.
>> >	 */
>> >	barrier();
>>
>> Got it.  This is a kind of READ_ONCE() for the nodemask.  To avoid
>> adding comments all over the place, can we implement a wrapper for
>> it?  For example, memcpy_once().  __read_once_size() in
>> tools/include/linux/compiler.h can be used as a reference.
>>
>> Because node_weights[] may be changed simultaneously too, we may
>> need to consider a similar issue for it as well.  But RCU seems
>> more appropriate for node_weights[].
>>
>
> Weights are collected individually onto the stack because we have to
> sum them up before we actually apply the weights.
>
> A stale weight is not offensive.  RCU is not needed and doesn't help.

When you copy the weights from iw_table[] to the stack, it's possible
for the compiler to cache their contents in registers, or to merge or
split the memory operations.  At the same time, iw_table[] may be
changed concurrently via the sysfs interface.  So we need a mechanism
to guarantee that we read the latest contents consistently.

> The reason the barrier is needed is not weights, it's the nodemask.

Yes.  That is why I said that we need similar handling for the
weights.

> So you basically just want to replace barrier() with this and drop
> the copy/pasted comments:
>
> static void read_once_policy_nodemask(struct mempolicy *pol,
>				       nodemask_t *mask)
> {
>	/*
>	 * The barrier will stabilize the nodemask in a register or on
>	 * the stack so that it will stop changing under the code.
>	 *
>	 * Between first_node() and next_node(), pol->nodes could be
>	 * changed by other threads. So we put pol->nodes in a local
>	 * stack.
>	 */
>	barrier();
>	__builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
>	barrier();
> }
>
> - nodemask_t nodemask = pol->nodemask
> - barrier()
> + nodemask_t nodemask;
> + read_once_policy_nodemask(pol, &nodemask)
>
> Is that right?

Yes.  Something like that.  Or something even more general (it may
need to be optimized?):

static inline void memcpy_once(void *dst, void *src, size_t n)
{
	barrier();
	memcpy(dst, src, n);
	barrier();
}

memcpy_once(&nodemask, &pol->nodemask, sizeof(nodemask));

The comments can be based on those of READ_ONCE().
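For example, something like the sketch below.  The comment wording is
only a first draft adapted from READ_ONCE(), to be refined:

/*
 * memcpy_once() - copy *src to *dst exactly once
 *
 * Prevent the compiler from merging, refetching, or splitting the
 * copy, so that a source which may change concurrently (for example
 * pol->nodes, or iw_table[] updated via sysfs) is read consistently
 * into a local buffer.  Like READ_ONCE(), this is only a compiler
 * barrier; it does not order the accesses against other CPUs.
 */
static inline void memcpy_once(void *dst, const void *src, size_t n)
{
	barrier();
	memcpy(dst, src, n);
	barrier();
}

--
Best Regards,
Huang, Ying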