Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Thu, Jan 04, 2024 at 01:39:31PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>>
>> > On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>> >> > I'm specifically concerned about:
>> >> > weighted_interleave_nid
>> >> > alloc_pages_bulk_array_weighted_interleave
>> >> >
>> >> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
>> >> > contexts. If kmalloc/kfree is safe, fine, this problem is trivial.
>> >> >
>> >> > If not, there is no good solution to this without pre-allocating a
>> >> > scratch area per-task.
>> >>
>> >> You need to audit whether it's safe for all callers. I guess that you
>> >> need to allocate pages after calling, so you can use the same GFP flags
>> >> here.
>> >>
>> >
>> > After picking away I realized that this code is usually going to get
>> > called during page fault handling - duh. So kmalloc is almost never
>> > safe (or can fail), and it's nasty to try to handle those errors.
>>
>> Why not just OOM for allocation failure?
>>
>
> 2 notes:
>
> 1) callers of weighted_interleave_nid do not expect OOM conditions, they
> expect a node selection. On error, we would simply return the local
> numa node without indication of failure.
>
> 2) callers of alloc_pages_bulk_array_weighted_interleave receive the
> total number of pages allocated, and they are expected to detect
> pages allocated != pages requested, and then handle whether to
> OOM or simply retry (allocation may fail for a variety of reasons).
>
> By introducing an allocation into this area, if an allocation failure
> occurs, we would essentially need to silently squash it and return
> either local_node (interleave_nid) or return 0 (bulk allocator) and
> allow the allocation logic to handle any subsequent OOM condition.
>
> That felt less desirable than just allocating a scratch space up front
> in the mempolicy and avoiding the issue altogether.
>
>> > Instead of doing that, I simply chose to implement the scratch space
>> > in the mempolicy structure
>> >
>> > mempolicy->wil.scratch_weights[MAX_NUMNODES].
>> >
>> > We eat an extra 1kb of memory in the mempolicy, but it gives us a safe
>> > scratch space we can use any time the task is allocating memory, and
>> > prevents the need for any fancy error handling. That seems like a
>> > perfectly reasonable tradeoff.
>>
>> I don't think that this is a good idea. The weight array is temporary.
>>
>
> It's temporary, but it's also only used in the context of the task while
> the alloc lock is held.
>
> If you think it's fine to introduce another potential OOM generating
> spot, then I'll just go ahead and allocate the memory on the fly.
>
> I do want to point out, though, that weighted_interleave_nid is called
> per allocated page. So now we're not just collecting weights to
> calculate the offset, we're doing an allocation (that can fail) per page
> allocated for that region.
>
> The bulk allocator amortizes the cost of this allocation by doing it
> once while allocating a chunk of pages - but the weighted_interleave_nid
> function is called per-page.
>
> By comparison, the memory cost to just allocate a static scratch area in
> the mempolicy struct is only incurred by tasks with a mempolicy.
>
>
> So we're talking ~1MB for 1024 threads with mempolicies to avoid error
> conditions mid-page-allocation and to reduce the cost associated with
> applying weighted interleave.
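
To make the proposal above concrete, the per-mempolicy scratch space
being described would look roughly like the sketch below. Only the
wil.scratch_weights[MAX_NUMNODES] field and the ~1KB cost come from the
message above; the surrounding type names and layout are illustrative
assumptions, not the actual patch.

struct weighted_interleave_state {
	/*
	 * ~1KB per mempolicy (one u8 weight per possible node), allocated
	 * together with the policy itself, so neither
	 * weighted_interleave_nid() nor the bulk allocator needs kmalloc()
	 * in the page-fault path.
	 */
	u8 scratch_weights[MAX_NUMNODES];
};

struct mempolicy {
	/* ... existing fields (refcount, mode, nodes, ...) ... */
	struct weighted_interleave_state wil;
};

Per-page callers would presumably snapshot the global weights into
pol->wil.scratch_weights before picking a node, which is why the scratch
area is only touched while the task is allocating.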
Think about this again. Why do we need a weights array on the stack? I
think it is used to keep the weights consistent. If so, we don't need
the array on the stack; just use RCU to access the global weights array.

--
Best Regards,
Huang, Ying
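
For illustration, the RCU-based access being suggested here might look
roughly like the sketch below. The iw_table and iw_table_lock names and
the simplified function signature are assumptions for this sketch, not
taken from the thread.

static u8 __rcu *iw_table;		/* MAX_NUMNODES weights; missing/0 means 1 */
static DEFINE_MUTEX(iw_table_lock);	/* serializes table updates */

/* Simplified signature; weighted interleave policies always have a
 * non-empty pol->nodes, so total is never zero. */
static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
{
	nodemask_t nodemask = pol->nodes;
	unsigned int total = 0, target;
	u8 *table;
	int nid;

	rcu_read_lock();
	table = rcu_dereference(iw_table);

	/*
	 * The whole computation sits under a single rcu_read_lock(), so
	 * every node is weighted against the same version of the table;
	 * no on-stack or per-mempolicy copy is needed for consistency.
	 */
	for_each_node_mask(nid, nodemask)
		total += (table && table[nid]) ? table[nid] : 1;

	target = ilx % total;
	for_each_node_mask(nid, nodemask) {
		u8 w = (table && table[nid]) ? table[nid] : 1;

		if (target < w)
			break;
		target -= w;
	}
	rcu_read_unlock();

	return nid;
}

/*
 * Writers (e.g. a sysfs store path) would build a new table, publish it
 * with rcu_assign_pointer(iw_table, new) under iw_table_lock, and free
 * the old table after synchronize_rcu().
 */

The consistency then comes from the read-side critical section rather
than from copying the weights, so neither kmalloc() in the fault path
nor a ~1KB scratch area in struct mempolicy is needed.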