On Thu, Jan 04, 2024 at 01:39:31PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>
> > On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
> >> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
> >> > I'm specifically concerned about:
> >> >   weighted_interleave_nid
> >> >   alloc_pages_bulk_array_weighted_interleave
> >> >
> >> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> >> > contexts. If kmalloc/kfree is safe, fine, this problem is trivial.
> >> >
> >> > If not, there is no good solution to this without pre-allocating a
> >> > scratch area per-task.
> >>
> >> You need to audit whether it's safe for all callers. I guess that you
> >> need to allocate pages after calling, so you can use the same GFP flags
> >> here.
> >
> > After picking away at it, I realized that this code is usually going to
> > get called during page fault handling - duh. So kmalloc is almost never
> > safe (or can fail), and it's nasty to try to handle those errors.
>
> Why not just OOM for allocation failure?

Two notes:

1) Callers of weighted_interleave_nid do not expect OOM conditions; they
   expect a node selection. On error, we would simply return the local
   NUMA node without any indication of failure.

2) Callers of alloc_pages_bulk_array_weighted_interleave receive the
   total number of pages allocated. They are expected to detect pages
   allocated != pages requested, and then decide whether to OOM or
   simply retry (allocation may fail for a variety of reasons).

By introducing an allocation into this path, an allocation failure would
essentially have to be silently squashed: we would return either the
local node (weighted_interleave_nid) or 0 (bulk allocator) and let the
allocation logic handle any subsequent OOM condition.

That felt less desirable than just allocating a scratch space up front
in the mempolicy and avoiding the issue altogether.
> > Instead of doing that, I simply chose to implement the scratch space
> > in the mempolicy structure:
> >
> >     mempolicy->wil.scratch_weights[MAX_NUMNODES]
> >
> > We eat an extra 1KB of memory in the mempolicy, but it gives us a safe
> > scratch space we can use any time the task is allocating memory, and
> > prevents the need for any fancy error handling. That seems like a
> > perfectly reasonable tradeoff.
>
> I don't think that this is a good idea. The weight array is temporary.

It's temporary, but it's also only used in the context of the task while
the alloc lock is held.

If you think it's fine to introduce another potential OOM-generating
spot, then I'll just go ahead and allocate the memory on the fly.

I do want to point out, though, that weighted_interleave_nid is called
per allocated page. So now we're not just collecting weights to
calculate the offset, we're doing an allocation (that can fail) for each
page allocated in that region. The bulk allocator amortizes the cost of
this allocation by doing it once while allocating a chunk of pages - but
weighted_interleave_nid is called per-page.

By comparison, the memory cost of a static scratch area in the mempolicy
struct is only incurred by tasks that actually have a mempolicy. So
we're talking ~1MB across 1024 threads with mempolicies to avoid error
conditions mid-page-allocation and to reduce the cost of applying
weighted interleave.

~Gregory