On 05/18/2017 07:07 PM, Christoph Lameter wrote:
> On Thu, 18 May 2017, Vlastimil Babka wrote:
>
>>> The race is where? If you expand the node set during the move of the
>>> application then you are safe in terms of the legacy apps that did not
>>> include static bindings.
>>
>> No, that expand/shrink by itself doesn't work against parallel
>
> Parallel? I think we are clear that this is inherently racy against the
> app changing policies etc etc? There is a huge issue there already. The
> app needs to be well behaved in some heretofore undefined way in order to
> make moves clean.

The code is safe against mbind() changing a vma's mempolicy in parallel
with another thread page faulting within that vma, because mbind() takes
mmap_sem for write, and page faults take it for read. An application
changes its per-task mempolicy by calling set_mempolicy(), which acts on
the calling task, so the task is not allocating in parallel with the
change. So, the application never needed to be "well behaved" wrt
changing its own mempolicies.

Now with mempolicy rebinding due to cpuset migrations, the application
cannot be "well behaved", as it has no way to learn that it is in a
cpuset, or that the cpuset has changed. Any application can be put in a
cpuset, and we can't really expect that all of them would be adapted,
even if the necessary interfaces existed. Thus, the rebinding
implementation in the kernel itself has to be robust against parallel
allocations.

>> get_page_from_freelist going through a zonelist. Moving from node 0 to
>> 1, with zonelist containing nodes 1 and 0 in that order:
>>
>> - mempolicy mask is 0
>> - zonelist iteration checks node 1, it's not allowed, skip
>
> There is an allocation from node 1?

Sorry, I forgot to mention the full scenario. Let's say the allocation is
on a cpu local to node 1, so it gets the zonelist from node 1, which
contains nodes 1 and 0 in that order.

> This is not allowed before the move.
> So it should fail. Not skipping to another node.
>
>> - mempolicy mask is 0,1 (expand)
>> - mempolicy mask is 1 (shrink)
>> - zonelist iteration checks node 0, it's not allowed, skip
>> - OOM
>
> Are you talking about a race here between zonelist scanning and the
> moving? That has been there forever.

As far as I can tell from my git archeology in [1], there was always some
kind of protection against the race (generation counters, two-step
protocol, seqlock...), each of which however had some corner cases. This
patch merely plugs the last known one.

> And frankly there are gazillions of these races.

I don't know of any other existing race that we don't handle after this
patch.

> The best thing to do is
> to get the cpuset moving logic out of the kernel and into user space.
>
> Understand that this is a heuristic and maybe come up with a list of
> restrictions that make an app safe. A safe app that can be moved must f.e
>
> 1. Not allocate new memory while it's being moved
> 2. Not change memory policies after its initialization and while it's being
> moved.

As I explained earlier in this mail, changing the mempolicy by the app
itself is safe; the problem was always due to cpuset-triggered rebinding.

> 3. Not save memory policy state in some variable (because the logic to
> translate the memory policies for the new context cannot find it).
>
> ...
>
> Again cpuset process migration is a huge mess that you do not want to
> have in the kernel and AFAICT this is a corner case with difficult
> semantics. Better have that in user space...

Moving this out of the kernel etc. would change the current semantics and
break existing userspace; this patch is a fix within the existing
semantics.

[1] https://marc.info/?l=linux-mm&m=148611344511408&w=2

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
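
For anyone who wants to replay the interleaving from the scenario above
(zonelist 1,0; mask going 0 -> 0,1 -> 1 in the middle of the scan), here
is a toy model in Python rather than kernel C. It is only a sketch: the
names Policy, scan, scan_with_retry and replay are made up for
illustration, and the retry variant merely shows the general
generation-counter idea, not what the patch or any kernel version
actually implements.

```python
class Policy:
    """Toy mempolicy: a set of allowed nodes plus a change counter."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.seq = 0  # bumped on every rebind (hypothetical counter)

    def rebind(self, nodes):
        self.nodes = set(nodes)
        self.seq += 1


def scan(zonelist, policy, between_checks=()):
    """Return the first allowed node in zonelist order, or None ("OOM").

    between_checks[i], if present, runs after the i-th node was rejected.
    It stands in for a cpuset rebind racing with the scan.
    """
    for i, node in enumerate(zonelist):
        if node in policy.nodes:
            return node
        if i < len(between_checks):
            between_checks[i]()
    return None


def scan_with_retry(zonelist, policy, between_checks=()):
    """Like scan(), but rescan if the policy changed during a failed scan."""
    hooks = list(between_checks)
    while True:
        seq = policy.seq
        node = scan(zonelist, policy, hooks)
        if node is not None or policy.seq == seq:
            return node
        hooks = []  # the simulated rebind has already happened


def replay(scanner):
    """Run the scenario from this mail through the given scanner."""
    policy = Policy({0})       # before the move: only node 0 is allowed

    def move():                # cpuset move from node 0 to node 1
        policy.rebind({0, 1})  # expand
        policy.rebind({1})     # shrink

    # Allocation on a cpu local to node 1: zonelist order is 1, then 0.
    return scanner([1, 0], policy, [move])


print(replay(scan))             # None: both nodes were skipped
print(replay(scan_with_retry))  # 1: the rescan sees the new mask
```

The plain scan() fails although an allowed node existed at every point
in time, which is the false OOM from the scenario. The retry variant
still returns None when the mask did not change and no node is allowed,
so a genuine failure is not hidden.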