On Thu, 25 Mar 2010 21:33:58 +0800 Miao Xie <miaox@xxxxxxxxxxxxxx> wrote:

> on 2010-3-11 19:03, Nick Piggin wrote:
> >> Ok, I try to make a new patch by using seqlock.
> >
> > Well... I do think seqlocks would be a bit simpler because they don't
> > require this checking and synchronizing of this patch.
>
> Hi, Nick Piggin
>
> I have made a new patch which uses seqlock to protect mems_allowed and
> mempolicy. Please review it.

That's an awfully big patch for a pretty small bug?

> Subject: [PATCH] [PATCH -mmotm] cpuset,mm: use seqlock to protect task->mempolicy and mems_allowed (v2)
>
> Before applying this patch, cpuset updates task->mems_allowed by setting
> all the new bits in the nodemask first, and clearing all the old,
> now-disallowed bits later.  But this way, the allocator can see an empty
> nodemask, though it is infrequent.
>
> The problem is the following: the size of nodemask_t is greater than the
> size of a long integer, so loading and storing a nodemask_t are not atomic
> operations.  Suppose task->mems_allowed doesn't intersect with new_mask --
> for example, the first word of the mask is empty and only the first word
> of new_mask is non-empty.  If the allocator loads a word of the mask before
>
> 	current->mems_allowed |= new_mask;
>
> and then loads another word of the mask after
>
> 	current->mems_allowed = new_mask;
>
> the allocator gets an empty nodemask.

Probably we could fix this via careful ordering of the updates, barriers
and perhaps some avoidance action at the reader side.

> Besides that, if the size of nodemask_t is less than the size of a long
> integer, there is another problem.  When the kernel allocator invokes the
> following function,
>
> 	struct zoneref *next_zones_zonelist(struct zoneref *z,
> 					enum zone_type highest_zoneidx,
> 					nodemask_t *nodes,
> 					struct zone **zone)
> 	{
> 		/*
> 		 * Find the next suitable zone to use for the allocation.
> 		 * Only filter based on nodemask if it's set
> 		 */
> 		if (likely(nodes == NULL))
> 			......
> 		else
> 			while (zonelist_zone_idx(z) > highest_zoneidx ||
> 				(z->zone && !zref_in_nodemask(z, nodes)))
> 				z++;
>
> 		*zone = zonelist_zone(z);
> 		return z;
> 	}
>
> and the nodemask changes between two calls of zref_in_nodemask(), such as
>
> 	Task1						Task2
> 	zref_in_nodemask(z = node0's z, nodes = 1-2)
> 	zref_in_nodemask returns 0
> 							nodes = 0
> 	zref_in_nodemask(z = node1's z, nodes = 0)
> 	zref_in_nodemask returns 0
>
> then z will overflow.

And maybe we can fix this by taking a copy into a local.

> When the kernel allocator accesses task->mempolicy, there is the same
> problem.

And maybe we can fix those in a similar way.

But it's all too much, and we'll just break it again in the future.  So
yup, I guess locking is needed.
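To make that torn read concrete, here is a minimal sketch of the two-phase
update as cpuset does it today, assuming MAX_NUMNODES is larger than
BITS_PER_LONG so the mask spans at least two words (the function name is
invented for illustration):

	#include <linux/sched.h>
	#include <linux/nodemask.h>

	/*
	 * Illustrative only.  Suppose the old mask has only bit 64 set
	 * (word 1) and new_mask has only bit 0 set (word 0).
	 */
	static void two_phase_update_sketch(struct task_struct *tsk,
					    const nodemask_t *new_mask)
	{
		/* Phase 1: set all the new bits. */
		nodes_or(tsk->mems_allowed, tsk->mems_allowed, *new_mask);

		/*
		 * Race window: a reader that loaded word 0 before the
		 * nodes_or() above (seeing it empty) and loads word 1
		 * after the assignment below (by then cleared) observes
		 * a completely empty nodemask.
		 */

		/* Phase 2: clear all the old, now-disallowed bits. */
		tsk->mems_allowed = *new_mask;
	}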
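For reference, roughly what the seqlock approach looks like on both sides;
the per-task field name mems_seqlock is hypothetical, not necessarily what
Miao's patch uses:

	#include <linux/seqlock.h>
	#include <linux/sched.h>
	#include <linux/nodemask.h>

	/*
	 * Writer side (cpuset update), assuming a seqlock_t mems_seqlock
	 * field in task_struct.
	 */
	static void write_mems_allowed_sketch(struct task_struct *tsk,
					      const nodemask_t *new_mask)
	{
		write_seqlock(&tsk->mems_seqlock);
		/* One plain store; no two-phase dance needed any more. */
		tsk->mems_allowed = *new_mask;
		write_sequnlock(&tsk->mems_seqlock);
	}

	/* Reader side (allocator): retry until the copy is consistent. */
	static void read_mems_allowed_sketch(struct task_struct *tsk,
					     nodemask_t *mask)
	{
		unsigned int seq;

		do {
			seq = read_seqbegin(&tsk->mems_seqlock);
			*mask = tsk->mems_allowed;
		} while (read_seqretry(&tsk->mems_seqlock, seq));
	}

The reader never blocks the writer; it simply retries if an update raced
with the copy, so the allocator always ends up with a self-consistent
snapshot of the mask (and the same scheme would extend to task->mempolicy).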
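And the "take a copy into a local" idea for the zonelist walk might look
like this, reusing read_mems_allowed_sketch() from above so that
zref_in_nodemask() always tests a stable mask and z cannot run off the end
of the zonelist mid-scan (again, the function name is invented):

	#include <linux/mmzone.h>

	static struct zone *first_allowed_zone_sketch(struct zonelist *zonelist,
						      enum zone_type high_zoneidx)
	{
		nodemask_t local_mask;	/* stable private snapshot */
		struct zoneref *z;
		struct zone *zone;

		read_mems_allowed_sketch(current, &local_mask);

		/*
		 * Walk the zonelist against the local copy, which no
		 * concurrent cpuset update can change under us.
		 */
		for_each_zone_zonelist_nodemask(zone, z, zonelist,
						high_zoneidx, &local_mask)
			return zone;	/* first suitable zone */

		return NULL;
	}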