On Mon, May 09, 2016 at 11:00:52PM +0900, Sergey Senozhatsky wrote:
> zs_can_compact() has two race conditions in its core calculation:
>
>     unsigned long obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
>                 zs_stat_get(class, OBJ_USED);
>
> 1) classes are not locked, so the numbers of allocated and used
>    objects can change due to concurrent ops happening on other CPUs
> 2) shrinker invokes it from preemptible context
>
> Depending on the circumstances, OBJ_ALLOCATED can thus become
> less than OBJ_USED, which can result in either a very high or a
> negative `total_scan' value calculated later in do_shrink_slab().
>
> do_shrink_slab() has some logic to prevent those cases:
>
>     vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
>     vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
>     vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-64
>     vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
>     vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
>     vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
>
> However, due to the way `total_scan' is calculated, not every
> shrinker->count_objects() overflow can be spotted and handled.
> To demonstrate the latter, I added some debugging code to
> do_shrink_slab() (x86_64) and the results were:
>
>     vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
>     vmscan: but total_scan > 0: 92679974445502
>     vmscan: resulting total_scan: 92679974445502
>     [..]
>     vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
>     vmscan: but total_scan > 0: 22634041808232578
>     vmscan: resulting total_scan: 22634041808232578
>
> Even though shrinker->count_objects() has returned an overflowed value,
> the resulting `total_scan' is positive and, what is more worrisome,
> insanely huge. This value is later used in the
> shrinker->scan_objects() loop:
>
>     while (total_scan >= batch_size ||
>            total_scan >= freeable) {
>         unsigned long ret;
>         unsigned long nr_to_scan = min(batch_size, total_scan);
>
>         shrinkctl->nr_to_scan = nr_to_scan;
>         ret = shrinker->scan_objects(shrinker, shrinkctl);
>         if (ret == SHRINK_STOP)
>             break;
>         freed += ret;
>
>         count_vm_events(SLABS_SCANNED, nr_to_scan);
>         total_scan -= nr_to_scan;
>
>         cond_resched();
>     }
>
> `total_scan >= batch_size' stays true for a very long time, and
> `total_scan >= freeable' is also true for quite some time, because
> `freeable < 0' and `total_scan' is large enough, for example,
> 22634041808232578. The only remaining break condition, in the given
> scheme of things, is the shrinker->scan_objects() == SHRINK_STOP
> test, which is a bit too weak to rely on, especially in heavy
> zsmalloc-usage scenarios.
>
> To fix the issue, take a pool stat snapshot and use it instead of
> racy zs_stat_get() calls.
>
> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx>
> Cc: Minchan Kim <minchan@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx> [4.3+]

Acked-by: Minchan Kim <minchan@xxxxxxxxxx>

Thanks!
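
For reference, the huge counts quoted above are plain unsigned wrap-around:
on x86_64 `unsigned long' is 64 bits wide, so subtracting a larger counter
from a smaller one yields a value close to 2^64 rather than a negative
number. A minimal userspace illustration of that effect (the counter values
below are made up, not taken from the report):

    #include <stdio.h>

    int main(void)
    {
            /* Hypothetical racy readings: another CPU bumped OBJ_USED
             * between the two reads, so "used" exceeds "allocated". */
            unsigned long obj_allocated = 1000;
            unsigned long obj_used = 1002;

            /* Same shape as the racy zs_can_compact() expression: the
             * unsigned subtraction wraps instead of going negative. */
            unsigned long obj_wasted = obj_allocated - obj_used;

            /* Prints 18446744073709551614 on LP64 (2^64 - 2). */
            printf("obj_wasted = %lu\n", obj_wasted);
            return 0;
    }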
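
The patch itself is not quoted in this message, but the fix it describes
("take a pool stat snapshot and use it instead of racy zs_stat_get() calls")
boils down to reading each counter once into a local and bailing out when
the snapshot is inconsistent. A rough sketch of that shape, assuming the
zs_stat_get()/OBJ_ALLOCATED/OBJ_USED interfaces named above; the function
signature is guessed and the conversion to freeable pages is omitted:

    static unsigned long zs_can_compact(struct size_class *class)
    {
            /* Read each counter exactly once into a local snapshot. */
            unsigned long obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
            unsigned long obj_used = zs_stat_get(class, OBJ_USED);

            /*
             * Concurrent updates can make the snapshot inconsistent
             * (used > allocated).  Report nothing to compact instead of
             * letting the unsigned subtraction wrap to a huge value.
             */
            if (obj_allocated <= obj_used)
                    return 0;

            /* Safe now: cannot underflow.  (The real function goes on
             * to convert wasted objects into freeable pages.) */
            return obj_allocated - obj_used;
    }

The key point is the early return: with it, the shrinker's
->count_objects() callback can no longer hand do_shrink_slab() a
wrapped-around freeable count, so the pathological `total_scan' values
shown above cannot be produced by this path.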