Hi Dmitry,

On Thu, Jun 11, 2020 at 07:43:19PM +0300, Dmitry Yakunin wrote:
> Hello!
>
> We are faced with similar problems with hwpoisoned pages
> on one of our production clusters after kernel update to stable 4.19.
> Application that does a lot of memory allocations sometimes caught SIGBUS signal
> with message in dmesg about hardware memory corruption fault.
> In kernel and mce logs we saw messages about soft offlining pages with
> correctable errors. Those events always had happened before application
> was killed. This is not the behavior we expect. We want our application to
> continue working on a smaller set of available pages in the system.
>
> This issue is difficult to reproduce, but we suppose that the reason for such
> behavior is that compaction does not check for page poisonness while processing
> free pages, so as a result valid userspace data gets migrated to bad pages.
> We wrote the simple test:
> - soft offline first 4 pages in every 64 continuous pages in ZONE_NORMAL
> through writing pfn to /sys/devices/system/memory/soft_offline_page
> - force compaction by echo 1 >> /proc/sys/vm/compact_memory
> Without this patch series after these steps bash became unusable
> and every attempt to run any command leads to SIGBUS with message about
> hardware memory corruption fault. And after applying this series to our kernel
> tree we cannot reproduce such SIGBUSes by our test. On upstream kernel 5.7
> this behavior is still reproducible.
>
> So, we want to know, why this patchset wasn't merged to the upstream?
> Is there any problems in such rework for {soft,hard}-offline handling?

No technical reason, it's just because I didn't have enough power to push
this to be merged. Really sorry about that.

> BTW, this patchset should be updated with upstream changes in mm.

I'm working on this now and still need more testing to confirm, but I hope
I'll update and post this for 5.9.

Thanks,
Naoya Horiguchi
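
For reference, the two-step reproducer quoted above could be sketched roughly as follows. This is not the script used in the report, only an illustration under stated assumptions: the pfn range for ZONE_NORMAL is supplied by the caller (it could be read from /proc/zoneinfo), the helper names `poison_pfns`, `pfn_to_phys` and `run` are hypothetical, and the sysfs file is given a physical address (pfn multiplied by the page size) rather than a raw pfn, which is what the kernel's soft_offline_page interface expects as far as I know. It needs root and a kernel built with memory-failure support, and it should only be run on a disposable test machine.

```python
import os

# Paths taken from the quoted test description.
SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"
COMPACT = "/proc/sys/vm/compact_memory"


def poison_pfns(start_pfn, end_pfn, group=64, bad=4):
    """Yield the first `bad` pfns of every `group`-aligned block
    that starts inside [start_pfn, end_pfn)."""
    first = -(-start_pfn // group) * group  # round up to a block boundary
    for base in range(first, end_pfn, group):
        for pfn in range(base, min(base + bad, end_pfn)):
            yield pfn


def pfn_to_phys(pfn, page_size=4096):
    """Convert a page frame number to a physical address."""
    return pfn * page_size


def run(start_pfn, end_pfn, page_size=None, dry_run=True):
    """Soft-offline the selected pages, then force compaction.

    With dry_run=True (the default) nothing is written and only the
    list of physical addresses is returned.
    """
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    addrs = [pfn_to_phys(p, page_size) for p in poison_pfns(start_pfn, end_pfn)]
    if dry_run:
        return addrs
    for addr in addrs:
        try:
            with open(SOFT_OFFLINE, "w") as f:
                f.write(hex(addr))
        except OSError:
            # Assumption: some pages cannot be soft-offlined; skip them.
            pass
    with open(COMPACT, "w") as f:
        f.write("1")
    return addrs
```

Calling `run(start_pfn, end_pfn)` with the default `dry_run=True` only lists the addresses that would be written, which makes it possible to check the selection before touching the system.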