> On 2024/11/28 13:46, Tomohiro Misono (Fujitsu) wrote: > >>>> On 2024/11/21 12:55, Tomohiro Misono wrote: > >>>>> commit 44b8f8bf2438 ("mm: memory-failure: add memory failure stats > >>>> > >>>> Sorry for late, I've been swamped recently. > >>> > >>> Hi, > >>> Thanks for your comments. > >>> > >>>> > >>>>> to sysfs") introduces per NUMA memory error stats which show > >>>>> breakdown of HardwareCorrupted of /proc/meminfo in > >>>>> /sys/devices/system/node/nodeX/memory_failure. > >>>> > >>>> Thanks for your patch. > >>>> > >>>>> > >>>>> However, HardwareCorrupted also counts soft-offline pages. So, add > >>>>> soft-offline stats in mf_stats too to represent more accurate status. > >>>> > >>>> Adding soft-offline stats makes sense to me. > >>> > >>> Thanks for confirming. > >> > >> Agreed with Miaohe. > >> > >>> > >>>> > >>>>> > >>>>> This updates total count as: > >>>>> total = recovered + ignored + failed + delayed + soft_offline> > >>>>> Test example: > >>>>> 1) # grep HardwareCorrupted /proc/meminfo > >>>>> HardwareCorrupted: 0 kB > >>>>> 2) soft-offline 1 page by madvise(MADV_SOFT_OFFLINE) > >>>>> 3) # grep HardwareCorrupted /proc/meminfo > >>>>> HardwareCorrupted: 4 kB > >>>>> # grep -r "" /sys/devices/system/node/node0/memory_failure > >>>>> /sys/devices/system/node/node0/memory_failure/total:1 > >>>>> /sys/devices/system/node/node0/memory_failure/soft_offline:1 > >>>>> /sys/devices/system/node/node0/memory_failure/recovered:0 > >>>>> /sys/devices/system/node/node0/memory_failure/ignored:0 > >>>>> /sys/devices/system/node/node0/memory_failure/failed:0 > >>>>> /sys/devices/system/node/node0/memory_failure/delayed:0 > >>>>> > >>>>> Signed-off-by: Tomohiro Misono <misono.tomohiro@xxxxxxxxxxx> > >>>>> --- > >>>>> Hello > >>>>> > >>>>> This is RFC because I'm not sure adding SOFT_OFFLINE in enum > >>>>> mf_result is a right approach. Also, maybe is it better to move > >>>>> update_per_node_mf_stats() into num_poisoned_pages_inc()? > >>>>> > >>>>> I omitted some cleanups and sysfs doc update in this version to > >>>>> highlight changes. I'd appreciate any suggestions. > >>>>> > >>>>> Regards, > >>>>> Tomohiro Misono > >>>>> > >>>>> include/linux/mm.h | 2 ++ > >>>>> include/linux/mmzone.h | 4 +++- > >>>>> mm/memory-failure.c | 9 +++++++++ > >>>>> 3 files changed, 14 insertions(+), 1 deletion(-) > >>>>> > >>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h > >>>>> index 5d6cd523c7c0..7f93f6883760 100644 > >>>>> --- a/include/linux/mm.h > >>>>> +++ b/include/linux/mm.h > >>>>> @@ -3991,6 +3991,8 @@ enum mf_result { > >>>>> MF_FAILED, /* Error: handling failed */ > >>>>> MF_DELAYED, /* Will be handled later */ > >>>>> MF_RECOVERED, /* Successfully recovered */ > >>>>> + > >>>>> + MF_RES_SOFT_OFFLINE, /* Soft-offline */ > >>>> > >>>> It might not be a good idea to add MF_RES_SOFT_OFFLINE here. 'mf_result' is used to record > >>>> the result of memory failure handler. So it might be inappropriate to add MF_RES_SOFT_OFFLINE > here. > >>> > >>> Understood. As I don't see other suitable place to put ENUM value, how about changing like below? > >>> Or, do you prefer adding another ENUM type instead of this? > >> > >> I think SOFT_OFFLINE-ed is one of the results of successfully > >> recovered, and the other one is HARD_OFFLINE-ed. So how about make a > >> separate sub-ENUM for MF_RECOVERED? Something like: > > > > Thanks for the suggestion. > > > >> > >> enum mf_recovered_result { > >> MF_RECOVERED_SOFT_OFFLINE, > >> MF_RECOVERED_HARD_OFFLINE, > >> }; > > > > Ok. > > > >> > >> And > >> 1. total = recovered + ignored + failed + delayed > >> 2. recovered = soft_offline + hard_offline > > > > Do you mean mf_stats now have 7 entries in sysfs? > > (total, ignored, failed, delayed, recovered, hard_offline, soft_offline, then recovered = hard_offline + > soft_offline) > > Or 6 entries ? (in that case, hard_offline = recovered - soft_offline) > > It might be simpler to understand for user if total is just the sum of other entries like this RFC, > > but I'd like to know other opinions. > > Will it be better to have below items? > " > total > ignored > failed > dalayed > hard_offline > soft_offline > " > > though this will break the previous interface. > Any thoughts? That would be great, but these files are under stable ABI and I don't think we can change them, right? https://docs.kernel.org/admin-guide/abi-stable.html Userspace programs are free to use these interfaces with no restrictions, and backward compatibility for them will be guaranteed for at least 2 years. Most interfaces (like syscalls) are expected to never change and always be available. Regards, Tomohiro Misono