On 2023/02/23 15:24, lizhijian@xxxxxxxxxxx wrote:
> Hello folks,
>
> This mail raises a pmem memmap dump requirement and possible solutions, but they are all still premature.
> I really hope you can provide some feedback.
>
> pmem memmap can also be called pmem metadata here.
>
> ### Background and motivation overview ###
> ---
> Crash dump is an important feature for troubleshooting the kernel. It is the final way to chase what
> happened at a kernel panic, slowdown, and so on. It is the most important tool for customer support.
> However, part of the data on pmem is not included in the crash dump, which may make it difficult to
> analyze trouble around pmem (especially Filesystem-DAX).
>
>
> A pmem namespace in "fsdax" or "devdax" mode requires allocation of per-page metadata[1]. The allocation
> can be drawn from either mem (system memory) or dev (pmem device), see `ndctl help create-namespace` for
> more details. In fsdax, the struct page array becomes very important; it is one of the key pieces of
> data for finding the status of the reverse map.
>
> So, when the metadata is stored in pmem, even pmem's per-page metadata will not be dumped. That means
> troubleshooters are unable to check more details about pmem from the dumpfile.
>
> ### Make pmem memmap dump support ###
> ---
> Our goal is that whether metadata is stored on mem or pmem, its metadata can be dumped and then the
> crash-utilities can read more details about the pmem. Of course, this feature can be enabled/disabled.
>
> First, based on our previous investigation, according to the location of metadata and the scope of
> dump, we can divide it into the following four cases: A, B, C, D.
> It should be noted that although we mention cases A&B below, we do not want these two cases to be
> part of this feature, because dumping the entire pmem will consume a lot of space, and more importantly,
> it may contain user sensitive data.
>
> +-------------+----------+------------+
> |\+--------+\     metadata location   |
> |            ++-----------------------+
> | dump scope  |  mem     |   PMEM     |
> +-------------+----------+------------+
> | entire pmem |     A    |     B      |
> +-------------+----------+------------+
> | metadata    |     C    |     D      |
> +-------------+----------+------------+
>
> Case A&B: unsupported
> - Only the regions listed in PT_LOAD in vmcore are dumpable. This can be resolved by adding the pmem
> region into vmcore's PT_LOADs in kexec-tools.
> - makedumpfile assumes that all page objects of the entire region described in PT_LOADs
> are readable, and then skips/excludes specific pages according to their attributes. But in the case
> of pmem, the 1st kernel only allocates page objects for the namespaces of pmem, so makedumpfile will throw
> errors[2] when specific -d options are specified.
> Accordingly, we should make makedumpfile ignore these errors if it's a pmem region.
>
> Because the above cases are not in our goal, we must consider how to prevent the data part of pmem
> from being read by the dump application (makedumpfile).
>
> Case C: natively supported
> metadata is stored in mem, and the entire mem/ram is dumpable.
>
> Case D: unsupported && need your input
> To support this situation, makedumpfile needs to know the location of metadata for each pmem
> namespace and the address and size of metadata in the pmem [start, end)
>
> We have thought of a few possible options:
>
> 1) In the 2nd kernel, with the help of the information from /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y}
> exported by pmem drivers, makedumpfile is able to calculate the address and size of metadata
> 2) In the 1st kernel, add a new symbol to the vmcore. The symbol is associated with the layout of
> each namespace. The makedumpfile reads the symbol and figures out the address and size of the metadata.

Hi Zhijian,

sorry, probably I don't understand enough, but do these mean that 1.
/proc/vmcore exports pmem regions with PT_LOADs, which contain unreadable ones, and
2. makedumpfile gets to know the readable regions somehow?

Then /proc/vmcore with pmem cannot be captured by other commands, e.g. cp command?

Thanks,
Kazu

> 3) others ?
>
> But then we found that we have always ignored a use case, that is, the user could save the dumpfile
> to the pmem. Neither of these two options can solve this problem, because the pmem drivers will
> re-initialize the metadata during the pmem driver loading process, which makes the metadata
> we dumped inconsistent with the metadata at the moment of the crash.
> Simply put, can we just disable the pmem directly in the 2nd kernel so that the previous metadata will
> not be destroyed? But this operation brings the inconvenience that the 2nd kernel doesn’t allow the user
> to store the dumpfile on a filesystem/partition based on pmem.
>
> So here I hope you can provide some ideas about this feature/requirement and on the possible solution
> for the cases A&B&D mentioned above, it would be greatly appreciated.
>
> If I’m missing something, feel free to let me know. Any feedback & comment are very welcome.
>
>
> [1] Pmem region layout:
>    ^<--namespace0.0---->^<--namespace0.1------>^
>    |                    |                      |
>    +--+m----------------+--+m------------------+---------------------+-+a
>    |++|e                |++|e                  |                     |+|l
>    |++|t                |++|t                  |                     |+|i
>    |++|a                |++|a                  |                     |+|g
>    |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
>    |++|a    fsdax       |++|a     devdax       |                     |+|m
>    |++|t                |++|t                  |                     |+|e
>    +--+a----------------+--+a------------------+---------------------+-+n
>    |                                                                   |t
>    v<-----------------------pmem region------------------------------->v
>
> [2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@xxxxxxxxxxxxxxxx/T/
>
>
> Thanks
> Zhijian
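
To give a feeling for how much data Case D is about, the size of the per-page metadata can be estimated from the namespace size. This is only a rough sketch under stated assumptions: 4 KiB pages and 64 bytes of struct page per page (the x86 overhead figure given in the ndctl create-namespace documentation). The function name and the 1 TiB example are hypothetical and are not taken from makedumpfile or the pmem drivers, and the real reserved area is additionally rounded up for alignment, which is ignored here.

```python
# Rough estimate of the struct page array ("memmap") size for a pmem namespace.
# Assumptions (not from the thread): 4 KiB base pages, 64-byte struct page.
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64


def estimate_memmap_bytes(namespace_bytes: int) -> int:
    """Return the approximate number of bytes of per-page metadata.

    Alignment padding of the reserved metadata area is not modelled.
    """
    pages = namespace_bytes // PAGE_SIZE
    return pages * STRUCT_PAGE_SIZE


if __name__ == "__main__":
    one_tib = 1 << 40  # hypothetical 1 TiB namespace
    gib = estimate_memmap_bytes(one_tib) / (1 << 30)
    print(f"estimated metadata: {gib:.0f} GiB")
```

Under these assumptions the metadata is about 1/64 of the namespace size, so dumping only the metadata (Case D) is far smaller than dumping the entire pmem (Case A&B).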