Re: [MAKDUMPFILE PATCH] Add option to estimate the size of vmcore dump files

lijiang <lijiang@xxxxxxxxxx> · Mon, 2 Nov 2020 16:07:14 +0800

On 2020-10-30 14:29, HAGIO KAZUHITO(萩尾　一仁) wrote:
> -----Original Message-----
>> On 2020-10-28 16:32, HAGIO KAZUHITO(萩尾　一仁) wrote:
>>> Hi Julien,
>>>
>>> sorry for my delayed reply.
>>>
>>> -----Original Message-----
>>>>>>>>> A user might want to know how much space a vmcore file will take on
>>>>>>>>> the system and how much space on their disk should be available to
>>>>>>>>> save it during a crash.
>>>>>>>>>
>>>>>>>>> The option --vmcore-size does not create the vmcore file but provides
>>>>>>>>> an estimation of the size of the final vmcore file created with the
>>>>>>>>> same makedumpfile options.
>>>>>
>>>>> Interesting.  Do you have any actual use case?  e.g. used by kdumpctl?
>>>>> or use it in kdump initramfs?
>>>>>
>>>>
>>>> Yes, the idea would be to use this in mkdumprd to have a more accurate
>>>> estimate of the dump size (currently it cannot take compression into
>>>> account and warns about potential lack of space, considering the system
>>>> memory size as a whole).
>>>
>>> Hmm, I'm not sure how you are going to implement it in mkdumprd, but I do not
>>> recommend that you use it to determine how much disk space should be
>>> allocated for crash dump.  Because, I think that
>>>
>>> - It cannot estimate the dump size when a real crash occurs, e.g. if slab
>>> explodes with non-zero data, almost all memory will be captured by makedumpfile
>>
>> I agree with you, but this could be rare? If yes, I'm not sure if it is worth
>> thinking more about the rare situations.
> 
> Cases where a dumpfile is inflated with -d 31 might be rare, but if users
> need user data, e.g. for gcore, underestimation will occur easily.
> 
Yes, that's true.

>>
>>> even with -d 31, and compression ratio varies with data in memory.
>>
>> Indeed.
>>
>>> Also, in most cases, mkdumprd runs at boot time or construction phase
>>> with less memory usage, not at usual application running time.  So it
>>> can underestimate the needed size easily.
>>>
>> If an administrator can monitor the estimated size periodically, maybe it
>> won't be a problem?
> 
> I think most of them cannot or do not do that, and even if they could,
> when a panic is caused by an unknown problem, can you depend on that estimation?
> 
This requires users to evaluate the risk themselves. The tool only provides a
reference value at a certain point in time, and should remind users of such risks.

>>
>>> - The system might need a full vmcore and need to change makedumpfile's
>>> dump level for an issue in the future.  But many systems cannot change
>>> their disk space allocation easily.  So we should prevent users from
>>> having minimum disk space for crash dump.
>>>
>>> So, the following is from mkdumprd on Fedora 32, personally I think this
>>> is good for now.
>>>
>>>     if [ $avail -lt $memtotal ]; then
>>>         echo "Warning: There might not be enough space to save a vmcore."
>>>         echo "         The size of $2 should be greater than $memtotal kilo bytes."
>>>     fi
>>>
>> Currently, some users are complaining that mkdumprd overestimates the needed size,
>> and most vmcores are significantly smaller than the size of system memory.
>>
>> Furthermore, in most cases the system memory will not be completely exhausted, but
>> that still depends on how memory is used on the system, for example:
>> [1] running a memory stress test
>> [2] a workload that always occupies a large amount of memory and never releases it.
>>
>> These two cases may be rare.
> 
> In the thousands of support cases I've seen and worked on, memory is
> exhausted easily and unexpectedly.  Especially nowadays, I often see panics
> caused by vm.panic_on_oom.
> 
>> Therefore, can we find a compromise
>> between the size of vmcore and system memory so that makedumpfile can estimate the
>> size of vmcore more accurately?
>>
>> And finally, mkdumprd can use the estimated size of vmcore instead of system memory (memtotal)
>> to determine if the target disk has enough space to store vmcore.
> 
> The current mkdumprd just warns about the possibility of a lack of space,
> it doesn't fail.  I think this is a good balance.
> 
> Users can choose the estimated size over the whole memory size at
> their discretion.  Providing a useful estimation tool for them
> might be good.
> 
> But, if we do so, we should let users know the tradeoff between the
> disk space and the risk of failure.  So I believe that we should
> continue to warn about the possibility of failing to capture a vmcore
> with less space than the whole memory.
> 
We have the same understanding of this issue. Maybe we could have a document
to explain the details.
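
As a starting point for such a document, the tradeoff could be sketched roughly
as below. This is only an illustration, not the actual mkdumprd code: the
function name check_dump_space and its third argument (an estimated vmcore size
obtained from some estimation tool) are assumptions; only the avail/memtotal
comparison comes from the Fedora snippet quoted above. All sizes are in
kilobytes.

```sh
#!/bin/sh
# Hypothetical helper, not part of mkdumprd. All sizes are in kilobytes.
#   $1 = space available on the dump target
#   $2 = total system memory (e.g. MemTotal from /proc/meminfo)
#   $3 = estimated vmcore size (assumed to come from an estimation tool)
check_dump_space() {
    avail=$1
    memtotal=$2
    estimate=$3

    if [ "$avail" -lt "$estimate" ]; then
        # Not even the point-in-time estimate fits.
        echo "Warning: available space ($avail kB) is below the estimated vmcore size ($estimate kB)."
    elif [ "$avail" -lt "$memtotal" ]; then
        # The estimate fits, but a real crash may need more than the estimate.
        echo "Warning: the target holds the estimated vmcore ($estimate kB) but not the whole memory ($memtotal kB)."
        echo "         The estimate is only valid for the time it was taken; capturing a vmcore may still fail."
    fi
}
```

With this shape the estimate never silences the warning entirely: a target
smaller than the whole memory is still reported, only with a different message.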

Thanks.
Lianbo

> Thanks,
> Kazu
> 
> 
>>
>>
>> Thanks.
>> Lianbo
>>
>>> The patch's functionality itself might be useful and I don't reject it, though.
>>>
>>>>>>>>> @@ -4643,6 +4706,8 @@ write_buffer(int fd, off_t offset, void *buf, size_t buf_size, char *file_name)
>>>>>>>>>                   }
>>>>>>>>>                   if (!write_and_check_space(fd, &fdh, sizeof(fdh), file_name))
>>>>>>>>>                           return FALSE;
>>>>>>>>> +       } else if (info->flag_vmcore_size && fd == info->fd_dumpfile) {
>>>>>>>>> +               return write_buffer_update_size_info(offset, buf, buf_size);
>>>>>
>>>>> Why do we need this function?  makedumpfile actually writes zero-filled
>>>>> pages to the dumpfile with -d 0, and doesn't write them with -d 1.
>>>>> So isn't "write_bytes += buf_size" enough?  For example, with -d 30,
>>>>>
>>>>
>>>> The reason I went with this method was to make an estimate of the number
>>>> of blocks actually allocated on the disk (since depending on how the
>>>> data written is scattered in the file, there might be a significant
>>>> difference between bytes written vs actual size allocated on disk). But
>>>> I realize that there was some misunderstanding on my end, since writing
>>>> zeros does allocate blocks, as opposed to not writing at some offset
>>>> (skipping it with lseek()). I would need to fix that.
>>>>
>>>> To highlight the behaviour I'm talking about:
>>>> $ dd if=/dev/zero of=./testfile bs=4096 count=1 seek=1
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000302719 s, 13.5 MB/s
>>>> $ du -h testfile
>>>> 4.0K	testfile
>>>>
>>>> $ dd if=/dev/zero of=./testfile bs=4096 count=2
>>>> 2+0 records in
>>>> 2+0 records out
>>>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000373002 s, 22.0 MB/s
>>>> $ du -h testfile
>>>> 8.0K	testfile
>>>>
>>>>
>>>> So, do you think it's not worth bothering to estimate the number of
>>>> blocks allocated and that I should only consider the number of bytes written?
>>>
>>> Yes, makedumpfile makes almost no empty (sparse) blocks,
>>> so the error would be small enough.
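
For reference, the difference between bytes written and blocks allocated can be
seen side by side with a small script like the one below. It is only a sketch:
the helper name size_info is made up for illustration, and it assumes GNU stat
(%s = apparent size, %b = allocated blocks, %B = bytes per block).

```sh
#!/bin/sh
# Print "<apparent bytes> <allocated bytes>" for a file.
# Assumes GNU stat: %s = apparent size, %b = allocated blocks, %B = bytes per block.
size_info() {
    stat -c '%s %b %B' "$1" | awk '{ print $1, $2 * $3 }'
}

tmp=$(mktemp -d)

# Skip the first 4 KiB block with seek=1, then write one block of zeros.
dd if=/dev/zero of="$tmp/sparse" bs=4096 count=1 seek=1 2>/dev/null

# Write two full blocks of zeros, leaving no hole.
dd if=/dev/zero of="$tmp/dense" bs=4096 count=2 2>/dev/null

echo "sparse: $(size_info "$tmp/sparse")"
echo "dense:  $(size_info "$tmp/dense")"

rm -rf "$tmp"
```

Both files have the same apparent size. On a filesystem that supports holes, the
first one typically reports fewer allocated bytes, which is the gap a plain
byte counter would not see.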
>>>
>>>>>>>>
>>>>>>>> I like the idea, but sometimes we use makedumpfile to generate a
>>>>>>>> dumpfile in the primary kernel as well. For example:
>>>>>>>>
>>>>>>>> $ makedumpfile -d 31 -x vmlinux /proc/kcore dumpfile
>>>>>>>>
>>>>>>>> In such use-cases it is useful to use --vmcore-size and still generate
>>>>>>>> the dumpfile (right now the default behaviour is not to generate a
>>>>>>>> dumpfile when --vmcore-size is specified). Maybe we need to think more
>>>>>>>> on supporting this use-case as well.
>>>>>>>>
>>>>>>>
>>>>>>> The thing is, if you are generating the dumpfile, you can just check the
>>>>>>> size of the file created with "du -b" or some other command.
>>>>>>
>>>>>> I agree, but I just was looking to replace the two 'makedumpfile +
>>>>>> du' steps with a single 'makedumpfile --vmcore-size' step.
>>>>>>
>>>>>>> Overall I don't mind supporting your case as well. Maybe that can depend
>>>>>>> on whether a vmcore/dumpfile filename is provided:
>>>>>>>
>>>>>>> $ makedumpfile -d 31 -x vmlinux /proc/kcore    # only estimates the size
>>>>>>>
>>>>>>> $ makedumpfile -d 31 -x vmlinux /proc/kcore dumpfile  # writes the
>>>>>>> dumpfile and gives the final size
>>>>>>>
>>>>>>> Any thought, opinions, suggestions?
>>>>>>
>>>>>> Let's wait for Kazu's opinion on the same, but I am ok with using a
>>>>>> two-step 'makedumpfile + du' approach for now (and later expand
>>>>>> --vmcore-size as we encounter more use-cases).
>>>>>>
>>>>>> @Kazuhito Hagio : What's your opinion on the above?
>>>>>
>>>>> I would prefer only estimating with the option.
>>>>>
>>>>> And if the write_bytes method above is usable, it can also be shown
>>>>> in the report messages when the dumpfile is written.
>>>>>
>>>>
>>>> Let me know your preferred approach considering my comment above and
>>>> I'll send out a v2.
>>>
>>> I'm rethinking about what command options makedumpfile should have.
>>> Once we add an option to makedumpfile, we cannot change it easily,
>>> so I'd like to think carefully.
>>>
>>> The calculated size might be useful if it's printed so that it can be
>>> easily post-processed by scripts, e.g. for automated tests.  If so,
>>> makedumpfile already prints its statistics with "--message-level 16",
>>> and it might be useful to also print them by an option like "--show-stats".
>>>
>>>   # makedumpfile --show-stats -l -d 31 vmcore dump.ld31
>>>   total_pages xxx
>>>   excluded_pages yyy
>>>   ...
>>>   write_bytes zzz
>>>
>>> Also, if we also have a "--dry-run" option that does not actually write
>>> anything, it's explicit and meets Bhupesh's use case.  What do you think?
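
If the statistics were printed in the "key value" form proposed above, a script
could pick out the size with something like the sketch below. The output layout
is only the proposal from this thread, the helper name get_stat is made up, and
the sample numbers are invented stand-ins for real output.

```sh
#!/bin/sh
# Read "key value" lines on stdin and print the value for the key given as $1.
# The key/value layout is the hypothetical one proposed in this thread.
get_stat() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Sample input with made-up numbers, standing in for real tool output.
sample='total_pages 1000
excluded_pages 900
write_bytes 409600'

bytes=$(printf '%s\n' "$sample" | get_stat write_bytes)
echo "estimated size: $((bytes / 1024)) kB"
```

In a real script the sample variable would be replaced by the tool's output, and
the resulting value could be compared against the free space on the dump target.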
>>>
>>> Thanks,
>>> Kazu
>>>
>>> _______________________________________________
>>> kexec mailing list
>>> kexec@xxxxxxxxxxxxxxxxxxx
>>> http://lists.infradead.org/mailman/listinfo/kexec
>>>