On 11/08/21 at 06:31pm, Andrew Morton wrote:
> From: David Hildenbrand <david@xxxxxxxxxx>
> Subject: proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks
>
> Let's support multiple registered callbacks, making sure that registering
> vmcore callbacks cannot fail. Make the callback return a bool instead of
> an int, handling how to deal with errors internally. Drop unused
> HAVE_OLDMEM_PFN_IS_RAM.
>
> We soon want to make use of this infrastructure from other drivers:
> virtio-mem, registering one callback for each virtio-mem device, to
> prevent reading unplugged virtio-mem memory.
>
> Handle it via a generic vmcore_cb structure, prepared for future
> extensions: for example, once we support virtio-mem on s390x where the
> vmcore is completely constructed in the second kernel, we want to detect
> and add plugged virtio-mem memory ranges to the vmcore in order for them
> to get dumped properly.
>
> Handle corner cases that are unexpected and shouldn't happen in sane
> setups: registering a callback after the vmcore has already been opened
> (warn only) and unregistering a callback after the vmcore has already been
> opened (warn and essentially read only zeroes from that point on).
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I am fine with the whole patch except for one concern. As the underscored
sentence above states, if a callback is unregistered after the vmcore has
been opened, reads return zeroes from that point on. This is done by
checking the global variable 'vmcore_cb_unstable' in pfn_is_ram(). As a
result, a vmcore dump via makedumpfile can only read zero pages from then
on, and may take a long extra time to finish.

Please see remap_oldmem_pfn_checked(). By default, makedumpfile mmaps a 4M
memory region at a time, then copies it out. With this patch, if
vmcore_cb_unstable is true, the kernel will mmap page by page.
The extra time could be huge, e.g. on a machine with TBs of memory, and we
would very likely only get a useless vmcore because of the lost core data.
I am wondering whether we can simply panic in that case, since the
remaining dump is all zeroes and the vmcore is very likely unusable anyway.

......
>  static bool pfn_is_ram(unsigned long pfn)
>  {
> -	int (*fn)(unsigned long pfn);
> -	/* pfn is ram unless fn() checks pagetype */
> +	struct vmcore_cb *cb;
>  	bool ret = true;
>
> -	/*
> -	 * Ask hypervisor if the pfn is really ram.
> -	 * A ballooned page contains no data and reading from such a page
> -	 * will cause high load in the hypervisor.
> -	 */
> -	fn = oldmem_pfn_is_ram;
> -	if (fn)
> -		ret = !!fn(pfn);
> +	lockdep_assert_held_read(&vmcore_cb_rwsem);
> +	if (unlikely(vmcore_cb_unstable))
> +		return false;
> +
> +	list_for_each_entry(cb, &vmcore_cb_list, next) {
> +		if (unlikely(!cb->pfn_is_ram))
> +			continue;
> +		ret = cb->pfn_is_ram(cb, pfn);
> +		if (!ret)
> +			break;
> +	}
>
>  	return ret;
>  }
......