Re: [PATCH] Fix machdep->HZ calculation for kernel versions > 2.6.0

lijiang <lijiang@xxxxxxxxxx> · Thu, 22 Apr 2021 22:26:19 +0800

On 2021-04-22 17:33, HAGIO KAZUHITO(萩尾　一仁) wrote:
> -----Original Message-----
>> -----Original Message-----
>>> On 2021-01-12 16:24, HAGIO KAZUHITO(萩尾　一仁) wrote:
>>>> Hi Bhupesh,
>>>>
>>>> -----Original Message-----
>>>>> We have hard-coded the HZ value for some ARCHs to either 1000 or 100
>>>>> (mainly for kernel versions > 2.6.0), which causes 'help -m' to show
>>>>> an incorrect hz value for various architectures.
>>>>
>>>> Good catch, but it seems crash uses (cfq_slice_async * 25) for machdep->hz
>>>> if it exists (please see task_init()). RHEL7 has it, but RHEL8 does not.
>>>> What do you see on RHEL8 for x86_64 with your patch?
>>>>
>>>
>>> The symbol 'cfq_slice_async' has been removed from upstream kernel:
>>> f382fb0bcef4 ("block: remove legacy IO schedulers")
>>>
>>> And RHEL8 also removed it.
>>>
>>>> We should search for an alternate way like the current one first.
>>>>
>>>
>>> Currently, there are several ways to get the value of HZ as below:
>>>
>>> [1] calculate hz via the symbol 'cfq_slice_async'
>>>     But this symbol has been removed from upstream kernel
>>
>> According to [0] below, the 'cfq_slice_async' cannot be used for the HZ
>> calculation on 4.8 and later kernels.  I've not found a perfect alternate,
>> but how about using 'bfq_timeout' for 4.12 and later including RHEL8?
> 
> e.g. like this:
> 
> --- a/task.c
> +++ b/task.c
> @@ -417,7 +417,16 @@ task_init(void)
>  
>  	STRUCT_SIZE_INIT(cputime_t, "cputime_t");
>  
> -	if (symbol_exists("cfq_slice_async")) {
> +	if (symbol_exists("bfq_timeout")) {
> +		uint bfq_timeout;
> +		get_symbol_data("bfq_timeout", sizeof(int), &bfq_timeout);
> +		if (bfq_timeout) {
> +			machdep->hz = bfq_timeout * 8;
> +			if (CRASHDEBUG(2))
> +				fprintf(fp, "bfq_timeout exists: setting hz to %d\n",
> +					machdep->hz);
> +		}
> +	} else if (symbol_exists("cfq_slice_async")) {
>  		uint cfq_slice_async;
>  
>  		get_symbol_data("cfq_slice_async", sizeof(int),
> 
> 
> Lianbo, could you try this on ppc64le if it looks good?
> 
Sure. On my ppc64le machine, crash reports an hz of 96 after applying the above patch. The reason
is that the kernel calculates the value of bfq_timeout with integer division:

bfq_timeout = HZ / 8;

The actual value of HZ is 100, so bfq_timeout = 100 / 8 = 12 (truncated from 12.5). In crash,
we then calculate the value of HZ as:

HZ = bfq_timeout * 8 = 12 * 8 = 96

This is not the result we expected.

> btw, I thought 'read_expire' was better than the 'bfq_timeout' because it
> was introduced at 2.6.16 and has been unchanged, but most of kernels(vmlinux)

Sounds good. Unfortunately, 'read_expire' is a static variable in the kernel, so we
cannot get it directly through the symbol search. Maybe we should try to find static
kernel variables in another way.

If possible, I would prefer to use 'write_expire' to calculate the value of HZ
in crash as below, which avoids the above issues and gives a correct result.

HZ = write_expire / 5;

/*
 * source: block/mq-deadline.c
 */
static const int write_expire = 5 * HZ;

For example:
+       if (symbol_exists("write_expire")) { ----> This check failed here; maybe we can try to find the symbol in another way.
+               uint write_expire;
+               get_symbol_data("write_expire", sizeof(int), &write_expire);
+               if (write_expire) {
+                       machdep->hz = write_expire / 5;
+                       if (CRASHDEBUG(2))
+                               fprintf(fp, "write_expire exists: setting hz to %d\n",
+                                       machdep->hz);
+               }
+       }  else

> that I have do not have a symbol for it.  (some optimization?)
> 
I can get the values of 'read_expire' and 'write_expire' on the latest RHEL8 and later:

crash> p read_expire
$1 = 50
crash> p write_expire
$2 = 500

Thanks.
Lianbo

> static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
> 
>      RELEASE: 4.18.0-80.el8.x86_64
> 
> crash> p read_expire
> No symbol "read_expire" in current context.
> p: gdb request failed: p read_expire
> 
> Thanks,
> Kazu
> 
>>
>> const int bfq_timeout = HZ / 8;
>>
>>      RELEASE: 4.18.0-80.el8.x86_64
>>
>> crash> p bfq_timeout
>> bfq_timeout = $1 = 125
>>
>> This value has not been changed since its introduction (aee69d78dec0).
>> Recent kernels configured with CONFIG_IOSCHED_BFQ=y can be covered with this?
>>
>> [0] https://listman.redhat.com/archives/crash-utility/2021-April/msg00026.html
>>
>> Thanks,
>> Kazu
>>
>>
>>>
>>> [2] hardcode hz with the value 1000 (if kernel version > 2.6.0)
>>>
>>> [3] get the hz value from vmcore, but that relies on kernel config
>>>     such as CONFIG_IKCONFIG, etc.
>>>
>>> [4] Use sysconf(_SC_CLK_TCK) on some arches, not all arches.
>>>     See the macro definition of HZ in defs.h
>>>
>>> There seems to be no perfect solution. Any ideas?
>>>
>>>
>>> Thanks.
>>> Lianbo
>>>
>>>> Thanks,
>>>> Kazu
>>>>
>>>>>
>>>>> I tested this on ppc64le and x86_64 and the hz value reported is 1000,
>>>>> whereas the kernel CONFIG_HZ_100 is set to Y. See some logs below:
>>>>>
>>>>> crash> help -m
>>>>>               flags: 124000f5
>>>>> (KSYMS_START|MACHDEP_BT_TEXT|VM_4_LEVEL|VMEMMAP|VMEMMAP_AWARE|PHYS_ENTRY_L4|SWAP_ENTRY_L4|RADIX_MMU|OPAL_FW)
>>>>>              kvbase: c000000000000000
>>>>>   identity_map_base: c000000000000000
>>>>>            pagesize: 65536
>>>>>           pageshift: 16
>>>>>            pagemask: ffffffffffff0000
>>>>>          pageoffset: ffff
>>>>>           stacksize: 16384
>>>>>                  hz: 1000
>>>>>                 mhz: 2800
>>>>>
>>>>> [host@rhel7]$ grep CONFIG_HZ_100= redhat/configs/kernel-3.10.0-ppc64le.config
>>>>> CONFIG_HZ_100=y
>>>>>
>>>>> Fix the same by using the sysconf(_SC_CLK_TCK) value instead of the
>>>>> hardcoded HZ values depending on kernel versions.
>>>>>
>>>>
>>
>>
>> --
>> Crash-utility mailing list
>> Crash-utility@xxxxxxxxxx
>> https://listman.redhat.com/mailman/listinfo/crash-utility
> 
