Re: [PATCH] Fix machdep->HZ calculation for kernel versions > 2.6.0

lijiang <lijiang@xxxxxxxxxx> · Fri, 23 Apr 2021 15:22:53 +0800

On 2021-04-23 14:41, lijiang wrote:
> On 2021-04-22 22:26, lijiang wrote:
>> On 2021-04-22 17:33, HAGIO KAZUHITO(萩尾　一仁) wrote:
>>> -----Original Message-----
>>>> -----Original Message-----
>>>>> On 2021-01-12 16:24, HAGIO KAZUHITO(萩尾　一仁) wrote:
>>>>>> Hi Bhupesh,
>>>>>>
>>>>>> -----Original Message-----
>>>>>>> We have hard-coded the HZ value for some ARCHs to either 1000 or 100
>>>>>>> (mainly for kernel versions > 2.6.0), which causes 'help -m' to show
>>>>>>> an incorrect hz value for various architectures.
>>>>>>
>>>>>> Good catch, but it seems crash uses (cfq_slice_async * 25) for machdep->hz
>>>>>> if it exists (please see task_init()).  RHEL7 has it, but RHEL8 does not.
>>>>>> What do you see on RHEL8 for x86_64 with your patch?
>>>>>>
>>>>>
>>>>> The symbol 'cfq_slice_async' has been removed from the upstream kernel:
>>>>> f382fb0bcef4 ("block: remove legacy IO schedulers")
>>>>>
>>>>> And RHEL8 has also removed it.
>>>>>
>>>>>> We should search for an alternate way like the current one first.
>>>>>>
>>>>>
>>>>> Currently, there are several ways to get the value of HZ as below:
>>>>>
>>>>> [1] calculate hz via the symbol 'cfq_slice_async'
>>>>>     But this symbol has been removed from upstream kernel
>>>>
>>>> According to [0] below, 'cfq_slice_async' cannot be used for the HZ
>>>> calculation on 4.8 and later kernels.  I've not found a perfect alternative,
>>>> but how about using 'bfq_timeout' for 4.12 and later, including RHEL8?
>>>
>>> e.g. like this:
>>>
>>> --- a/task.c
>>> +++ b/task.c
>>> @@ -417,7 +417,16 @@ task_init(void)
>>>  
>>>  	STRUCT_SIZE_INIT(cputime_t, "cputime_t");
>>>  
>>> -	if (symbol_exists("cfq_slice_async")) {
>>> +	if (symbol_exists("bfq_timeout")) {
>>> +		uint bfq_timeout;
>>> +		get_symbol_data("bfq_timeout", sizeof(int), &bfq_timeout);
>>> +		if (bfq_timeout) {
>>> +			machdep->hz = bfq_timeout * 8;
>>> +			if (CRASHDEBUG(2))
>>> +				fprintf(fp, "bfq_timeout exists: setting hz to %d\n",
>>> +					machdep->hz);
>>> +		}
>>> +	} else if (symbol_exists("cfq_slice_async")) {
>>>  		uint cfq_slice_async;
>>>  
>>>  		get_symbol_data("cfq_slice_async", sizeof(int),
>>>
>>>
>>> Lianbo, could you try this on ppc64le if it looks good?
>>>
>> Sure. On my ppc64le machine, crash reported hz = 96 after applying the above patch. The
>> reason is that the kernel calculates the value of bfq_timeout as follows:
>>
>> bfq_timeout = HZ / 8;
>>
>> The actual value of HZ is 100, so bfq_timeout = 100 / 8 = 12 (integer division), but
>> crash then calculates the value of HZ as:
>>
>> HZ = bfq_timeout * 8 = 12 * 8 = 96
>>
>> This is not the result we expected.
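To make the truncation concrete, here is a minimal sketch of the round trip
described above. The function names are hypothetical and exist only for this
illustration; they are not part of crash or the kernel.

```c
#include <assert.h>

/* Mimics the kernel initializer: bfq_timeout = HZ / 8 (integer division). */
int bfq_timeout_from_hz(int hz)
{
	assert(hz > 0);
	return hz / 8;
}

/* Mimics the proposed crash calculation: hz = bfq_timeout * 8. */
int hz_from_bfq_timeout(int bfq_timeout)
{
	return bfq_timeout * 8;
}

/* Ticks lost by the round trip; 0 only when HZ is a multiple of 8. */
int bfq_roundtrip_error(int hz)
{
	return hz - hz_from_bfq_timeout(bfq_timeout_from_hz(hz));
}
```

With HZ = 100 the round trip loses 4 ticks (100 -> 12 -> 96), while HZ = 1000
survives intact (1000 -> 125 -> 1000), which matches the x86_64 and ppc64le
observations in this thread.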
>>
>>> btw, I thought 'read_expire' was better than 'bfq_timeout' because it
>>> was introduced in 2.6.16 and has been unchanged, but most of the kernels (vmlinux)
>>
>> Sounds good. But unfortunately, 'read_expire' is a static variable in the kernel, so we
>> cannot get it directly through the symbol search. Maybe we should try to find a static
>> kernel variable in another way.
>>
>> If possible, I would prefer to use 'write_expire' to calculate the value of HZ
>> in crash as below, which avoids the above issue and gives a correct result.
>>
>> HZ = write_expire / 5;
>>
>> /*
>>  * source: block/mq-deadline.c
>>  */
>> static const int write_expire = 5 * HZ;
>>
>> For example:
>> +       if (symbol_exists("write_expire")) { /* this check failed here; maybe we can find the symbol in another way */
>> +               uint write_expire;
>> +               get_symbol_data("write_expire", sizeof(int), &write_expire);
>> +               if (write_expire) {
>> +                       machdep->hz = write_expire / 5;
>> +                       if (CRASHDEBUG(2))
>> +                               fprintf(fp, "write_expire exists: setting hz to %d\n",
>> +                                       machdep->hz);
>> +               }
>> +       }  else
>>
>>> that I have do not have a symbol for it.  (some optimization?)
>>>
>> I can get the values of 'read_expire' and 'write_expire' on the latest RHEL8 or later.
>>
>> crash> p read_expire
>> $1 = 50
>> crash> p write_expire
>> $2 = 500
>>
>> Thanks.
>> Lianbo
>>
> 
> What do you think about the following changes? It works for me.
> 
> /*
>  * source: net/ipv4/inetpeer.c
>  * int inet_peer_minttl __read_mostly = 120 * HZ;  (TTL under high load: 120 sec)
>  */
> 
> diff --git a/task.c b/task.c
> index 423cd45..4af3ef3 100644
> --- a/task.c
> +++ b/task.c
> @@ -417,7 +417,17 @@ task_init(void)
>  
>         STRUCT_SIZE_INIT(cputime_t, "cputime_t");
>  
> -       if (symbol_exists("cfq_slice_async")) {
> +       if (symbol_exists("inet_peer_minttl")) {
> +               uint inet_peer_minttl;
> +               get_symbol_data("inet_peer_minttl", sizeof(int), &inet_peer_minttl);
> +               if (inet_peer_minttl) {
> +                       machdep->hz = inet_peer_minttl / 120;
> +                       if (CRASHDEBUG(2))
> +                               fprintf(fp, "inet_peer_minttl exists: setting hz to %d\n",
> +                                       machdep->hz);
> +               }
> +       }  else if (symbol_exists("cfq_slice_async")) {
>                 uint cfq_slice_async;
> 

Also, I would prefer to replace 'cfq_slice_async' with 'inet_peer_minttl' as below.
The reason is that it has hardly changed since v2.6.12-rc2, and the variable lives in
net/ipv4/inetpeer.c, which is built in most kernel configurations. What's your
opinion?

diff --git a/task.c b/task.c
index 423cd454502b..5994fe2b7351 100644
--- a/task.c
+++ b/task.c
@@ -417,18 +417,18 @@ task_init(void)

        STRUCT_SIZE_INIT(cputime_t, "cputime_t");

-       if (symbol_exists("cfq_slice_async")) {
-               uint cfq_slice_async;
+       if (symbol_exists("inet_peer_minttl")) {
+               int inet_peer_minttl;

-               get_symbol_data("cfq_slice_async", sizeof(int), 
-                       &cfq_slice_async);
+               get_symbol_data("inet_peer_minttl", sizeof(int),
+                       &inet_peer_minttl);

-               if (cfq_slice_async) {
-                       machdep->hz = cfq_slice_async * 25; 
+               if (inet_peer_minttl) {
+                       machdep->hz = inet_peer_minttl / 120;

                        if (CRASHDEBUG(2))
                                fprintf(fp,
-                                   "cfq_slice_async exists: setting hz to %d\n", 
+                                   "inet_peer_minttl exists: setting hz to %d\n",
                                        machdep->hz);
                }
        }
-- 
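
As a side note, the division above only gives the right answer while the
variable still holds an exact multiple of HZ. A minimal sketch of a guard,
assuming a hypothetical helper named hz_from_multiple() that is not part of
crash:

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of crash): derive HZ from a kernel
 * variable that was initialized as "multiplier * HZ".
 * Returns 0 when the value is not a positive exact multiple, so the
 * caller can fall back to another method.
 */
int hz_from_multiple(int value, int multiplier)
{
	assert(multiplier > 0);

	if (value <= 0 || value % multiplier != 0)
		return 0;

	return value / multiplier;
}
```

For example, hz_from_multiple(12000, 120) yields 100, and a value such as
12001 yields 0, so task_init() could keep its existing default in that case.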

Thanks.

> Thanks.
> Lianbo
> 
>>> static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
>>>
>>>      RELEASE: 4.18.0-80.el8.x86_64
>>>
>>> crash> p read_expire
>>> No symbol "read_expire" in current context.
>>> p: gdb request failed: p read_expire
>>>
>>> Thanks,
>>> Kazu
>>>
>>>>
>>>> const int bfq_timeout = HZ / 8;
>>>>
>>>>      RELEASE: 4.18.0-80.el8.x86_64
>>>>
>>>> crash> p bfq_timeout
>>>> bfq_timeout = $1 = 125
>>>>
>>>> This value has not been changed since its introduction (aee69d78dec0).
>>>> Recent kernels configured with CONFIG_IOSCHED_BFQ=y can be covered with this?
>>>>
>>>> [0] https://listman.redhat.com/archives/crash-utility/2021-April/msg00026.html
>>>>
>>>> Thanks,
>>>> Kazu
>>>>
>>>>
>>>>>
>>>>> [2] hardcode hz with the value 1000 (if kernel version > 2.6.0)
>>>>>
>>>>> [3] get the hz value from vmcore, but that relies on kernel config
>>>>>     such as CONFIG_IKCONFIG, etc.
>>>>>
>>>>> [4] Use sysconf(_SC_CLK_TCK) on some arches, not all arches.
>>>>>     See the macro definition of HZ in defs.h
>>>>>
>>>>> There seems to be no perfect solution. Any ideas?
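
For reference on option [4], here is a minimal sketch of what the
sysconf(_SC_CLK_TCK) query looks like. The wrapper name is hypothetical. Note
that sysconf() asks the system crash is running on, so the result describes
that host's clock-tick rate and, as far as I understand, need not match the
HZ of the kernel that produced the vmcore.

```c
#include <unistd.h>

/*
 * Hypothetical wrapper: clock ticks per second as reported by the
 * host running this program, or -1 if the query fails.
 */
long host_clock_ticks_per_second(void)
{
	return sysconf(_SC_CLK_TCK);
}
```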
>>>>>
>>>>>
>>>>> Thanks.
>>>>> Lianbo
>>>>>
>>>>>> Thanks,
>>>>>> Kazu
>>>>>>
>>>>>>>
>>>>>>> I tested this on ppc64le and x86_64 and the hz value reported is 1000,
>>>>>>> whereas the kernel CONFIG_HZ_100 is set to Y. See some logs below:
>>>>>>>
>>>>>>> crash> help -m
>>>>>>>               flags: 124000f5
>>>>>>> (KSYMS_START|MACHDEP_BT_TEXT|VM_4_LEVEL|VMEMMAP|VMEMMAP_AWARE|PHYS_ENTRY_L4|SWAP_ENTRY_L4|RADIX_MMU|OPAL_FW)
>>>>>>>              kvbase: c000000000000000
>>>>>>>   identity_map_base: c000000000000000
>>>>>>>            pagesize: 65536
>>>>>>>           pageshift: 16
>>>>>>>            pagemask: ffffffffffff0000
>>>>>>>          pageoffset: ffff
>>>>>>>           stacksize: 16384
>>>>>>>                  hz: 1000
>>>>>>>                 mhz: 2800
>>>>>>>
>>>>>>> [host@rhel7]$ grep CONFIG_HZ_100= redhat/configs/kernel-3.10.0-ppc64le.config
>>>>>>> CONFIG_HZ_100=y
>>>>>>>
>>>>>>> Fix the same by using the sysconf(_SC_CLK_TCK) value instead of the
>>>>>>> hardcoded HZ values depending on kernel versions.
>>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Crash-utility mailing list
>>>> Crash-utility@xxxxxxxxxx
>>>> https://listman.redhat.com/mailman/listinfo/crash-utility
>>>
