Re: FIO log averaging issue with --write_bw_log and --write_iops_log using > 1000 log_avg_msec values

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Chris,

Sorry for the slow reply. So this change is to deal with the case
where the time being added is two more seconds in size?

On 2 March 2017 at 21:52, Chris Taylor <ctaylor@xxxxxxxxxx> wrote:
> Hi,
> I am less concerned about any possible timing issues as they are probably miniscule and I can't think of the scenario where it would be a problem and cause less accurate results to really be concerned about.
> Tests pretty much prove we are getting close to what we expect:
> fio --eta=never --ioengine=libaio --readwrite=write --time_based=1 --runtime=60 --filename=/tmp/fio --size=1M --direct=1 --name=test --write_iops_log=/tmp/test --log_avg_msec=5500 --log_unix_epoch=1
> 1488489137632, 6413, 1, 0
> 1488489143131, 5503, 1, 0
> 1488489148631, 5585, 1, 0
> 1488489154131, 5883, 1, 0
> 1488489159631, 5726, 1, 0
> 1488489165131, 5331, 1, 0
> 1488489170631, 5772, 1, 0
> 1488489176131, 5661, 1, 0
> 1488489181631, 5970, 1, 0
> 1488489187131, 5776, 1, 0
> [5499, 5500, 5500, 5500, 5500, 5500, 5500, 5500, 5500]
>
> However, the problem still exists where in timeval_add_msec gets a value greater than 1000 msecs the tv->tv_usec it will grow beyond usecs (1000000) and causes unexpected behavior.  Possibly this is the better fix to avoid other callers having to adjust for  the overflow in timeval_add_msecs?\
> diff --git a/time.c b/time.c
> index f5dc049..a318ad5 100644
> --- a/time.c
> +++ b/time.c
> @@ -9,10 +9,9 @@ static unsigned long ns_granularity;
>  void timeval_add_msec(struct timeval *tv, unsigned int msec)
>  {
>         tv->tv_usec += 1000 * msec;
> -       if (tv->tv_usec >= 1000000) {
> -               tv->tv_usec -= 1000000;
> -               tv->tv_sec++;
> -       }
> +    tv->tv_sec += tv->tv_usec / 1000000;
> +    tv->tv_usec %= 1000000;
> +
>  }
>
>  /*
> Running the same job as above with the change:
> 1488489763099, 5773, 1, 0
> 1488489768599, 5852, 1, 0
> 1488489774099, 5654, 1, 0
> 1488489779599, 5482, 1, 0
> 1488489785099, 5650, 1, 0
> 1488489790599, 5752, 1, 0
> 1488489796099, 5390, 1, 0
> 1488489801599, 5746, 1, 0
> 1488489807099, 5610, 1, 0
> 1488489812599, 5361, 1, 0
> [5500, 5500, 5500, 5500, 5500, 5500, 5500, 5500, 5500]
>
> Thanks,
> Chris
>
> -----Original Message-----
> From: Sitsofe Wheeler [mailto:sitsofe@xxxxxxxxx]
> Sent: Tuesday, February 28, 2017 11:50 AM
> To: Chris Taylor <ctaylor@xxxxxxxxxx>
> Cc: fio@xxxxxxxxxxxxxxx
> Subject: Re: FIO log averaging issue with --write_bw_log and --write_iops_log using > 1000 log_avg_msec values
>
> Hi,
>
> Re timing issues: do you mean because of usec /= 1000; on line 428? As mtime_since() returns milliseconds it would mean all fractional millisecond amounts would be pushed downward so potentially things could be be under by a millisecond. If that's what you're thinking of I guess how bad that is depends on how much accuracy you need and how much jitter there is within the system already. Perhaps you can demonstrate a problem scenario so folks numerate than I will see the problem too...
>
> Re rounding problem - fair enough.
>
> On 28 February 2017 at 00:47, Chris Taylor <ctaylor@xxxxxxxxxx> wrote:
>> Hi Sitsofe,
>> Looks like this is handled in commit be6bb2b72608d7efbec13d06c67446e229136afa.
>> I am no longer able to reproduce the issue with latest version of FIO 2.17, with that change as usecs will now be 0.  Could this potentially cause some timing issues?
>>
>> Prior to that commit...
>> The job below would result in bw_logs and iops_log that have skewed
>> results both the add_iops_samples and add_bw_samples makes a call to
>> mtime_since with start time as the _sample_time
>> ./stat.c:2435:  spent = mtime_since(&td->bw_sample_time, t)
>> ./stat.c:2510:  spent = mtime_since(&td->iops_sample_time, t);
>>
>> ./stat.c:2473:  timeval_add_msec(&td->bw_sample_time,
>> td->o.bw_avg_time);
>> ./stat.c:2548:  timeval_add_msec(&td->iops_sample_time,
>> td->o.iops_avg_time);
>>
>> stat.c:2480:27: runtime error: unsigned integer overflow: 2000 - 18446744073708449 cannot be represented in type 'unsigned long'
>> stat.c:2555:29: runtime error: unsigned integer overflow: 2000 - 18446744073708449 cannot be represented in type 'unsigned long'
>>
>> The overflow at stat.c:2265 is a different issue possibly a rounding problem with msec and usecs.
>
> --
> Sitsofe | http://sucs.org/~sits/
>
> On 28 February 2017 at 00:47, Chris Taylor <ctaylor@xxxxxxxxxx> wrote:
>> Hi Sitsofe,
>> Looks like this is handled in commit be6bb2b72608d7efbec13d06c67446e229136afa.
>> I am no longer able to reproduce the issue with latest version of FIO 2.17, with that change as usecs will now be 0.  Could this potentially cause some timing issues?
>>
>> Prior to that commit...
>> The job below would result in bw_logs and iops_log that have skewed
>> results both the add_iops_samples and add_bw_samples makes a call to
>> mtime_since with start time as the _sample_time
>> ./stat.c:2435:  spent = mtime_since(&td->bw_sample_time, t)
>> ./stat.c:2510:  spent = mtime_since(&td->iops_sample_time, t);
>>
>> ./stat.c:2473:  timeval_add_msec(&td->bw_sample_time,
>> td->o.bw_avg_time);
>> ./stat.c:2548:  timeval_add_msec(&td->iops_sample_time,
>> td->o.iops_avg_time);
>>
>> stat.c:2480:27: runtime error: unsigned integer overflow: 2000 - 18446744073708449 cannot be represented in type 'unsigned long'
>> stat.c:2555:29: runtime error: unsigned integer overflow: 2000 - 18446744073708449 cannot be represented in type 'unsigned long'
>>
>> The overflow at stat.c:2265 is a different issue possibly a rounding problem with msec and usecs.
>>
>>
>> Thanks,
>> Chris
>>
>> On 23 February 2017 at 20:55, Chris Taylor <ctaylor@xxxxxxxxxx> wrote:
>>> There is an issue in FIO when using the detailed bandwidth and iops logging with averaging over a period of time > 1 second.
>>> It seems that usecs overflows which later causes negative time diff values resulting in skewed toward 0 results.
>>> I have attached a potential fix that should prevent usecs from going beyond 1000000.
>>
>> Is this problem shown by the following job:
>>
>> ./fio --ioengine=null --write_bw_log=/dev/null --log_avg_msec=2000
>> --size=100M --runtime=10s --time_based --name=overflow
>>
>> Among Clang's undefined behaviour sanitizer output was this for the above:
>> stat.c:2265:43: runtime error: unsigned integer overflow: 1999 - 2000 cannot be represented in type 'unsigned long'
>> stat.c:2265:28: runtime error: unsigned integer overflow: 3999 -
>> 4294967295 cannot be represented in type 'unsigned long'
>>
>>> diff --git a/time.c b/time.c
>>> index f5dc049..c748bee 100644
>>> --- a/time.c
>>> +++ b/time.c
>>> @@ -8,11 +8,18 @@ static unsigned long ns_granularity;
>>>
>>> void timeval_add_msec(struct timeval *tv, unsigned int msec) {
>>> -       tv->tv_usec += 1000 * msec;
>>> -       if (tv->tv_usec >= 1000000) {
>>> -               tv->tv_usec -= 1000000;
>>> -               tv->tv_sec++;
>>> -       }
>>> +       int adj_usec = 1000 * msec;
>>> +       int adj_sec = 0;
>>> +       tv->tv_usec +=  adj_usec;
>>> +       if (adj_usec >= 1000000) {
>>> +               adj_sec = adj_usec / 1000000;
>>> +               tv->tv_usec -=  adj_sec * 1000000;
>>> +               tv->tv_sec += adj_sec;
>>> +    }
>>> +    if (tv->tv_usec >= 1000000){
>>> +            tv->tv_usec -= 1000000;
>>> +            tv->tv_sec++;
>>> +    }
>>> }
>>
>> Is it still safe to use int if you're targeting a 32 bit platform?
>> --
>> Sitsofe | http://sucs.org/~sits/
>
>
>
> --
> Sitsofe | http://sucs.org/~sits/



-- 
Sitsofe | http://sucs.org/~sits/
--
To unsubscribe from this list: send the line "unsubscribe fio" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel]     [Linux SCSI]     [Linux IDE]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux SCSI]

  Powered by Linux