On 1/12/24 2:58 AM, Jens Axboe wrote: >On 1/11/24 6:12 PM, Xiaobing Li wrote: >> On 1/10/24 16:15 AM, Jens Axboe wrote: >>> On 1/10/24 2:05 AM, Xiaobing Li wrote: >>>> On 1/5/24 04:02 AM, Pavel Begunkov wrote: >>>>> On 1/3/24 05:49, Xiaobing Li wrote: >>>>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote: >>>>>>> Why it uses jiffies instead of some task run time? >>>>>>> Consequently, why it's fine to account irq time and other >>>>>>> preemption? (hint, it's not) >>>>>>> >>>>>>> Why it can't be done with userspace and/or bpf? Why >>>>>>> can't it be estimated by checking and tracking >>>>>>> IORING_SQ_NEED_WAKEUP in userspace? >>>>>>> >>>>>>> What's the use case in particular? Considering that >>>>>>> one of the previous revisions was uapi-less, something >>>>>>> is really fishy here. Again, it's a procfs file nobody >>>>>>> but a few would want to parse to use the feature. >>>>>>> >>>>>>> Why it just keeps aggregating stats for the whole >>>>>>> life time of the ring? If the workload changes, >>>>>>> that would either totally screw the stats or would make >>>>>>> it too inert to be useful. That's especially relevant >>>>>>> for long running (days) processes. There should be a >>>>>>> way to reset it so it starts counting anew. >>>>>> >>>>>> Hi, Jens and Pavel, >>>>>> I carefully read the questions you raised. >>>>>> First of all, as to why I use jiffies to statistics time, it >>>>>> is because I have done some performance tests and found that >>>>>> using jiffies has a relatively smaller loss of performance >>>>>> than using task run time. Of course, using task run time is >>>>> >>>>> How does taking a measure for task runtime looks like? I expect it to >>>>> be a simple read of a variable inside task_struct, maybe with READ_ONCE, >>>>> in which case the overhead shouldn't be realistically measurable. Does >>>>> it need locking? >>>> >>>> The task runtime I am talking about is similar to this: >>>> start = get_system_time(current); >>>> do_io_part(); >>>> sq->total_time += get_system_time(current) - start; >>> >>> Not sure what get_system_time() is, don't see that anywhere. >>> >>>> Currently, it is not possible to obtain the execution time of a piece of >>>> code by a simple read of a variable inside task_struct. >>>> Or do you have any good ideas? >>> >>> I must be missing something, because it seems like all you need is to >>> read task->stime? You could possible even make do with just logging busy >>> loop time, as getrusage(RUSAGE_THREAD, &stat) from userspace would then >>> give you the total time. >>> >>> stat.ru_stime would then be the total time, the thread ran, and >>> 1 - (above_busy_stime / stat.ru_stime) would give you the time the >>> percentage of time the thread ran and did useful work (eg not busy >>> looping. >> >> getrusage can indeed get the total time of the thread, but this >> introduces an extra function call, which is relatively more >> complicated than defining a variable. In fact, recording the total >> time of the loop and the time of processing the IO part can achieve >> our observation purpose. Recording only two variables will have less >> impact on the existing performance, so why not choose a simpler and >> effective method. > >I'm not opposed to exposing both of them, it does make the API simpler. >If we can call it an API... I think the main point was using task->stime >for it rather than jiffies etc. Hi, Jens and Pavel I modified the code according to your opinions. I got the total time of the sqpoll thread through getrusage. eg: fdinfo.c: +long sq_total_time = 0; +long sq_work_time = 0; if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) { struct io_sq_data *sq = ctx->sq_data; sq_pid = sq->task_pid; sq_cpu = sq->sq_cpu; + struct rusage r; + getrusage(sq->thread, RUSAGE_SELF, &r); + sq_total_time = r.ru_stime.tv_sec * 1000000 + r.ru_stime.tv_usec; + sq_work_time = sq->work_time; } seq_printf(m, "SqThread:\t%d\n", sq_pid); seq_printf(m, "SqThreadCpu:\t%d\n", sq_cpu); +seq_printf(m, "SqTotalTime:\t%ldus\n", sq_total_time); +seq_printf(m, "SqWorkTime:\t%ldus\n", sq_work_time); seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files); The working time of the sqpoll thread is obtained through ktime_get(). eg: sqpoll.c: +ktime_t start, diff; +start = ktime_get(); list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) { int ret = __io_sq_thread(ctx, cap_entries); if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list))) sqt_spin = true; } if (io_run_task_work()) sqt_spin = true; +diff = ktime_sub(ktime_get(), start); +if (sqt_spin == true) + sqd->work_time += ktime_to_us(diff); The test results are as follows: Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq SqMask: 0x3 SqHead: 3197153 SqTail: 3197153 CachedSqHead: 3197153 SqThread: 9231 SqThreadCpu: 11 SqTotalTime: 92215321us SqWorkTime: 15106578us Do you think this solution work?