Hi, Maintainer, Is there any feedback for this patch? -----Original Message----- From: Fan, Shupeng (Contractors) <Shupeng.Fan@xxxxxxxx> Sent: Monday, December 11, 2023 3:44 PM To: linux-btrace@xxxxxxxxxxxxxxx Cc: Su, Friendy <Friendy.Su@xxxxxxxx>; Fan, Shupeng (Contractors) <Shupeng.Fan@xxxxxxxx>; Aoyama, Wataru (SGC) <Wataru.Aoyama@xxxxxxxx> Subject: [PATCH v1] blktrace: output the earliest block trace log firstly Fix the problem of huge timestamps under pipe mode. root@target:~# blktrace -d /dev/mmcblk1 -o - | blkparse -i - -o blkparse.log root@target:~# grep 1266874889 blkparse.log 179,96 1 1 1266874889.702689491 3871 A WS 2080 + 1 <- (179,97) 32 179,96 1 2 1266874889.702693491 3871 Q WS 2080 + 1 [mkdir] 179,96 1 3 1266874889.702734116 3871 G WS 2080 + 1 [mkdir] 179,96 1 4 1266874889.702736616 3871 P N [mkdir] 179,96 1 5 1266874889.702742991 3871 A WS 3872 + 1 <- (179,97) 1824 179,96 1 6 1266874889.702743491 3871 Q WS 3872 + 1 [mkdir] ...... 179,96 1 206 1266874889.706598366 114 D WS 5728 + 64 [kworker/1:1H] 179,96 1 207 1266874889.706626741 15 C WS 2080 + 1 [0] 179,96 1 208 1266874889.707723741 15 C WS 3872 + 1 [0] blktrace creates each cpu specific thread to read trace log from kernel, then pipe out to blkparse. blkparse treats the smallest timestamp in the 1st batch of logs got from pipe as the genesis_time. All other logs's timestamp will minus genesis_time to show as relative time. However, the genesis_time obtained from the smallest timestamp in the 1st batch of logs may not the earliest. Assume there are 4 CPUs, blktrace launches 4 threads, each thread read one CPU buffer from kernel to a independent list. blktrace main thread checks each CPU's list in a loop, once there are available logs, pipe out. Due to the scheduling of 4 read threads and main thread, there are 2 risks that blkparse can not obtain the earliest timestamp as genesis_time. 1. the earliest log kernel filled CPU buffer may not earliest read by blktrace. E.g. 1st: kernel CPU#1 buffer is filled -> log_1 (log_1 is earliest) 2nd: kernel CPU#0 buffer is filled -> log_0 3rd: blktrace CPU#0 thread is scheduled, log_0 is read 4th: blktrace main thread is scheduled, log_0 is read and piped out 5th: blkparse read log_0 from pipe, treat: log_0->timestamp = genesis_time, then dump out log_0 to stdout 6th: blktrace CPU#1 thread is scheduled, log_1 is read 7th: blktrace main thread is scheduled, log_1 is read and piped out 8th: blkparse read log_1 from pipe, calculate: (relative timestamp = log_1->timestamp - genesis_time) (negative) dump out the negative timestamp 2. the earliest log blktrace read to list may not 1st piped out. E.g. 1st: blktrace main thread is scheduled, read from CPU#0 buffer firstly. 2nd: at this moment, CPU#2 thread read log_2 (earliest), CPU#1 thread read log_1. 3rd: blktrace main thread read CPU#1 buffer. log_1 is piped out. 4th: blkparse read log_1, treat log_1->timestamp = genensis, then dump out log_1. 5th: blktrace main thread read CPU#2 buffer. log_2 is piped out. 6th: blkparse read log_2, relative timestamp becomes negative. This patch makes when any blktrace read thread obtained the 1st valid block trace log, all other read thread will be read kernel CPU buffer at least once. This guarantees the trace logs in any CPU buffer be obtained, then sorted the earliest. blktrace main thread will pipe out the earliest trace log firstly. Signed-off-by: shupeng-fan <Shupeng.Fan@xxxxxxxx> Signed-off-by: friendy-su <friendy.su@xxxxxxxx> --- blktrace.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 121 insertions(+), 3 deletions(-) diff --git a/blktrace.c b/blktrace.c index 3444fbb..d530a3e 100644 --- a/blktrace.c +++ b/blktrace.c @@ -298,6 +298,17 @@ static LIST_HEAD(tracers); static volatile int done; +/* + * guarantee the earliest block trace log piped output first. + * blkparse treats the smallest timestamp in 1st batch of logs as genesis. + */ +static pthread_mutex_t earliest_blk_trace_mutex = PTHREAD_MUTEX_INITIALIZER; +static pthread_key_t self_done_read; +static volatile int done_read_cpu_num; +static volatile int earliest_blk_trace_cpu = -1; +static volatile unsigned long long earliest_blk_trace_time = -1ULL; +static int earliest_blk_trace_out; + /* * tracer threads add entries, the main thread takes them off and processes * them. These protect the dp_entries variable. @@ -1245,6 +1256,80 @@ static int wait_empty_entries(void) return !done; } +/* + * Screen the trace log. Once any thread read a valid block trace log from + * kernel, starts to count thread number of read() executed, record the + * earliest timestamp and cpu which got it. + */ +static void screen_earliest_blk_trace(struct trace_buf *tbp) +{ + int off, t_len, *is_done; + struct blk_io_trace *bit; + + /* + * read() got a valid log + */ + if (tbp->len > 0) { + is_done = (int *)pthread_getspecific(self_done_read); + /* + * if this thread has not read valid block trace log, check this log, + * otherwise do nothing. + */ + if (*is_done == 0) { + /* + * track the trace log buffer until the 1st block trace log + */ + off = 0; + while (off + (int)sizeof(*bit) <= tbp->len) { + bit = (struct blk_io_trace *)(tbp->buf + off); + t_len = sizeof(*bit) + bit->pdu_len; + if (off + t_len > tbp->len) + break; + if ((bit->action & BLK_TC_ACT(BLK_TC_NOTIFY)) == 0 || + (bit->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) { + + pthread_mutex_lock(&earliest_blk_trace_mutex); + /* + * it's a block trace log. if its timestamp is smaller than + * current earliest, update the earliest. + */ + if (bit->time < earliest_blk_trace_time) { + earliest_blk_trace_time = bit->time; + earliest_blk_trace_cpu = (int)bit->cpu; + done_read_cpu_num++; + *is_done = 1; + } + pthread_mutex_unlock(&earliest_blk_trace_mutex); + + /* + * in one buffer, first is always earliest, + * no need check more + */ + break; + } + off += t_len; + } + } + /* + * read() did not got log or error + */ + } else { + pthread_mutex_lock(&earliest_blk_trace_mutex); + /* + * if this thread is the 1st read() after block trace got by another + * thread, set self thread read done and count global done_read_cpu_num. + */ + if (done_read_cpu_num != 0) { + is_done = (int *)pthread_getspecific(self_done_read); + if (*is_done == 0) { + *is_done = 1; + done_read_cpu_num++; + } + } + pthread_mutex_unlock(&earliest_blk_trace_mutex); + } +} + static int add_devpath(char *path) { int fd; @@ -1450,10 +1535,34 @@ static void __process_trace_bufs(void) continue; } - list_replace_init(&hd->head, &list); - pthread_mutex_unlock(&hd->mutex); + /* + * earliest blk trace should be output firstly. blkparse will use + * the smallest timestamp in the 1st batch of trace logs as genesis. + */ + if (earliest_blk_trace_out == 0) { + /* + * if all cpu has done read() after a valid blk trace log got, + * earliest_blk_trace_cpu is exact cpu got earliest log, start + * pipe out log from this cpu. + */ + if (done_read_cpu_num == max_cpus && + earliest_blk_trace_cpu == cpu) { - handled += handle_list(hd, &list); + list_replace_init(&hd->head, &list); + pthread_mutex_unlock(&hd->mutex); + + handled += handle_list(hd, &list); + earliest_blk_trace_out = 1; + } else { + pthread_mutex_unlock(&hd->mutex); + continue; + } + } else { + list_replace_init(&hd->head, &list); + pthread_mutex_unlock(&hd->mutex); + + handled += handle_list(hd, &list); + } } } @@ -1805,6 +1914,12 @@ static int handle_pfds_entries(struct tracer *tp, int nevs, int force_read) for (i = 0; i < ndevs; i++, pfd++, iop++) { if (pfd->revents & POLLIN || force_read) { tbp->len = read(iop->ifd, tbp->buf, buf_size); + /* + * screen the earliest blk trace log + * update earliest cpu, timestamp + */ + screen_earliest_blk_trace(tbp); + if (tbp->len > 0) { pdc_dr_update(iop->dpp, tp->cpu, tbp->len); add_trace_buf(iop->dpp, tp->cpu, &tbp); @@ -1837,7 +1952,9 @@ static void *thread_main(void *arg) { int ret, ndone, to_val; struct tracer *tp = arg; + int is_read_done = 0; + pthread_setspecific(self_done_read, &is_read_done); ret = lock_on_cpu(tp->cpu); if (ret) goto err; @@ -1927,6 +2044,7 @@ static void start_tracers(void) struct list_head *p; size_t alloc_size = CPU_ALLOC_SIZE(max_cpus); + pthread_key_create(&self_done_read, NULL); for (cpu = 0; cpu < max_cpus; cpu++) { if (!CPU_ISSET_S(cpu, alloc_size, online_cpus)) { /* -- 2.17.1