Re: fio server/client disconnect bug

On 3/20/18 6:16 PM, Jeff Furlong wrote:
> Revisiting this issue.  It seems the call stack is:
> 
> fio_handle_clients()
>     fio_handle_client()
>         case FIO_NET_CMD_TS:
>             ops->thread_status(client, cmd);
>             .thread_status    = handle_ts
>                 static void handle_ts(struct fio_client *client, struct fio_net_cmd *cmd)
>                 {
>                     struct cmd_ts_pdu *p = (struct cmd_ts_pdu *) cmd->payload;
>                     struct flist_head *opt_list = NULL;
>                     struct json_object *tsobj;
> 
>                     if (client->opt_lists && p->ts.thread_number <= client->jobs)
>                         opt_list = &client->opt_lists[p->ts.thread_number - 1];
> 
>                     tsobj = show_thread_status(&p->ts, &p->rs, opt_list, NULL);
>                     client->did_stat = true;
>                     if (tsobj) {
>                         json_object_add_client_info(tsobj, client);
>                         json_array_add_value_object(clients_array, tsobj);
>                     }
> 
>                     if (sum_stat_clients <= 1)
>                         return;
> 
>                     sum_thread_stats(&client_ts, &p->ts, sum_stat_nr == 1);
>                     sum_group_stats(&client_gs, &p->rs);
> 
>                     client_ts.members++;
>                     client_ts.thread_number = p->ts.thread_number;
>                     client_ts.groupid = p->ts.groupid;
>                     client_ts.unified_rw_rep = p->ts.unified_rw_rep;
>                     client_ts.sig_figs = p->ts.sig_figs;
> 
>                     if (++sum_stat_nr == sum_stat_clients) {
>                         strcpy(client_ts.name, "All clients");
>                         tsobj = show_thread_status(&client_ts, &client_gs, NULL, NULL);
>                         if (tsobj) {
>                             json_object_add_client_info(tsobj, client);
>                             json_array_add_value_object(clients_array, tsobj);
>                         }
>                     }
>                 }
> 
> And when sum_stat_clients <= 1, we never print the "All clients" summary.
> Actually, we miss an entire client, so neither that client's individual
> summary nor the "All clients" summary is output.  It seems one client
> finishes just slightly before the other, but we remove it from the list
> of clients too quickly.  I tried adjusting the timeout and such, but
> that didn't completely remove the issue.  Any specific thoughts?

sum_stat_clients is set when we start everything up, so that should
always be '2' for your case. So I'm a little puzzled as to what is going
on here. Do any of the jobs ever end in error, and that's why we are
missing a report from one of the jobs? Or are you referring to timing on
receiving the stats output, somehow racing with each other and we're
missing one of them? The latter could result in displaying just one
output, and never getting ++sum_stat_nr == 2 and displaying the "All
clients" output.
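
To make the second scenario concrete, here is a minimal standalone sketch
of just the counting logic. It is not fio code: summary_state,
on_ts_report() and summary_printed() are made-up names, and the two
fields only stand in for sum_stat_clients and sum_stat_nr. It assumes
each stats report that arrives bumps the counter exactly once.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the client-side counters; not fio's real state. */
struct summary_state {
	int expected;	/* plays the role of sum_stat_clients */
	int received;	/* plays the role of sum_stat_nr */
};

/*
 * Account for one per-client stats report. Returns true only when this
 * report completes the set, i.e. when the "All clients" summary would
 * be printed.
 */
static bool on_ts_report(struct summary_state *st)
{
	assert(st->expected >= 0);

	/* Mirrors the early return when there is at most one client. */
	if (st->expected <= 1)
		return false;

	return ++st->received == st->expected;
}

/* Feed 'reports' reports through the model; true if the summary fired. */
static bool summary_printed(int expected, int reports)
{
	struct summary_state st = { .expected = expected, .received = 0 };
	bool printed = false;

	for (int i = 0; i < reports; i++)
		if (on_ts_report(&st))
			printed = true;

	return printed;
}
```

In this model summary_printed(2, 2) is true and summary_printed(2, 1) is
false, so a single lost or dropped report is enough to suppress the
"All clients" output even though the expected count is still 2.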

> I cut down the job files to the smallest I could find to reliably
> reproduce the issue.  It seems we need to log a few items to
> reproduce, but the job runtime itself can be quite small.
> 
> 
> # cat test_job_a
> [test_job_a]
> description=test_job_a
> ioengine=libaio
> direct=1
> rw=randread
> iodepth=8
> bs=64k
> filename=/dev/nvme0n1
> group_reporting
> write_bw_log=test_job_a
> write_iops_log=test_job_a
> write_lat_log=test_job_a
> log_avg_msec=1000
> unified_rw_reporting=1
> disable_lat=0
> disable_clat=0
> disable_slat=0
> runtime=5s
> time_based
> 
> 
> # cat test_job_b
> [test_job_b]
> description=test_job_b
> ioengine=libaio
> direct=1
> rw=randread
> iodepth=8
> bs=64k
> filename=/dev/nvme0n1
> group_reporting
> write_bw_log=test_job_b
> write_iops_log=test_job_b
> write_lat_log=test_job_b
> log_avg_msec=1000
> unified_rw_reporting=1
> disable_lat=0
> disable_clat=0
> disable_slat=0
> runtime=5s
> time_based
> 
> 
> # fio --client=host1 test_job_a --client=host2 test_job_b --output=test_job

I've run 2x100 iterations of this, and I haven't been able to reproduce
the issue so far. I'll try and dig some more, but any extra info or
debug output you may have would be very useful.

-- 
Jens Axboe
