Revisiting this issue. It seems the call stack is:

    fio_handle_clients()
      fio_handle_client()
        case FIO_NET_CMD_TS:
          ops->thread_status(client, cmd);   /* .thread_status = handle_ts */

    static void handle_ts(struct fio_client *client, struct fio_net_cmd *cmd)
    {
        struct cmd_ts_pdu *p = (struct cmd_ts_pdu *) cmd->payload;
        struct flist_head *opt_list = NULL;
        struct json_object *tsobj;

        if (client->opt_lists && p->ts.thread_number <= client->jobs)
            opt_list = &client->opt_lists[p->ts.thread_number - 1];

        tsobj = show_thread_status(&p->ts, &p->rs, opt_list, NULL);
        client->did_stat = true;
        if (tsobj) {
            json_object_add_client_info(tsobj, client);
            json_array_add_value_object(clients_array, tsobj);
        }

        if (sum_stat_clients <= 1)
            return;

        sum_thread_stats(&client_ts, &p->ts, sum_stat_nr == 1);
        sum_group_stats(&client_gs, &p->rs);

        client_ts.members++;
        client_ts.thread_number = p->ts.thread_number;
        client_ts.groupid = p->ts.groupid;
        client_ts.unified_rw_rep = p->ts.unified_rw_rep;
        client_ts.sig_figs = p->ts.sig_figs;

        if (++sum_stat_nr == sum_stat_clients) {
            strcpy(client_ts.name, "All clients");
            tsobj = show_thread_status(&client_ts, &client_gs, NULL, NULL);
            if (tsobj) {
                json_object_add_client_info(tsobj, client);
                json_array_add_value_object(clients_array, tsobj);
            }
        }
    }

When sum_stat_clients <= 1, we never print the "All clients" summary. Actually, we miss an entire client: neither that client's individual summary nor the "All clients" summary is output. It seems one client finishes slightly before the other, but we remove it from the list of clients too quickly. I tried adjusting the timeout and such, but that didn't completely remove the issue. Any specific thoughts?

I cut the job files down to the smallest I could find that reliably reproduces the issue. It seems we need to log a few items to reproduce it, but the job runtime itself can be quite small.
# cat test_job_a
[test_job_a]
description=test_job_a
ioengine=libaio
direct=1
rw=randread
iodepth=8
bs=64k
filename=/dev/nvme0n1
group_reporting
write_bw_log=test_job_a
write_iops_log=test_job_a
write_lat_log=test_job_a
log_avg_msec=1000
unified_rw_reporting=1
disable_lat=0
disable_clat=0
disable_slat=0
runtime=5s
time_based

# cat test_job_b
[test_job_b]
description=test_job_b
ioengine=libaio
direct=1
rw=randread
iodepth=8
bs=64k
filename=/dev/nvme0n1
group_reporting
write_bw_log=test_job_b
write_iops_log=test_job_b
write_lat_log=test_job_b
log_avg_msec=1000
unified_rw_reporting=1
disable_lat=0
disable_clat=0
disable_slat=0
runtime=5s
time_based

# fio --client=host1 test_job_a --client=host2 test_job_b --output=test_job

Regards,
Jeff

-----Original Message-----
From: fio-owner@xxxxxxxxxxxxxxx [mailto:fio-owner@xxxxxxxxxxxxxxx] On Behalf Of Jeff Furlong
Sent: Thursday, February 15, 2018 4:33 PM
To: fio@xxxxxxxxxxxxxxx
Subject: fio server/client disconnect bug

Hi All,

With the near-latest fio version (fio-3.3-51-gf2cd) using server/client mode, I occasionally see one client get disconnected early, so that client's IO summary is not reported at the end of the job. The issue seems to occur once every 100 to 300 iterations of a job file. Worse, the conditions for the early disconnect may be tied to the complexity of the job file (such as numjobs or the IOPS of the device). Even worse, when using debug mode, the failure becomes rarer, around one in 1000 iterations of a job file. When running with:

fio --client=host1 test_job_a --client=host2 test_job_b --debug=process,net

...
net 5027 client: handle host2
net 5027 client: got cmd op IOLOG from host2 (pdu=446)
net 5027 client: handle host2
client: host=host2 disconnected
net 5027 client: removed <host2>
net 5027 client: request eta (1)
net 5027 client: requested eta tag 0x1b52c20
net 5027 client: handle host1
net 5027 client: got cmd op TEXT from host1 (pdu=85)
<host1> net 5028 server: got op [SEND_ETA], pdu=0, tag=1b4a970
net 5027 client: handle host1
net 5027 client: got cmd op TEXT from host1 (pdu=61)
<host1> net 5028 server sending status
...

Then the normal output summary shows only the statistics from host1. The host2 throughput, latency, etc. are never displayed. However, the iops, bw, and lat logs all seem to have been updated properly. Sometimes host2 is disconnected early; sometimes host1 is.

Why might host2 be disconnected? I see disconnects both when going through a switch with one hop and when connecting host1 directly to host2, so dropped network packets seem unlikely. Could the ETA update be inaccurate? Or is it possible host2 finishes the job faster than host1 and closes the connection too early? If host2's job file does finish early, should we still summarize its IO traffic in the output? How is that condition handled? I would appreciate suggestions on other debug options. Thanks.

Regards,
Jeff