No client ends in error. In fact, I get a full set of iops/lat/bw logs from both clients. And inspection of those logs looks good; I see the last 1000ms timestamp is valid for the duration of runtime. But only one client prints the summary info to the output file.

Regards,
Jeff

-----Original Message-----
From: Jens Axboe [mailto:axboe@xxxxxxxxx]
Sent: Wednesday, March 21, 2018 9:28 AM
To: Jeff Furlong <jeff.furlong@xxxxxxx>; fio@xxxxxxxxxxxxxxx
Subject: Re: fio server/client disconnect bug

On 3/21/18 10:19 AM, Jens Axboe wrote:
> On 3/21/18 9:19 AM, Jens Axboe wrote:
>> On 3/20/18 6:16 PM, Jeff Furlong wrote:
>>> Revisiting this issue. It seems the call stack is:
>>>
>>> fio_handle_clients()
>>>   fio_handle_client()
>>>     case FIO_NET_CMD_TS:
>>>       ops->thread_status(client, cmd);
>>>         .thread_status = handle_ts
>>> static void handle_ts(struct fio_client *client, struct fio_net_cmd *cmd)
>>> {
>>> 	struct cmd_ts_pdu *p = (struct cmd_ts_pdu *) cmd->payload;
>>> 	struct flist_head *opt_list = NULL;
>>> 	struct json_object *tsobj;
>>>
>>> 	if (client->opt_lists && p->ts.thread_number <= client->jobs)
>>> 		opt_list =
>>> 			&client->opt_lists[p->ts.thread_number - 1];
>>>
>>> 	tsobj = show_thread_status(&p->ts, &p->rs, opt_list, NULL);
>>> 	client->did_stat = true;
>>> 	if (tsobj) {
>>> 		json_object_add_client_info(tsobj, client);
>>> 		json_array_add_value_object(clients_array, tsobj);
>>> 	}
>>>
>>> 	if (sum_stat_clients <= 1)
>>> 		return;
>>>
>>> 	sum_thread_stats(&client_ts, &p->ts, sum_stat_nr == 1);
>>> 	sum_group_stats(&client_gs, &p->rs);
>>>
>>> 	client_ts.members++;
>>> 	client_ts.thread_number = p->ts.thread_number;
>>> 	client_ts.groupid = p->ts.groupid;
>>> 	client_ts.unified_rw_rep = p->ts.unified_rw_rep;
>>> 	client_ts.sig_figs = p->ts.sig_figs;
>>>
>>> 	if (++sum_stat_nr == sum_stat_clients) {
>>> 		strcpy(client_ts.name, "All clients");
>>> 		tsobj = show_thread_status(&client_ts, &client_gs, NULL, NULL);
>>> 		if (tsobj) {
>>> 			json_object_add_client_info(tsobj, client);
>>> 			json_array_add_value_object(clients_array, tsobj);
>>> 		}
>>> 	}
>>> }
>>>
>>> And when sum_stat_clients <= 1, we never print "All clients" summary.
>>> Actually, we miss an entire client, so neither the individual client
>>> summary is output nor the "all clients" summary is output. It seems
>>> one client finishes just slightly before the other but we remove
>>> from the list of clients too quickly. I tried adjusting the timeout
>>> and such, but didn't completely remove the issue. Any specific thoughts?
>>
>> sum_stat_clients is set when we start everything up, so that should
>> always be '2' for your case. So I'm a little puzzled as to what is
>> going on here. Do any of the jobs ever end in error, and that's why
>> we are missing a report from one of the jobs? Or are you referring to
>> timing on receiving the stats output, somehow racing with each other
>> and we're missing one of them? The latter could result in displaying
>> just one output, and never getting ++sum_stat_nr == 2 and displaying
>> the "All clients" output.
>
> Does the below patch change anything for you? I forgot that we get
> multiple starts (one from each client, of course), which means that we
> really should protect the inc from there.

I don't think that's it, we serially handle the clients, so there
should be no room for a race there.

Hmm, it's basically back to my theory where we put a client that
hasn't done stats yet. That way we can miss doing the all clients
display, since that condition will never be met. But I don't see how
that could happen, since I'm assuming that both of your hosts always
run to completion without error?

--
Jens Axboe