RE: fio server/client disconnect bug

Jeff Furlong <jeff.furlong@xxxxxxx> · Wed, 21 Mar 2018 16:35:21 +0000

No client ends in error.  In fact, I get a full set of iops/lat/bw logs from both clients.  And inspection of those logs looks good; I see the last 1000ms timestamp is valid for the duration of runtime.  But only one client prints the summary info to the output file.

Regards,
Jeff

-----Original Message-----
From: Jens Axboe [mailto:axboe@xxxxxxxxx] 
Sent: Wednesday, March 21, 2018 9:28 AM
To: Jeff Furlong <jeff.furlong@xxxxxxx>; fio@xxxxxxxxxxxxxxx
Subject: Re: fio server/client disconnect bug

On 3/21/18 10:19 AM, Jens Axboe wrote:
> On 3/21/18 9:19 AM, Jens Axboe wrote:
>> On 3/20/18 6:16 PM, Jeff Furlong wrote:
>>> Revisiting this issue.  It seems the call stack is:
>>>
>>> fio_handle_clients()
>>>     fio_handle_client()
>>>         case FIO_NET_CMD_TS:
>>>             ops->thread_status(client, cmd);
>>>             .thread_status    = handle_ts
>>>                 static void handle_ts(struct fio_client *client, struct fio_net_cmd *cmd)
>>>                 {
>>>                     struct cmd_ts_pdu *p = (struct cmd_ts_pdu *) cmd->payload;
>>>                     struct flist_head *opt_list = NULL;
>>>                     struct json_object *tsobj;
>>>
>>>                     if (client->opt_lists && p->ts.thread_number <= client->jobs)
>>>                         opt_list = &client->opt_lists[p->ts.thread_number - 1];
>>>
>>>                     tsobj = show_thread_status(&p->ts, &p->rs, opt_list, NULL);
>>>                     client->did_stat = true;
>>>                     if (tsobj) {
>>>                         json_object_add_client_info(tsobj, client);
>>>                         json_array_add_value_object(clients_array, tsobj);
>>>                     }
>>>
>>>                     if (sum_stat_clients <= 1)
>>>                         return;
>>>
>>>                     sum_thread_stats(&client_ts, &p->ts, sum_stat_nr == 1);
>>>                     sum_group_stats(&client_gs, &p->rs);
>>>
>>>                     client_ts.members++;
>>>                     client_ts.thread_number = p->ts.thread_number;
>>>                     client_ts.groupid = p->ts.groupid;
>>>                     client_ts.unified_rw_rep = p->ts.unified_rw_rep;
>>>                     client_ts.sig_figs = p->ts.sig_figs;
>>>
>>>                     if (++sum_stat_nr == sum_stat_clients) {
>>>                         strcpy(client_ts.name, "All clients");
>>>                         tsobj = show_thread_status(&client_ts, &client_gs, NULL, NULL);
>>>                         if (tsobj) {
>>>                             json_object_add_client_info(tsobj, client);
>>>                             json_array_add_value_object(clients_array, tsobj);
>>>                         }
>>>                     }
>>>                 }
>>>
>>> And when sum_stat_clients <= 1, we never print the "All clients" summary.
>>> Actually, we miss an entire client, so neither that client's individual
>>> summary nor the "All clients" summary is output.  It seems one client
>>> finishes just slightly before the other, but we remove it from the list
>>> of clients too quickly.  I tried adjusting the timeout and such, but
>>> that didn't completely remove the issue.  Any specific thoughts?
>>
>> sum_stat_clients is set when we start everything up, so that should 
>> always be '2' for your case. So I'm a little puzzled as to what is 
>> going on here. Do any of the jobs ever end in error, and that's why 
>> we are missing a report from one of the jobs? Or are you referring to 
>> timing on receiving the stats output, somehow racing with each other 
>> and we're missing one of them? The latter could result in displaying 
>> just one output, and never getting ++sum_stat_nr == 2 and displaying 
>> the "All clients" output.
> 
> Does the below patch change anything for you? I forgot that we get 
> multiple starts (one from each client, of course), which means that we 
> really should protect the inc from there.

I don't think that's it; we handle the clients serially, so there should be no room for a race there. Hmm, it's basically back to my theory where we put a client that hasn't done stats yet. That way we can miss doing the "All clients" display, since the ++sum_stat_nr == sum_stat_clients condition will never be met. But I don't see how that could happen, since I'm assuming that both of your hosts always run to completion without error?

--
Jens Axboe
