On 12/22/2014 07:42 PM, Gregory Farnum wrote:
> On Mon, Dec 22, 2014 at 10:30 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>
>> For example, two ops:
>>
>> #1:
>>
>> { "description": "osd_sub_op(client.2433432.0:61603164 20.424
>>       19038c24\/rbd_data.d7c912ae8944a.00000000000008b6\/head\/\/20 [] v
>>       63283'8301089 snapset=0=[]:[] snapc=0=[])",
>>   "received_at": "2014-12-22 19:26:37.458680",
>>   "age": "2.719850",
>>   "duration": "2.520937",
>>   "type_data": [
>>       "commit sent; apply or cleanup",
>>       [
>>           { "time": "2014-12-22 19:26:37.458914",
>>             "event": "waiting_for_osdmap"},
>>           { "time": "2014-12-22 19:26:39.310569",
>>             "event": "reached_pg"},
>>           { "time": "2014-12-22 19:26:39.310728",
>>             "event": "started"},
>>           { "time": "2014-12-22 19:26:39.310951",
>>             "event": "started"},
>>           { "time": "2014-12-22 19:26:39.979292",
>>             "event": "commit_queued_for_journal_write"},
>>           { "time": "2014-12-22 19:26:39.979348",
>>             "event": "write_thread_in_journal_buffer"},
>>           { "time": "2014-12-22 19:26:39.979594",
>>             "event": "journaled_completion_queued"},
>>           { "time": "2014-12-22 19:26:39.979617",
>>             "event": "commit_sent"}]]},
>>
>> #2:
>>
>> { "description": "osd_sub_op(client.2188703.0:10420738 20.641
>>       6673ee41\/rbd_data.9497e32794ff7.0000000000000454\/head\/\/20 [] v
>>       63283'5215076 snapset=0=[]:[] snapc=0=[])",
>>   "received_at": "2014-12-22 19:26:38.040551",
>>   "age": "2.137979",
>>   "duration": "1.537128",
>>   "type_data": [
>>       "started",
>>       [
>>           { "time": "2014-12-22 19:26:38.040717",
>>             "event": "waiting_for_osdmap"},
>>           { "time": "2014-12-22 19:26:39.577609",
>>             "event": "reached_pg"},
>>           { "time": "2014-12-22 19:26:39.577624",
>>             "event": "started"},
>>           { "time": "2014-12-22 19:26:39.577679",
>>             "event": "started"}]]},
>
> Oh, yep, in Firefly it's stuck in the waiting_for_osdmap state while
> it's in the PG work queue as well. Whoops...
> So this is probably just general slowness filling up the work queue.
>

Ah, ok. Clear.

>> Can this be something which has to do with the amount of RBD snapshots?
>> Since I see snapc involved in both ops?
>
> It could conceivably have something to do with snapshots, but if it
> does the presence of "snapc" isn't an indicator; that's always present
> and is outputting the default. :)
>
> If you're seeing disks at 100% I think stuff's just getting a little
> backed up. You could also check the distribution of incoming
> operations across PGs; if e.g. a flood of ops are going to one object
> that could also cause issues.
>

Yes, I'll do that. It's only weird that ops sometimes get stuck for over
30 seconds. The disks seem to randomly spike to 100%.

It was indeed an indicator that the disks are overloaded, but I just
wanted to verify it.

> -Greg
>

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
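
[Editor's sketch, not part of the thread] Greg's suggestion above is to check
how the incoming operations are distributed across PGs. A minimal way to do
that is to tally ops per PG from the admin socket dump, assuming the JSON
comes from something like

    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight > ops.json

(the OSD number is only an example), that the dump carries an "ops" list of
entries like the ones quoted above, and that the descriptions are osd_sub_op
lines with a PG id token such as "20.424". The script and its regex are
illustrative, not an official tool:

    #!/usr/bin/env python
    # Rough sketch: tally in-flight ops per PG from a dump_ops_in_flight JSON
    # dump, to see whether a few PGs (and so perhaps one hot object) carry
    # most of the load, or whether ops are spread evenly and the disks are
    # simply saturated.
    import json
    import re
    import sys
    from collections import Counter

    # Pull the PG id out of descriptions shaped like the osd_sub_op lines
    # quoted above: "osd_sub_op(client.X.0:Y 20.424 ...". Other op types are
    # counted under "other" rather than guessed at.
    SUB_OP_RE = re.compile(r'^osd_sub_op\(\S+ (\d+\.[0-9a-f]+) ')

    with open(sys.argv[1]) as f:
        dump = json.load(f)

    per_pg = Counter()
    for op in dump.get("ops", []):
        m = SUB_OP_RE.match(op.get("description", ""))
        per_pg[m.group(1) if m else "other"] += 1

    # Busiest PGs first; one PG dominating the list points at a hot object
    # rather than at generally overloaded disks.
    for pg, count in per_pg.most_common(20):
        print("%6d  %s" % (count, pg))

The same idea works against dump_historic_ops output to look at ops that
already completed slowly rather than ones currently in flight, though the
exact key names in that dump may differ between releases.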