On 12/22/2014 07:42 PM, Gregory Farnum wrote:
> On Mon, Dec 22, 2014 at 10:30 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>
>> For example, two ops:
>>
>> #1:
>>
>> { "description": "osd_sub_op(client.2433432.0:61603164 20.424
>>       19038c24\/rbd_data.d7c912ae8944a.00000000000008b6\/head\/\/20 [] v
>>       63283'8301089 snapset=0=[]:[] snapc=0=[])",
>>   "received_at": "2014-12-22 19:26:37.458680",
>>   "age": "2.719850",
>>   "duration": "2.520937",
>>   "type_data": [
>>       "commit sent; apply or cleanup",
>>       [
>>           { "time": "2014-12-22 19:26:37.458914",
>>             "event": "waiting_for_osdmap"},
>>           { "time": "2014-12-22 19:26:39.310569",
>>             "event": "reached_pg"},
>>           { "time": "2014-12-22 19:26:39.310728",
>>             "event": "started"},
>>           { "time": "2014-12-22 19:26:39.310951",
>>             "event": "started"},
>>           { "time": "2014-12-22 19:26:39.979292",
>>             "event": "commit_queued_for_journal_write"},
>>           { "time": "2014-12-22 19:26:39.979348",
>>             "event": "write_thread_in_journal_buffer"},
>>           { "time": "2014-12-22 19:26:39.979594",
>>             "event": "journaled_completion_queued"},
>>           { "time": "2014-12-22 19:26:39.979617",
>>             "event": "commit_sent"}]]},
>>
>> #2:
>>
>> { "description": "osd_sub_op(client.2188703.0:10420738 20.641
>>       6673ee41\/rbd_data.9497e32794ff7.0000000000000454\/head\/\/20 [] v
>>       63283'5215076 snapset=0=[]:[] snapc=0=[])",
>>   "received_at": "2014-12-22 19:26:38.040551",
>>   "age": "2.137979",
>>   "duration": "1.537128",
>>   "type_data": [
>>       "started",
>>       [
>>           { "time": "2014-12-22 19:26:38.040717",
>>             "event": "waiting_for_osdmap"},
>>           { "time": "2014-12-22 19:26:39.577609",
>>             "event": "reached_pg"},
>>           { "time": "2014-12-22 19:26:39.577624",
>>             "event": "started"},
>>           { "time": "2014-12-22 19:26:39.577679",
>>             "event": "started"}]]},
>
> Oh, yep, in Firefly it's stuck in the waiting_for_osdmap state while
> it's in the PG work queue as well. Whoops...
> So this is probably just general slowness filling up the work queue.
>

Ah, ok. Clear.

>> Can this be something which has to do with the amount of RBD snapshots?
>> Since I see snapc involved in both ops?
>
> It could conceivably have something to do with snapshots, but if it
> does the presence of "snapc" isn't an indicator; that's always present
> and is outputting the default. :)
>
> If you're seeing disks at 100% I think stuff's just getting a little
> backed up. You could also check the distribution of incoming
> operations across PGs; if e.g. a flood of ops are going to one object
> that could also cause issues.
>

Yes, I'll do that. It's only weird that ops sometimes get stuck for over
30 seconds. The disks seem to randomly spike to 100%.

It was indeed an indicator that the disks are overloaded, but I just
wanted to verify it.

> -Greg
>

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
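
[Editor's sketch, not part of the thread] Greg's suggestion above is to check
how the incoming operations are distributed across PGs. A minimal way to do
that is to tally ops per PG from the admin socket dump, assuming the JSON
comes from something like

    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight > ops.json

(the OSD number is only an example), that the dump carries an "ops" list of
entries like the ones quoted above, and that the descriptions are osd_sub_op
lines with a PG id token such as "20.424". The script and its regex are
illustrative, not an official tool:

    #!/usr/bin/env python
    # Rough sketch: tally in-flight ops per PG from a dump_ops_in_flight JSON
    # dump, to see whether a few PGs (and so perhaps one hot object) carry
    # most of the load, or whether ops are spread evenly and the disks are
    # simply saturated.
    import json
    import re
    import sys
    from collections import Counter

    # Pull the PG id out of descriptions shaped like the osd_sub_op lines
    # quoted above: "osd_sub_op(client.X.0:Y 20.424 ...". Other op types are
    # counted under "other" rather than guessed at.
    SUB_OP_RE = re.compile(r'^osd_sub_op\(\S+ (\d+\.[0-9a-f]+) ')

    with open(sys.argv[1]) as f:
        dump = json.load(f)

    per_pg = Counter()
    for op in dump.get("ops", []):
        m = SUB_OP_RE.match(op.get("description", ""))
        per_pg[m.group(1) if m else "other"] += 1

    # Busiest PGs first; one PG dominating the list points at a hot object
    # rather than at generally overloaded disks.
    for pg, count in per_pg.most_common(20):
        print("%6d  %s" % (count, pg))

The same idea works against dump_historic_ops output to look at ops that
already completed slowly rather than ones currently in flight, though the
exact key names in that dump may differ between releases.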