On Mon, Dec 22, 2014 at 10:30 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> For example, two ops:
>
> #1:
>
> { "description": "osd_sub_op(client.2433432.0:61603164 20.424 19038c24\/rbd_data.d7c912ae8944a.00000000000008b6\/head\/\/20 [] v 63283'8301089 snapset=0=[]:[] snapc=0=[])",
>   "received_at": "2014-12-22 19:26:37.458680",
>   "age": "2.719850",
>   "duration": "2.520937",
>   "type_data": [
>     "commit sent; apply or cleanup",
>     [
>       { "time": "2014-12-22 19:26:37.458914",
>         "event": "waiting_for_osdmap"},
>       { "time": "2014-12-22 19:26:39.310569",
>         "event": "reached_pg"},
>       { "time": "2014-12-22 19:26:39.310728",
>         "event": "started"},
>       { "time": "2014-12-22 19:26:39.310951",
>         "event": "started"},
>       { "time": "2014-12-22 19:26:39.979292",
>         "event": "commit_queued_for_journal_write"},
>       { "time": "2014-12-22 19:26:39.979348",
>         "event": "write_thread_in_journal_buffer"},
>       { "time": "2014-12-22 19:26:39.979594",
>         "event": "journaled_completion_queued"},
>       { "time": "2014-12-22 19:26:39.979617",
>         "event": "commit_sent"}]]},
>
> #2:
>
> { "description": "osd_sub_op(client.2188703.0:10420738 20.641 6673ee41\/rbd_data.9497e32794ff7.0000000000000454\/head\/\/20 [] v 63283'5215076 snapset=0=[]:[] snapc=0=[])",
>   "received_at": "2014-12-22 19:26:38.040551",
>   "age": "2.137979",
>   "duration": "1.537128",
>   "type_data": [
>     "started",
>     [
>       { "time": "2014-12-22 19:26:38.040717",
>         "event": "waiting_for_osdmap"},
>       { "time": "2014-12-22 19:26:39.577609",
>         "event": "reached_pg"},
>       { "time": "2014-12-22 19:26:39.577624",
>         "event": "started"},
>       { "time": "2014-12-22 19:26:39.577679",
>         "event": "started"}]]},

Oh, yep, in Firefly it's stuck in the waiting_for_osdmap state while it's
in the PG work queue as well. Whoops... So this is probably just general
slowness filling up the work queue.

> Can this be something which has to do with the amount of RBD snapshots?
> Since I see snapc involved in both ops?

It could conceivably have something to do with snapshots, but if it does,
the presence of "snapc" isn't an indicator; that field is always present
and here it's just showing the default (empty) snap context. :)

If you're seeing disks at 100% I think stuff's just getting a little
backed up. You could also check the distribution of incoming operations
across PGs; if, e.g., a flood of ops is going to one object, that could
also cause issues.
-Greg
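
A minimal sketch of the per-PG tally suggested above, run on the host where
the OSD lives, against its admin socket. It assumes the dump layout shown in
the quoted ops (an "ops"/"Ops" list whose "description" field carries the PG
id as the second token inside the parentheses, e.g. "20.424"); key names and
the description format vary between Ceph releases, and the helper name
ops_per_pg is only for illustration.

    #!/usr/bin/env python
    # Sketch: tally historic/in-flight ops per PG from an OSD admin socket dump.
    # Assumes the JSON layout shown in the quoted ops above; adjust the key
    # names ("ops" vs "Ops") and parsing for your Ceph release.
    import collections
    import json
    import subprocess

    def ops_per_pg(osd_id, cmd="dump_historic_ops"):
        # "dump_ops_in_flight" also works here for currently queued ops.
        out = subprocess.check_output(["ceph", "daemon", "osd.%s" % osd_id, cmd])
        dump = json.loads(out.decode("utf-8"))
        ops = dump.get("ops") or dump.get("Ops") or []
        counts = collections.Counter()
        for op in ops:
            # e.g. "osd_sub_op(client.2433432.0:61603164 20.424 19038c24/...)"
            tokens = op.get("description", "").split("(", 1)[-1].split()
            if len(tokens) >= 2:
                counts[tokens[1]] += 1
        return counts

    if __name__ == "__main__":
        # Show the ten busiest PGs for osd.0.
        for pg, n in ops_per_pg(0).most_common(10):
            print("%s %d" % (pg, n))

If one PG (or one object within it) dominates the counts, that points at a
hot spot rather than general disk saturation.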