Hi, Rongze This is mainly because of too many Ops waiting for process in op_wq, but the backend cannot process the Ops as fast as the Ops reach op_wq. It is simple to illustrate this. queue: {op10, op9, op8, op7, op6, op5, op4, op3, op2, op1} Under high pressure, client send requests so fast so that osd cannot process the arrived request as the same speed. Thus, osd thread, dequeue op1 to process for 10ms, for op2, process 10ms. Therefore, when op10 is processed by the osd thread, the dequeue_op() routine, it already wait for 10*10ms = 100ms. So assume that 1000 ops in the queue, you will see the last several ops in the queue(that says, op1000 ~ op900) wait for about 1s in dump_historic_ops. So here, I assume that you set client_message_throttle params enough large to let the client enough Ops reach osd at the same time. Otherwise, it will throttle the total in-flight-ops so that this would not be occurs. We encounter this situation under two cases: 1) High load parallel requests comes from client 2) Or High miss rate for cache tier Regards Ning Yao 2015-12-09 1:56 GMT+08:00 Rongze Zhu <zrzhit@xxxxxxxxx>: > > Hi guys, > > I found out a strange issue in a ceph cluster. This ceph cluster is used > for OpenStack cluster, there are 70 VMs in the cluster. There are many > slow request(osd_op_complaint_time = 3) in the ceph cache tier with high > load( more than 3000 ops, 90% write). > > After the investigation, I found that sometimes the interval time between > waiting_for_osdmap event and reached_pg event in the op will be very > long(may be 10 ~ 100 seconds). This means that some ops will stay op_wq > queue very long, until the osd op threads process it. > > What reasons can cause this issue? Before anyone encountered this > situation? I think that osdmap,pg lock,object lock can't block osd op > thread so long(10s ~ 100 seconds). > > > hardware(total 11 nodes): > - 6 x nodes(compute & storage): > - 3 x nodes: HUAWEI RH2288H V2 server, 2xE5-2658, 256GB, > - 1 x HUAWEI SSD disk 800GB, used for ceph osd > - 12 x 1.2TB SAS disk(each raid0 has 4 sas disk), used for ceph > - 3 x nodes: HP DL380 Gen9 server, 2xE5-2640, 128GB, > - 2 x OCZ SSD disk 800GB, used for ceph > - 12 x 1.2TB SAS disk(each raid0 has 4 sas disk), used for ceph > - 5 x nodes (compute) > > os: > - HUAWEI server: RHEL 6.4, 2.6.32-358.123.2.openstack > - HP DL380 Gen9 server: RHEL 6.5, 2.6.32-431.el6.x86_64 > > ceph: > - ceph 0.80.10 > - 9 ssd osd for cache tier > - 18 raid0 osd for storage tier > - replication size = 2 > - pg num = 1024 > - osd op thread = 20 > - filestore op thread = 20 > > dump_historic_ops(osd.4 ssd): > > { "description": "osd_op(client.983894.0:8044023 > rbd_data.468d085e74d0.00000000000031b9 [] 4.5605c210 ack+ondisk+write > e11409)", > "received_at": "2015-12-08 13:31:49.645295", > "age": "100.252129", > "duration": "25.419458", > "type_data": [ > "commit sent; apply or cleanup", > { "client": "client.983894", > "tid": 8044023}, > [ > { "time": "2015-12-08 13:31:49.645616", > "event": "waiting_for_osdmap"}, > { "time": "2015-12-08 13:32:01.529474", > "event": "reached_pg"}, > { "time": "2015-12-08 13:32:01.529557", > "event": "started"}, > { "time": "2015-12-08 13:32:01.529575", > "event": "started"}, > { "time": "2015-12-08 13:32:01.529642", > "event": "waiting for subops from 25"}, > { "time": "2015-12-08 13:32:01.530340", > "event": "commit_queued_for_journal_write"}, > { "time": "2015-12-08 13:32:01.531270", > "event": "write_thread_in_journal_buffer"}, > { "time": "2015-12-08 13:32:01.533131", > "event": "journaled_completion_queued"}, > { "time": "2015-12-08 13:32:01.533412", > "event": "sub_op_commit_rec"}, > { "time": "2015-12-08 13:32:05.813736", > "event": "op_commit"}, > { "time": "2015-12-08 13:32:05.814035", > "event": "commit_sent"}, > { "time": "2015-12-08 13:32:15.064722", > "event": "op_applied"}, > { "time": "2015-12-08 13:32:15.064753", > "event": "done"}]]}]} > > > -- > Rongze Zhu > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html