On Wed, 17 Oct 2018, Ma, Jianpeng wrote: > Hi Sage, xingguo: > > Yes there is a race, see the following code: > > > ThreadA ThreadB > sdata->shard_lock.Lock(); > if (sdata->pqueue->empty() && > !(is_smallest_thread_index && !sdata->context_queue.empty())) { > > void queue(list<Context *>& ls) { > bool empty = false; > { > std::scoped_lock l(q_mutex); > if (q.empty()) { > q.swap(ls); > empty = true; > } else { > q.insert(q.end(), ls.begin(), ls.end()); > } > } > > if (empty) { > mutex.Lock(); > cond.Signal(); > mutex.Unlock(); > } > } > > > > > > sdata->sdata_wait_lock.Lock(); > if (!sdata->stop_waiting) { > > > I modified the code as following and think this can fix the above race. > > sdata->shard_lock.Lock(); > if (is_smallest_thread_index && sdata->pqueue->empty()) { > sdata->sdata_wait_lock.Lock(); //check context_queue must be locked by sdata_wait. > if (sdata->context_queue.empty()) { This looks right to me! > if (!sdata->stop_watiting) { > dout(20) << __func__ << " empty q, waiting" << dendl; > osd->cct->get_heartbeat_map()->clear_timeout(hb); > sdata->shard_lock.Unlock(); > sdata->sdata_cond.Wait(sdata->sdata_wait_lock); > sdata->sdata_wait_lock.Unlock(); > sdata->shard_lock.Lock(); > if (sdata->pqueue->empty() && sdata->context_queue.empty()) { > sdata->shard_lock.Unlock(); > return; > } > osd->cct->get_heartbeat_map()->reset_timeout(hb, > osd->cct->_conf->threadpool_default_timeout, 0); > } else { > dout(20) << __func__ << " need return immediately" << dendl; > sdata->sdata_wait_lock.Unlock(); > sdata->shard_lock.Unlock(); > return; > } > } else { > sdata->sdata_wait_lock.Unlock(); > } > > } else if (sdata->pqueue->empty()) { > ; > } > > BTY: why not remove sdata_wait_lock ? i think it safe to remove sdata_wait_lock. Or am I missing something? The mutex is what closes the race between the ContextQueue::queue and the worker thread. It ensures that we can verify the context_queue is empty and then wait without queue() adding an item between those two steps... sage > > Jianpeng > Thanks! > > > -----Original Message----- > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Ma, Jianpeng > Sent: Wednesday, October 17, 2018 9:59 AM > To: Sage Weil <sweil@xxxxxxxxxx> > Cc: ceph-devel@xxxxxxxxxxxxxxx > Subject: RE: ContextQueue vs op_wq race? > > Ok, I'll check this now. > > Thanks! > Jianpeng > > -----Original Message----- > From: Sage Weil [mailto:sweil@xxxxxxxxxx] > Sent: Wednesday, October 17, 2018 6:55 AM > To: Ma, Jianpeng <jianpeng.ma@xxxxxxxxx> > Cc: ceph-devel@xxxxxxxxxxxxxxx > Subject: ContextQueue vs op_wq race? > > Hi Jianpeng, > > Can you take a look at http://tracker.ceph.com/issues/36473 and see if you have any ideas? Does that possible race make sense? > > Thanks! > sage > >