> On Apr 21, 2018, at 1:54 AM, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>
> On Fri, Apr 20, 2018 at 7:03 AM, 陶冬冬 <tdd21151186@xxxxxxxxx> wrote:
>> Hi,
>> I have been following the client hang issue recently. Rishabh Dave has a PR to fix it (https://github.com/ceph/ceph/pull/21065).
>> The main idea of the PR is that the client tick checks whether any request has gone 120s without a reply.
>> If there is one, the client requests the latest osdmap from the monitor. Once the client gets the latest osdmap, the hanging requests are cleaned up.
>>
>> I know this can fix the hang, but it seems like a workaround to me: unmount will still have to wait for 120s, which I don't think is user friendly.
>> I would like to propose another solution: request the latest osdmap in the client's ms_handle_remote_reset.
>> Here is the eviction flow:
>> [1] The mds sends the evict command to the monitor.
>> [2] The mds then waits for the latest osdmap.
>> [3] The mds gets the latest osdmap and then kills the session of the evicted client, which is now on the blacklist.
>> [4] When the evicted client tries to communicate with the mds, it will certainly get a RESET SESSION reply.
>
> What you've described makes sense to me Dongdong. Rishabh, can you try
> testing a solution that gets the latest osdmap in
> ::ms_handle_remote_reset for case: MetaSession::STATE_OPENING.
> (Rishabh, we can drop [1] if this works as expected.)
>

Patrick, should we also test on MetaSession::STATE_OPEN?
Also, I don't see Client::handle_osd_map trying to clean up the requests waiting on the list "session->waiting_for_open".
I think this will also leave the client hanging in state MetaSession::STATE_OPENING even after it notices that it is on the blacklist.
So I think we should also call signal_context_list(session->waiting_for_open) in Client::handle_osd_map.

>> So, as long as the session is killed after the osdmap is updated in the monitor, ms_handle_remote_reset will always get the client a latest osdmap that at least contains the new blacklist.
>
> As long as the client keeps trying to reconnect (it does), the
> client will eventually get an osdmap that notes it's blacklisted. So
> the ordering is not necessarily important here.
>
> [1] https://github.com/ceph/ceph/pull/21065/commits/d3fb8c24c7df4fa7443bce71c1adf8cf84c6361e
>
> --
> Patrick Donnelly
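
To make the two proposed changes concrete, here is a rough, self-contained sketch of the idea. It is only a toy model and not the real Ceph client code: ToySession, ToyOsdMap, ToyClient, on_remote_reset, on_osd_map and kBlacklistedError are all hypothetical names invented for illustration, and the real Client, MetaSession and signal_context_list may well differ. It assumes that a remote reset should immediately trigger a request for the latest osdmap, and that a map showing the client on the blacklist should wake everything parked on waiting_for_open with an error.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <set>
#include <string>

// Hypothetical error value handed to waiters; not the code Ceph actually uses.
constexpr int kBlacklistedError = -1;

// Simplified stand-in for the session states named in the thread.
enum class SessionState { OPENING, OPEN };

// Toy session: callbacks parked until the session opens (or fails).
struct ToySession {
  SessionState state = SessionState::OPENING;
  std::list<std::function<void(int)>> waiting_for_open;
};

// Toy osdmap: only the blacklist matters for this sketch.
struct ToyOsdMap {
  std::set<std::string> blacklist;
};

struct ToyClient {
  std::string addr;
  ToySession session;
  bool osdmap_requested = false;  // stands in for "ask the monitor for the latest map"
  bool blacklisted = false;

  // Proposal, part 1: on a remote reset, request the latest osdmap right
  // away instead of waiting for the 120s slow-request check in tick.
  void on_remote_reset() {
    if (session.state == SessionState::OPENING ||
        session.state == SessionState::OPEN) {
      osdmap_requested = true;
    }
  }

  // Proposal, part 2: when a new osdmap shows this client on the blacklist,
  // also wake the requests parked on waiting_for_open so they do not hang.
  void on_osd_map(const ToyOsdMap& map) {
    osdmap_requested = false;
    if (map.blacklist.count(addr) == 0) {
      return;  // not blacklisted, nothing to clean up
    }
    blacklisted = true;
    // Swap the list out first so a callback cannot re-enter and mutate it.
    std::list<std::function<void(int)>> waiters;
    waiters.swap(session.waiting_for_open);
    for (auto& cb : waiters) {
      cb(kBlacklistedError);
    }
  }
};
```

With this model, a request parked on waiting_for_open while the session is still OPENING gets completed with an error as soon as the blacklisting osdmap arrives, rather than waiting for a timeout.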