On Wed, Mar 21, 2018 at 7:09 PM, Rishabh Dave <ridave@xxxxxxxxxx> wrote:
> Hi,
>
> I am trying to fix the bug - http://tracker.ceph.com/issues/10915.
> Patrick helped me to get started with it. I was able to reproduce it
> locally on vstart cluster and I am currently trying to fix it by
> getting the client unmounted on eviction. Once I could do this I would
> (as Patrick suggested) add a new option like
> "client_unmount_on_blacklist" and modify my code accordingly.
>
> The information about the blacklist seems to be available to the
> client code [1] but, as far as I can see, that line (i.e. the if-block
> containing it) never gets executed. MDS blacklists the client, evicts
> the session but client fails to notice that. I suppose this
> incongruence causes it to hang.
>
> The reason why the client fails to notice is that it never actually
> looks at the blacklist after the session is evicted --
> handle_osd_map() never gets called after MDSRank::evict_session() is
> called. I did write a patch that would make the client check its
> address in the blacklist by calling a (new) function in
> ms_handle_reset() but it did not help. Looks like not only the client
> doesn't check the blacklist but also even if it were to, it would find
> an outdated version.
>
> To verify this, I wrote some debug code to iterate and display the
> blacklist towards the end of and after MDSRank::evict_session(). The
> blacklist turned out to be empty in both locations. Shouldn't
> blacklist be updated at least in or right after
> MDSRank::evict_session() gets executed? I think before fixing client,
> I need to have some sort of fix somewhere here [2].

The client only gets osdmap updates when it tries to communicate with
an OSD, and the OSD tells it that its current map epoch is too old. In
the case that the client isn't doing any data operations (i.e. no osd
ops), then the client doesn't find out that it's blacklisted.
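
To make the epoch-driven update concrete, here is a toy Python model of
the behaviour described above. It is only a sketch, not Ceph code:
OSDMap, ToyOSD, ToyClient, do_osd_op() and demo() are hypothetical
names, and only handle_osd_map() borrows its name from the client
function mentioned in the quoted message. It assumes the client
refreshes its cached map solely as a side effect of sending an op to an
OSD that holds a newer epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OSDMap:
    """Toy stand-in for an osdmap: an epoch plus blacklisted addresses."""
    epoch: int
    blacklist: frozenset = frozenset()


class ToyOSD:
    """Hypothetical OSD that holds the current map."""

    def __init__(self, osdmap):
        self.osdmap = osdmap

    def handle_op(self, client_epoch):
        # Hand back the newer map only if the caller's epoch is behind.
        return self.osdmap if client_epoch < self.osdmap.epoch else None


class ToyClient:
    """Hypothetical client that caches a map and its blacklisted state."""

    def __init__(self, addr, osdmap):
        self.addr = addr
        self.osdmap = osdmap
        self.blacklisted = False

    def handle_osd_map(self, new_map):
        # The only place in this model where the client can see itself
        # in the blacklist.
        self.osdmap = new_map
        self.blacklisted = self.addr in new_map.blacklist

    def do_osd_op(self, osd):
        # Assumption: a map refresh happens only as a side effect of an op.
        newer = osd.handle_op(self.osdmap.epoch)
        if newer is not None:
            self.handle_osd_map(newer)


def demo():
    addr = "192.0.2.10:0/1234"  # made-up client address
    client = ToyClient(addr, OSDMap(epoch=1))
    osd = ToyOSD(OSDMap(epoch=1))

    # Eviction: a new epoch is published with the client blacklisted.
    osd.osdmap = OSDMap(epoch=2, blacklist=frozenset({addr}))

    idle = (client.osdmap.epoch, client.blacklisted)   # no OSD op sent yet
    client.do_osd_op(osd)                              # first op after eviction
    after = (client.osdmap.epoch, client.blacklisted)
    return idle, after
```

In this model an idle client keeps its stale epoch and never learns that
it has been blacklisted; the first OSD op after the eviction is what
brings in the new map.
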

But that's okay, because the client's awareness of its own
blacklisted-ness should only be needed in the case that there is some
dirty data that needs to be thrown away in the special if(blacklisted)
paths.

So if it's not hanging on any OSD operations (those operations would
have resulted in an updated osdmap), the question is: what is it
hanging on? Is it trying to open a new session with the MDS?

John

> And how can I get a stacktrace for commands like "bin/ceph tell mds.a
> client evict id=xxxx"?
>
> Also I have attached the patch containing modifications I have used so far.
>
> Thanks,
> Rishabh
>
> [1] https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L2420
> [2] https://github.com/ceph/ceph/blob/master/src/mds/MDSRank.cc#L2737