On Thu, Oct 18, 2018 at 1:35 PM Bryan Stillwell <bstillwell@xxxxxxxxxxx> wrote:
>
> Thanks Dan!
>
> It does look like we're hitting the ms_tcp_read_timeout. I changed it to 79 seconds and I've had a couple dumps that were hung for ~2m40s (2*ms_tcp_read_timeout) and one that was hung for 8 minutes (6*ms_tcp_read_timeout).
>
> I agree that 15 minutes (900s) is a long timeout. Anyone know the reasoning for that decision?

I think we picked it because it was long enough to be very sure that a connection wouldn't time out while it was waiting on some kind of slow response, but short enough that it would actually go away.

In general, we don't expect it to be an "important" value since connections shouldn't dangle unless one Ceph entity actually remains alive that whole time and stops needing to talk to an entity it was previously using, and establishing a connection takes a few round-trips but otherwise costs little. So, e.g., it's not uncommon for an rbd client to hit these disconnects if it stops using its disk for a while. But there's also very little cost to keeping the session around.

I wouldn't worry much about turning it down quite a bit, but if it's changing the behavior of ceph-mgr there's also a ceph-mgr bug that needs to be resolved. I presume John's link is more useful for that.

-Greg
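
As a rough sanity check of the observation quoted above (hangs lasting about 2x and 6x ms_tcp_read_timeout), a small sketch along these lines could test whether a measured hang sits near a whole multiple of the timeout. The function name, the 10% tolerance, and the hang durations converted to seconds (160 and 480) are assumptions for illustration; none of this comes from Ceph itself.

```python
def nearest_timeout_multiple(hang_seconds, timeout_seconds, tolerance=0.1):
    """Find the whole multiple of the timeout closest to an observed hang.

    Returns (k, matches): k is the nearest whole multiple, and matches is
    True when the hang lies within tolerance * timeout_seconds of
    k * timeout_seconds (and k is at least 1).
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    k = round(hang_seconds / timeout_seconds)
    deviation = abs(hang_seconds - k * timeout_seconds)
    return k, k >= 1 and deviation <= tolerance * timeout_seconds


if __name__ == "__main__":
    timeout = 79  # ms_tcp_read_timeout used in the thread, in seconds
    # Approximate hang durations from the thread, converted to seconds.
    for label, hang in (("~2m40s", 160), ("8 minutes", 480)):
        k, matches = nearest_timeout_multiple(hang, timeout)
        print(f"{label}: {hang}s is about {k} x {timeout}s (match: {matches})")
```

A hang that does not land near any whole multiple (say 120s with a 79s timeout) would come back with matches set to False, which would hint that something other than this timeout is involved.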