Re: mds laggy issue

Oh, I just found that the OSDOpReply message is handled by the fast dispatcher.
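
For readers following the thread, the difference between fast dispatch and queued dispatch can be sketched with a toy model. This is not Ceph code: `ToyDispatcher`, `Msg`, and the message type strings are hypothetical names chosen for illustration, and the real messenger is considerably more involved.

```cpp
#include <deque>
#include <string>
#include <vector>

// Toy model, not Ceph code: a message is either handled inline by the
// receiving thread ("fast dispatch") or queued for a dispatch thread.
struct Msg {
  std::string type;
};

struct ToyDispatcher {
  std::deque<Msg> mqueue;            // waits for the (possibly busy) dispatch thread
  std::vector<std::string> handled;  // order in which messages were handled

  // Hypothetical predicate: which message types bypass the queue.
  static bool can_fast_dispatch(const Msg& m) {
    return m.type == "osd_op_reply";
  }

  void deliver(const Msg& m) {
    if (can_fast_dispatch(m)) {
      handled.push_back(m.type);     // handled immediately, never queued
    } else {
      mqueue.push_back(m);           // must wait until drain() runs
    }
  }

  // Called by the dispatch thread whenever it is free.
  void drain() {
    while (!mqueue.empty()) {
      handled.push_back(mqueue.front().type);
      mqueue.pop_front();
    }
  }
};
```

In this model a fast-dispatched reply never sits in `mqueue`, so it cannot crowd out a queued beacon; the beacon is delayed only for as long as nobody calls `drain()`.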

> On Mar 29, 2018, at 10:17 AM, 陶冬冬 <tdd21151186@xxxxxxxxx> wrote:
> 
> Thank you, you reminded me that open_ino triggers reply messages from the OSDs,
> and if there are too many concurrent open_ino requests, the mqueue would stay occupied by the OSD reply messages.
> And does the MDSBeacon message have the lowest priority? If so, that would keep MDSBeacon at the bottom of the mqueue.
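
The starvation hypothesis above can be illustrated with a toy priority queue. This is only a sketch under stated assumptions: the priority values (1 and 10), the `QueuedMsg` type, and `beacon_wait()` are made up for illustration and do not reflect how Ceph's DispatchQueue actually assigns or orders priorities.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

struct QueuedMsg {
  int priority;      // higher value is dispatched first (hypothetical scale)
  std::size_t seq;   // arrival order, keeps FIFO order within one priority
  std::string type;
};

struct ByPriorityThenArrival {
  // Returns true when a should be dispatched after b.
  bool operator()(const QueuedMsg& a, const QueuedMsg& b) const {
    if (a.priority != b.priority) return a.priority < b.priority;
    return a.seq > b.seq;  // earlier arrival wins among equal priorities
  }
};

// Returns how many messages are dispatched before the beacon gets its turn.
std::size_t beacon_wait(std::size_t osd_replies) {
  std::priority_queue<QueuedMsg, std::vector<QueuedMsg>, ByPriorityThenArrival> q;
  std::size_t seq = 0;
  q.push({1, seq++, "beacon"});        // arrives first, but with low priority
  for (std::size_t i = 0; i < osd_replies; ++i) {
    q.push({10, seq++, "osd_reply"});  // arrive later, with higher priority
  }
  std::size_t dispatched = 0;
  while (!q.empty() && q.top().type != "beacon") {
    q.pop();
    ++dispatched;
  }
  return dispatched;
}
```

Under these assumptions the beacon's wait grows linearly with the number of higher-priority replies, even though it arrived first.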
> 
> Regards,
> Dongdong.
>> On Mar 29, 2018, at 9:37 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>> 
>> On Thu, Mar 29, 2018 at 9:09 AM, 陶冬冬 <tdd21151186@xxxxxxxxx> wrote:
>>> Patrick, we are using 32G of memory for the MDS.
>>> Zheng, calling mds->heartbeat_reset() could make the health check pass so
>>> that the monitor won't kick it out.
>>> What frustrates me more is the laggy issue: from the monitor log, I can
>>> actually see that the MDSBeacon messages are sent without delay,
>>> but the MDS only starts handling that MDSBeacon message about 50 seconds later.
>>> So I'm wondering whether the message could have stayed in the mqueue for
>>> that long (for example, if the previous message is an MDSMap with the rejoin
>>> state and that rejoin takes a long time).
>>> (Meanwhile, I will investigate this issue further.)
>>> 
>> 
>> If it's really caused by a long wait in the mqueue, we should limit the
>> number of concurrent open_ino operations started by
>> MDCache::process_imported_caps().
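
One possible shape for such a limit is sketched below. This is not the actual Ceph change: `InflightLimiter` is a hypothetical helper, and how it would be wired into MDCache::process_imported_caps() and the open_ino completion path is left open.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical helper, not part of Ceph: caps the number of in-flight
// asynchronous operations and parks the rest until a slot frees up.
class InflightLimiter {
 public:
  explicit InflightLimiter(std::size_t max_inflight) : max_(max_inflight) {}

  // Start the operation now if below the cap, otherwise defer it.
  void submit(std::function<void()> start_op) {
    if (inflight_ < max_) {
      ++inflight_;
      start_op();
    } else {
      pending_.push_back(std::move(start_op));
    }
  }

  // Call from the completion callback of each started operation.
  void on_complete() {
    if (!pending_.empty()) {
      auto next = std::move(pending_.front());
      pending_.pop_front();
      next();  // the freed slot is handed over, so inflight_ is unchanged
    } else if (inflight_ > 0) {
      --inflight_;
    }
  }

  std::size_t inflight() const { return inflight_; }
  std::size_t pending() const { return pending_.size(); }

 private:
  std::size_t max_;
  std::size_t inflight_ = 0;
  std::deque<std::function<void()>> pending_;
};
```

With a cap in place, at most `max_inflight` replies can be outstanding at any time, which bounds how many reply messages can pile up ahead of other traffic.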
>> 
>> 
>>> On Mar 29, 2018, at 8:10 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>> 
>>> On Wed, Mar 28, 2018 at 11:14 PM, 陶冬冬 <tdd21151186@xxxxxxxxx> wrote:
>>> 
>>> Hi Zheng & Patrick,
>>> 
>>> We are using v12.2.2.
>>> Recently we hit an MDS laggy issue (a significant one, about 50 seconds).
>>> I traced the monitor and MDS logs and found that the MMDSBeacon message
>>> had actually been sent to the MDS 50 seconds earlier.
>>> So it looks like the monitor isn't laggy. Worse, I also found
>>> that the MDS's health check failed, and eventually the monitor
>>> kicked out this MDS and made it respawn.
>>> By the way, this happened during the rejoin phase.
>>> 
>>> Here is my analysis:
>>> The MDS health check fails because the MDS tick thread cannot get
>>> the mds_lock during rejoin. (I found that rejoin has many missing inodes
>>> that need to be fetched.)
>>> This also makes the Dispatcher consume the mqueue of the DispatchQueue
>>> very slowly, so the MMDSBeacon in the mqueue is eventually dispatched after
>>> a long delay.
>>> 
>>> 
>>> How about calling mds->heartbeat_reset() in the loop that fetches inodes?
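
The effect of resetting the heartbeat inside a long loop can be shown with a small simulation. This is a toy model under stated assumptions: `ToyHeartbeat`, the 15-tick grace period, and `run_loop()` are hypothetical, time is a plain counter, and nothing here is Ceph's real heartbeat implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Toy stand-in for a heartbeat handle; not Ceph's real heartbeat code.
struct ToyHeartbeat {
  std::uint64_t last_reset = 0;
  std::uint64_t grace = 15;  // hypothetical grace period, in ticks
  void reset(std::uint64_t now) { last_reset = now; }
  bool healthy(std::uint64_t now) const { return now - last_reset <= grace; }
};

// Simulates a long loop in which each item costs cost_per_item ticks.
// Returns true if the heartbeat looked healthy at every check.
bool run_loop(std::size_t items, std::uint64_t cost_per_item, bool reset_in_loop) {
  ToyHeartbeat hb;
  std::uint64_t now = 0;
  bool always_healthy = true;
  for (std::size_t i = 0; i < items; ++i) {
    now += cost_per_item;                              // time spent on one inode
    always_healthy = always_healthy && hb.healthy(now);  // external health check
    if (reset_in_loop) hb.reset(now);                  // the suggested change
  }
  return always_healthy;
}
```

In this model the check passes as long as a single iteration stays within the grace period. Note that the reset only keeps the health check quiet; it does not shorten the loop or free the lock any sooner.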
>>> 
>>> 
>>> 
>>> I would like to know whether my analysis makes sense to you. If so, could
>>> we make MMDSBeacon fast dispatch?
>>> 
>>> Regards,
>>> Dongdong
>>> 
>>> 
> 
