Hello Ilya,

On Mon, 29 Jul 2019 at 16:42, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Fri, Jul 26, 2019 at 11:23 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> >
> > Some additional information are provided as below:
> >
> > I tried to restart the active MDS, and after the standby MDS took
> > over, there is no client session recorded in the output of `ceph
> > daemon mds.xxx session ls`. When I restarted the OSD.13 daemon, the
> > stuck write op finished immediately. Thanks.
>
> So it happened again with the same OSD? Did you see this with other
> OSDs?

Yes. In my previous experience the issue always happened on the same
OSD. However, from the CephFS kernel client's point of view, it has
also happened with another OSD on another node.

>
> Try enabling some logging on osd.13 since this seems to be a recurring
> issue. At least "debug ms = 1" so we can see whether it ever sends the
> reply to the original op (i.e. prior to restart).

Got it, I will raise the debug level to retrieve more logs for further
investigation (the commands I plan to use are sketched at the bottom of
this mail).

>
> Also, take note of the epoch in osdc output:
>
>   36  osd13  ... e327 ...
>
> Does "ceph osd dump" show the same epoch when things are stuck?
>

Unfortunately, that environment is gone. But judging from the logs
captured earlier, the epoch seems to be consistent between the client
and the cluster when things are stuck, right?

2019-07-26 12:24:08.475 7f06efebc700  0 log_channel(cluster) log [DBG] : osdmap e306: 15 total, 15 up, 15 in

BTW, the logs of OSD.13 and the dynamic debug kernel logs of libceph
captured on the stuck node are available at
https://drive.google.com/drive/folders/1gYksDbCecisWtP05HEoSxevDK8sywKv6?usp=sharing.
I deeply appreciate your kind help!

- Jerry

> Thanks,
>
> Ilya
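
P.S. Here is a sketch of how I plan to bump the messenger logging on
osd.13 next time (just my plan, please correct me if there is a better
way on our release):

  # raise messenger debug on osd.13 at runtime, no restart required
  ceph tell osd.13 injectargs '--debug_ms 1'

  # or persist it in the cluster configuration database
  ceph config set osd.13 debug_ms 1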
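
And to compare the epochs the next time an op gets stuck, I plan to
check the osdc file on the client node (assuming debugfs is mounted
there) against the cluster's current map:

  # epoch as seen by the kernel client
  cat /sys/kernel/debug/ceph/*/osdc

  # epoch as seen by the cluster
  ceph osd dump | grep epoch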