Hi,

After simplifying the environment, the issue can be reproduced much more easily. The IPs and infrastructure are described below:

CephFS kernel client (172.16.10.7, linux-4.14)
Node1 (172.16.10.17) - OSD0~2
Node2 (172.16.10.18) - OSD3~7
Node3 (172.16.10.19) - OSD8~12

The client mounts the CephFS right after Node1 is set up, and every time a new node joins the cluster, some files are added to the CephFS. The issue always happens after Node3 joins the cluster: the stuck write op seems to be sent to the old PG acting set even though the osdmap epoch on the client is already the same as the one on the cluster. The client-side snapshots below illustrate this:

osdmap epoch 69
##############
Tue Aug 6 16:04:25 CST 2019
epoch 69 barrier 30 flags 0x188000
pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 64 pg_num_mask 63 flags 0x1 lfor 0 read_tier -1 write_tier -1
pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31 flags 0x1 lfor 0 read_tier -1 write_tier -1
osd0    172.16.10.17:6802    100%    (exists, up)    100%
osd1    172.16.10.17:6806    100%    (exists, up)    100%
osd2    172.16.10.17:6810    100%    (exists, up)    100%
osd3    172.16.10.18:6801    100%    (exists, up)    100%
osd4    172.16.10.18:6805    100%    (exists, up)    100%
osd5    172.16.10.18:6809    100%    (exists, up)    100%
osd6    172.16.10.18:6813    100%    (exists, up)    100%
osd7    172.16.10.18:6817    100%    (exists, up)    100%
osd8    (unknown sockaddr family 0)    0%    (exists)    100%

(IO on Cephfs ...)

osdmap epoch 103
###############
Tue Aug 6 16:04:26 CST 2019
epoch 103 barrier 30 flags 0x188000
pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31 flags 0x1 lfor 0 read_tier -1 write_tier -1
osd0    172.16.10.17:6802    100%    (exists, up)    100%
osd1    172.16.10.17:6806    100%    (exists, up)    100%
osd2    172.16.10.17:6810    100%    (exists, up)    100%
osd3    172.16.10.18:6801    100%    (exists, up)    100%
osd4    172.16.10.18:6805    100%    (exists, up)    100%
osd5    172.16.10.18:6809    100%    (exists, up)    100%
osd6    172.16.10.18:6813    100%    (exists, up)    100%
osd7    172.16.10.18:6817    100%    (exists, up)    100%
osd8    172.16.10.19:6801    100%    (exists, up)    100%
osd9    172.16.10.19:6805    100%    (exists, up)    100%
osd10   172.16.10.19:6809    100%    (exists, up)    100%
osd11   172.16.10.19:6813    100%    (exists, up)    100%
osd12   172.16.10.19:6817    100%    (exists, up)    100%

(IO on Cephfs ...)
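(For reference, these client-side snapshots come from the kernel client's debugfs files. A minimal sketch of one way to capture such periodic snapshots, assuming debugfs is mounted at the default /sys/kernel/debug and there is only a single CephFS mount on the client; the log file name is just an example:

  while true; do
      echo "###############"
      date
      cat /sys/kernel/debug/ceph/*/osdmap   # the client's view of the osdmap
      cat /sys/kernel/debug/ceph/*/osdc     # in-flight requests ("REQUESTS ..." below)
      sleep 1
  done >> /tmp/client-osdc-trace.log

The osdc file is the quickest way to see which requests are stuck and which acting set the client believes they map to.)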
osdmap epoch 103
###############
Tue Aug 6 16:04:38 CST 2019
epoch 103 barrier 30 flags 0x188000
pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31 flags 0x1 lfor 0 read_tier -1 write_tier -1
osd0    172.16.10.17:6802    100%    (exists, up)    100%
osd1    172.16.10.17:6806    100%    (exists, up)    100%
osd2    172.16.10.17:6810    100%    (exists, up)    100%
osd3    172.16.10.18:6801    100%    (exists, up)    100%
osd4    172.16.10.18:6805    100%    (exists, up)    100%
osd5    172.16.10.18:6809    100%    (exists, up)    100%
osd6    172.16.10.18:6813    100%    (exists, up)    100%
osd7    172.16.10.18:6817    100%    (exists, up)    100%
osd8    172.16.10.19:6801    100%    (exists, up)    100%
osd9    172.16.10.19:6805    100%    (exists, up)    100%
osd10   172.16.10.19:6809    100%    (exists, up)    100%
osd11   172.16.10.19:6813    100%    (exists, up)    100%
osd12   172.16.10.19:6817    100%    (exists, up)    100%

REQUESTS 13 homeless 0
389     osd2    1.23964a4b    1.4b    [2,4,12]/2    [2,4,12]/2    e103    10000000028.00000006    0x400024    1    write
365     osd5    1.cde1721f    1.1f    [5,10,2]/5    [5,10,2]/5    e103    10000000017.00000007    0x400024    1    write
371     osd5    1.9d081620    1.20    [5,12,2]/5    [5,12,2]/5    e103    10000000025.00000007    0x400024    1    write
375     osd5    1.8b5def1f    1.1f    [5,10,2]/5    [5,10,2]/5    e103    1000000001f.00000006    0x400024    1    write
204     osd7    1.4dbcd0b2    1.32    [7,5,0]/7     [7,5,0]/7     e103    10000000017.00000001    0x400024    1    write
373     osd7    1.8f57faf5    1.75    [7,11,2]/7    [7,11,2]/7    e103    10000000027.00000007    0x400024    1    write
369     osd8    1.cec2d5dd    1.5d    [8,2,7]/8     [8,2,7]/8     e103    10000000020.00000007    0x400024    1    write
378     osd8    1.3853fefc    1.7c    [8,3,2]/8     [8,3,2]/8     e103    1000000001c.00000006    0x400024    1    write
384     osd8    1.342be187    1.7     [8,6,2]/8     [8,6,2]/8     e103    1000000001b.00000006    0x400024    1    write
390     osd11   1.1ac10bad    1.2d    [11,5,2]/11   [11,5,2]/11   e103    10000000028.00000007    0x400024    1    write
364     osd12   1.345417ca    1.4a    [12,7,2]/12   [12,7,2]/12   e103    10000000017.00000006    0x400024    1    write
374     osd12   1.50114f4a    1.4a    [12,7,2]/12   [12,7,2]/12   e103    10000000026.00000007    0x400024    1    write
381     osd12   1.d670203f    1.3f    [12,2,4]/12   [12,2,4]/12   e103    10000000021.00000006    0x400024    1    write

(IO stop ...)

osdmap epoch 103
###############
Tue Aug 6 16:04:39 CST 2019
epoch 103 barrier 30 flags 0x188000
pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31 flags 0x1 lfor 0 read_tier -1 write_tier -1
osd0    172.16.10.17:6802    100%    (exists, up)    100%
osd1    172.16.10.17:6806    100%    (exists, up)    100%
osd2    172.16.10.17:6810    100%    (exists, up)    100%
osd3    172.16.10.18:6801    100%    (exists, up)    100%
osd4    172.16.10.18:6805    100%    (exists, up)    100%
osd5    172.16.10.18:6809    100%    (exists, up)    100%
osd6    172.16.10.18:6813    100%    (exists, up)    100%
osd7    172.16.10.18:6817    100%    (exists, up)    100%
osd8    172.16.10.19:6801    100%    (exists, up)    100%
osd9    172.16.10.19:6805    100%    (exists, up)    100%
osd10   172.16.10.19:6809    100%    (exists, up)    100%
osd11   172.16.10.19:6813    100%    (exists, up)    100%
osd12   172.16.10.19:6817    100%    (exists, up)    100%

REQUESTS 1 homeless 0
204     osd7    1.4dbcd0b2    1.32    [7,5,0]/7     [7,5,0]/7     e103    10000000017.00000001    0x400024    1    write

Strangely, the acting set of pg 1.32 shown on ceph is [7,1,9] instead of [7,5,0].
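(To cross-check which mapping the cluster itself computes for that PG, it can also be queried directly. A quick sketch; the pg id and object name are taken from the stuck request above, and the output format can differ slightly between Ceph releases:

  ceph pg map 1.32
  # or derive the mapping from the object named in the stuck request:
  ceph osd map cephfs_data 10000000017.00000001

If these report [7,1,9] while the client's osdc still shows [7,5,0] at the same epoch e103, the client is clearly holding a stale acting set for that one request.)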
[root@Jerry-x85-n2 ceph]# ceph pg dump | grep ^1.32
dumped all
1.32    1    0    0    0    0    4194304    3    3    active+clean    2019-08-06 16:03:33.978990    68'3    103:98    [7,1,9]    7    [7,1,9]

[root@Jerry-x85-n2 ceph]# grep -rn "replicas change" ceph-osd.7.log | grep 1.32
ceph-osd.7.log:1844:2019-08-06 15:59:53.276 7fa390256700 10 osd.7 pg_epoch: 66 pg[1.32( empty local-lis/les=64/65 n=0 ec=47/16 lis/c 64/64 les/c/f 65/65/0 66/66/64) [7,5,0] r=0 lpr=66 pi=[64,66)/1 crt=0'0 mlcod 0'0 unknown mbc={}] [7,5,2] -> [7,5,0], replicas changed
ceph-osd.7.log:15330:2019-08-06 16:02:38.769 7fa390256700 10 osd.7 pg_epoch: 84 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=66/67 n=2 ec=47/16 lis/c 66/66 les/c/f 67/67/0 66/84/64) [7,5,0] r=0 lpr=84 pi=[66,84)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] -> [7,5,0], replicas changed
ceph-osd.7.log:25741:2019-08-06 16:02:53.618 7fa390256700 10 osd.7 pg_epoch: 90 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=84/85 n=1 ec=47/16 lis/c 84/84 les/c/f 85/85/0 90/90/64) [7,1,8] r=0 lpr=90 pi=[84,90)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] -> [7,1,8], replicas changed
ceph-osd.7.log:37917:2019-08-06 16:03:17.932 7fa390256700 10 osd.7 pg_epoch: 100 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=90/91 n=1 ec=47/16 lis/c 90/90 les/c/f 91/97/0 100/100/64) [7,1,9] r=0 lpr=100 pi=[90,100)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,1,8] -> [7,1,9], replicas changed

Related logs with debug_osd=10 and debug_ms=1 are provided at
https://drive.google.com/open?id=1gYksDbCecisWtP05HEoSxevDK8sywKv6 .
Currently, I am tracing the code to figure out the root cause. Any
ideas or insights would be appreciated, thanks!

- Jerry

On Wed, 31 Jul 2019 at 09:49, Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
>
> On Tue, 30 Jul 2019 at 23:02, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >
> > On Tue, Jul 30, 2019 at 11:20 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > >
> > > Hello Ilya,
> > >
> > > On Mon, 29 Jul 2019 at 16:42, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Jul 26, 2019 at 11:23 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > > > >
> > > > > Some additional information is provided below:
> > > > >
> > > > > I tried to restart the active MDS, and after the standby MDS took
> > > > > over, there was no client session recorded in the output of `ceph
> > > > > daemon mds.xxx session ls`. When I restarted the OSD.13 daemon, the
> > > > > stuck write op finished immediately. Thanks.
> > > >
> > > > So it happened again with the same OSD? Did you see this with other
> > > > OSDs?
> > >
> > > Yes. The issue always happened on the same OSD in previous
> > > experience. However, it did happen with other OSD on other node from
> > > the Cephfs kernel client's point of view.
> >
> > Hi Jerry,
> >
> > I'm not sure what you mean by "it did happen with other OSD on other
> > node from the Cephfs kernel client's point of view".
> >
>
> Hi Ilya,
>
> Sorry, it simply means that I had only seen OSDs on Node2 and Node3
> shown in the osdc debug output when encountering the issue, but I
> didn't see any stuck write op sent to an OSD on Node1. So, in the
> beginning, I thought that there might be some network connection issues
> among the nodes.
>
> Node1 (where the kernel client umount stuck)
> - OSD.0
> - OSD.1
> - ...
> Node2
> - OSD.5
> - OSD.6
> - ...
> Node3
> - OSD.10
> - OSD.11
> - ...
>
> > > >
> > > > Try enabling some logging on osd.13 since this seems to be a recurring
> > > > issue.
> > > > At least "debug ms = 1" so we can see whether it ever sends the
> > > > reply to the original op (i.e. prior to restart).
> > >
> > > Got it, I will raise the debug level to retrieve more logs for further
> > > investigation.
> > >
> > > >
> > > > Also, take note of the epoch in the osdc output:
> > > >
> > > >   36      osd13   ...     e327    ...
> > > >
> > > > Does "ceph osd dump" show the same epoch when things are stuck?
> > >
> > > Unfortunately, the environment was gone. But from the logs captured
> > > before, the epoch seems to be consistent between the client and the ceph
> > > cluster when things are stuck, right?
> > >
> > > 2019-07-26 12:24:08.475 7f06efebc700  0 log_channel(cluster) log [DBG]
> > > : osdmap e306: 15 total, 15 up, 15 in
> > >
> > > BTW, logs of OSD.13 and dynamic debug kernel logs of libceph captured
> > > on the stuck node are provided at
> > > https://drive.google.com/drive/folders/1gYksDbCecisWtP05HEoSxevDK8sywKv6?usp=sharing.
> >
> > The libceph log confirms that it had two laggy requests but it ends
> > before you restarted the OSD. The OSD log is useless: we really need
> > to see individual ops coming in and replies going out. It appears that
> > the OSD simply dropped those ops expecting the kernel to resend them
> > but that didn't happen for some reason.
>
> Thanks for the analysis. I will raise the debug level and hope more
> clues can be captured.
>
> - Jerry
>
> >
> > Thanks,
> >
> > Ilya
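P.S. For anyone who wants to capture similar logs, here is a rough sketch of how these debug levels can be raised at runtime; osd.7 is only an example target, and the exact command forms may vary between Ceph releases:

  # on an admin node: bump messenger and OSD debug logging on the suspect OSD
  ceph tell osd.7 injectargs '--debug_ms 1 --debug_osd 10'

  # on the kernel client: enable libceph/ceph dynamic debug so that
  # request send/resend decisions show up in the kernel log (dmesg)
  echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
  echo 'module ceph +p'    > /sys/kernel/debug/dynamic_debug/control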