On Thu, 8 Aug 2019 at 20:17, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Thu, Aug 8, 2019 at 2:05 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >
> > On Thu, Aug 8, 2019 at 10:19 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > >
> > > On Thu, 8 Aug 2019 at 15:52, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Aug 8, 2019 at 9:04 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, 8 Aug 2019 at 01:38, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Aug 7, 2019 at 10:15 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 6, 2019 at 1:11 PM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > After simplifying the environment, the issue can be reproduced much
> > > > > > > > more easily. The IPs and infrastructure are as follows:
> > > > > > > >
> > > > > > > > CephFS Kernel Client (172.16.10.7, linux-4.14)
> > > > > > > > Node1 (172.16.10.17)
> > > > > > > > - OSD0~2
> > > > > > > > Node2 (172.16.10.18)
> > > > > > > > - OSD3~7
> > > > > > > > Node3 (172.16.10.19)
> > > > > > > > - OSD8~12
> > > > > > > >
> > > > > > > > The client mounts the CephFS right after Node1 is set up, and every
> > > > > > > > time a new node joins the cluster, some files are added to the CephFS.
> > > > > > > > The issue always happens after Node3 joins the cluster: the stuck
> > > > > > > > write op seems to be sent to the old PG acting set although the osdmap
> > > > > > > > epoch on the client is already the same as the one on the cluster.
> > > > > > > >
> > > > > > > > osdmap epoch 69
> > > > > > > > ##############
> > > > > > > > Tue Aug 6 16:04:25 CST 2019
> > > > > > > > epoch 69 barrier 30 flags 0x188000
> > > > > > > > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 64 pg_num_mask 63
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > osd0 172.16.10.17:6802 100% (exists, up) 100%
> > > > > > > > osd1 172.16.10.17:6806 100% (exists, up) 100%
> > > > > > > > osd2 172.16.10.17:6810 100% (exists, up) 100%
> > > > > > > > osd3 172.16.10.18:6801 100% (exists, up) 100%
> > > > > > > > osd4 172.16.10.18:6805 100% (exists, up) 100%
> > > > > > > > osd5 172.16.10.18:6809 100% (exists, up) 100%
> > > > > > > > osd6 172.16.10.18:6813 100% (exists, up) 100%
> > > > > > > > osd7 172.16.10.18:6817 100% (exists, up) 100%
> > > > > > > > osd8 (unknown sockaddr family 0) 0% (exists) 100%
> > > > > > > >
> > > > > > > > (IO on Cephfs ...)
> > > > > > > >
> > > > > > > > osdmap epoch 103
> > > > > > > > ###############
> > > > > > > > Tue Aug 6 16:04:26 CST 2019
> > > > > > > > epoch 103 barrier 30 flags 0x188000
> > > > > > > > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> > > > > > > > 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > osd0 172.16.10.17:6802 100% (exists, up) 100%
> > > > > > > > osd1 172.16.10.17:6806 100% (exists, up) 100%
> > > > > > > > osd2 172.16.10.17:6810 100% (exists, up) 100%
> > > > > > > > osd3 172.16.10.18:6801 100% (exists, up) 100%
> > > > > > > > osd4 172.16.10.18:6805 100% (exists, up) 100%
> > > > > > > > osd5 172.16.10.18:6809 100% (exists, up) 100%
> > > > > > > > osd6 172.16.10.18:6813 100% (exists, up) 100%
> > > > > > > > osd7 172.16.10.18:6817 100% (exists, up) 100%
> > > > > > > > osd8 172.16.10.19:6801 100% (exists, up) 100%
> > > > > > > > osd9 172.16.10.19:6805 100% (exists, up) 100%
> > > > > > > > osd10 172.16.10.19:6809 100% (exists, up) 100%
> > > > > > > > osd11 172.16.10.19:6813 100% (exists, up) 100%
> > > > > > > > osd12 172.16.10.19:6817 100% (exists, up) 100%
> > > > > > > >
> > > > > > > > (IO on Cephfs ...)
> > > > > > > >
> > > > > > > > osdmap epoch 103
> > > > > > > > ###############
> > > > > > > > Tue Aug 6 16:04:38 CST 2019
> > > > > > > > epoch 103 barrier 30 flags 0x188000
> > > > > > > > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> > > > > > > > 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > osd0 172.16.10.17:6802 100% (exists, up) 100%
> > > > > > > > osd1 172.16.10.17:6806 100% (exists, up) 100%
> > > > > > > > osd2 172.16.10.17:6810 100% (exists, up) 100%
> > > > > > > > osd3 172.16.10.18:6801 100% (exists, up) 100%
> > > > > > > > osd4 172.16.10.18:6805 100% (exists, up) 100%
> > > > > > > > osd5 172.16.10.18:6809 100% (exists, up) 100%
> > > > > > > > osd6 172.16.10.18:6813 100% (exists, up) 100%
> > > > > > > > osd7 172.16.10.18:6817 100% (exists, up) 100%
> > > > > > > > osd8 172.16.10.19:6801 100% (exists, up) 100%
> > > > > > > > osd9 172.16.10.19:6805 100% (exists, up) 100%
> > > > > > > > osd10 172.16.10.19:6809 100% (exists, up) 100%
> > > > > > > > osd11 172.16.10.19:6813 100% (exists, up) 100%
> > > > > > > > osd12 172.16.10.19:6817 100% (exists, up) 100%
> > > > > > > > REQUESTS 13 homeless 0
> > > > > > > > 389 osd2 1.23964a4b 1.4b [2,4,12]/2 [2,4,12]/2
> > > > > > > > e103 10000000028.00000006 0x400024 1 write
> > > > > > > > 365 osd5 1.cde1721f 1.1f [5,10,2]/5 [5,10,2]/5
> > > > > > > > e103 10000000017.00000007 0x400024 1 write
> > > > > > > > 371 osd5 1.9d081620 1.20 [5,12,2]/5 [5,12,2]/5
> > > > > > > > e103 10000000025.00000007 0x400024 1 write
> > > > > > > > 375 osd5 1.8b5def1f 1.1f [5,10,2]/5 [5,10,2]/5
> > > > > > > > e103 1000000001f.00000006 0x400024 1 write
> > > > > > > > 204 osd7 1.4dbcd0b2 1.32 [7,5,0]/7 [7,5,0]/7
> > > > > > > > e103 10000000017.00000001 0x400024 1 write
> > > > > > > > 373 osd7 1.8f57faf5 1.75 [7,11,2]/7 [7,11,2]/7
> > > > > > > > e103 10000000027.00000007 0x400024 1 write
> > > > > > > > 369 osd8 1.cec2d5dd 1.5d [8,2,7]/8 [8,2,7]/8
> > > > > > > > e103 10000000020.00000007 0x400024 1 write
> > > > > > > > 378 osd8 1.3853fefc 1.7c [8,3,2]/8 [8,3,2]/8
> > > > > > > > e103 1000000001c.00000006 0x400024 1 write
> > > > > > > > 384 osd8 1.342be187 1.7 [8,6,2]/8 [8,6,2]/8
> > > > > > > > e103 1000000001b.00000006 0x400024 1 write
> > > > > > > > 390 osd11 1.1ac10bad 1.2d [11,5,2]/11 [11,5,2]/11
> > > > > > > > e103 10000000028.00000007 0x400024 1 write
> > > > > > > > 364 osd12 1.345417ca 1.4a [12,7,2]/12 [12,7,2]/12
> > > > > > > > e103 10000000017.00000006 0x400024 1 write
> > > > > > > > 374 osd12 1.50114f4a 1.4a [12,7,2]/12 [12,7,2]/12
> > > > > > > > e103 10000000026.00000007 0x400024 1 write
> > > > > > > > 381 osd12 1.d670203f 1.3f [12,2,4]/12 [12,2,4]/12
> > > > > > > > e103 10000000021.00000006 0x400024 1 write
> > > > > > > >
> > > > > > > > (IO stop ...)
> > > > > > > >
> > > > > > > > osdmap epoch 103
> > > > > > > > ###############
> > > > > > > > Tue Aug 6 16:04:39 CST 2019
> > > > > > > > epoch 103 barrier 30 flags 0x188000
> > > > > > > > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> > > > > > > > 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > osd0 172.16.10.17:6802 100% (exists, up) 100%
> > > > > > > > osd1 172.16.10.17:6806 100% (exists, up) 100%
> > > > > > > > osd2 172.16.10.17:6810 100% (exists, up) 100%
> > > > > > > > osd3 172.16.10.18:6801 100% (exists, up) 100%
> > > > > > > > osd4 172.16.10.18:6805 100% (exists, up) 100%
> > > > > > > > osd5 172.16.10.18:6809 100% (exists, up) 100%
> > > > > > > > osd6 172.16.10.18:6813 100% (exists, up) 100%
> > > > > > > > osd7 172.16.10.18:6817 100% (exists, up) 100%
> > > > > > > > osd8 172.16.10.19:6801 100% (exists, up) 100%
> > > > > > > > osd9 172.16.10.19:6805 100% (exists, up) 100%
> > > > > > > > osd10 172.16.10.19:6809 100% (exists, up) 100%
> > > > > > > > osd11 172.16.10.19:6813 100% (exists, up) 100%
> > > > > > > > osd12 172.16.10.19:6817 100% (exists, up) 100%
> > > > > > > > REQUESTS 1 homeless 0
> > > > > > > > 204 osd7 1.4dbcd0b2 1.32 [7,5,0]/7 [7,5,0]/7
> > > > > > > > e103 10000000017.00000001 0x400024 1 write
> > > > > > > >
> > > > > > > > Strangely, the acting set of pg 1.32 shown on ceph is [7,1,9] instead
> > > > > > > > of [7,5,0].
> > > > > > > >
> > > > > > > > [root@Jerry-x85-n2 ceph]# ceph pg dump | grep ^1.32
> > > > > > > > dumped all
> > > > > > > > 1.32 1 0 0 0 0 4194304
> > > > > > > > 3 3 active+clean 2019-08-06 16:03:33.978990 68'3 103:98
> > > > > > > > [7,1,9] 7 [7,1,9]
> > > > > > > >
> > > > > > > > [root@Jerry-x85-n2 ceph]# grep -rn "replicas change" ceph-osd.7.log | grep 1.32
> > > > > > > > ceph-osd.7.log:1844:2019-08-06 15:59:53.276 7fa390256700 10 osd.7
> > > > > > > > pg_epoch: 66 pg[1.32( empty local-lis/les=64/65 n=0 ec=47/16 lis/c
> > > > > > > > 64/64 les/c/f 65/65/0 66/66/64) [7,5,0] r=0 lpr=66 pi=[64,66)/1
> > > > > > > > crt=0'0 mlcod 0'0 unknown mbc={}] [7,5,2] -> [7,5,0], replicas changed
> > > > > > > > ceph-osd.7.log:15330:2019-08-06 16:02:38.769 7fa390256700 10 osd.7
> > > > > > > > pg_epoch: 84 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=66/67 n=2
> > > > > > > > ec=47/16 lis/c 66/66 les/c/f 67/67/0 66/84/64) [7,5,0] r=0 lpr=84
> > > > > > > > pi=[66,84)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] ->
> > > > > > > > [7,5,0], replicas changed
> > > > > > > > ceph-osd.7.log:25741:2019-08-06 16:02:53.618 7fa390256700 10 osd.7
> > > > > > > > pg_epoch: 90 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=84/85 n=1
> > > > > > > > ec=47/16 lis/c 84/84 les/c/f 85/85/0 90/90/64) [7,1,8] r=0 lpr=90
> > > > > > > > pi=[84,90)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] ->
> > > > > > > > [7,1,8], replicas changed
> > > > > > > > ceph-osd.7.log:37917:2019-08-06 16:03:17.932 7fa390256700 10 osd.7
> > > > > > > > pg_epoch: 100 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=90/91 n=1
> > > > > > > > ec=47/16 lis/c 90/90 les/c/f 91/97/0 100/100/64) [7,1,9] r=0 lpr=100
> > > > > > > > pi=[90,100)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,1,8] ->
> > > > > > > > [7,1,9], replicas changed
> > > > > > > >
> > > > > > > > Related logs with debug_osd=10 and debug_ms=1 are provided at
> > > > > > > > https://drive.google.com/open?id=1gYksDbCecisWtP05HEoSxevDK8sywKv6 .
> > > > > > > > Currently, I am tracing the code to figure out the root cause. Any
> > > > > > > > ideas or insights would be appreciated, thanks!
> > > > > > >
> > > > > > > Hi Jerry,
> > > > > > >
> > > > > > > So the original request was dropped by osd.7 because the PG got split:
> > > > > > >
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 dequeue_op
> > > > > > > 0x561adf90f500 prio 127 cost 4194304 latency 0.015565
> > > > > > > osd_op(client.4418.0:204 1.32 1.4dbcd0b2 (undecoded)
> > > > > > > ondisk+write+known_if_redirected e69) v8 pg pg[1.32( v 68'3 (0'0,68'3]
> > > > > > > local-lis/les=100/101 n=1 ec=47/16 lis/c 100/100 les/c/f 101/102/0
> > > > > > > 100/100/64) [7,1,9] r=0 lpr=100 crt=68'3 lcod 68'2 mlcod 0'0
> > > > > > > active+clean]
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 client.4418 has old
> > > > > > > map 69 < 103
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 send_incremental_map
> > > > > > > 69 -> 103 to 0x561ae4cfb600 172.16.10.7:0/3068540720
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 80 220 bytes
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 72 240 bytes
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 71 240 bytes
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 70 240 bytes
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 7 osd.7 pg_epoch: 103 pg[1.32( v
> > > > > > > 68'3 (0'0,68'3] local-lis/les=100/101 n=1 ec=47/16 lis/c 100/100
> > > > > > > les/c/f 101/102/0 100/100/64) [7,1,9] r=0 lpr=100 crt=68'3 lcod 68'2
> > > > > > > mlcod 0'0 active+clean] can_discard_op pg split in 84, dropping
> > > > > > > 2019-08-06 16:04:25.748 7fa390256700 10 osd.7 103 dequeue_op
> > > > > > > 0x561adf90f500 finish
> > > > > > >
> > > > > > > The request was never resent -- this appears to be a kernel client bug.
> > > > > > >
> > > > > > > It would be great if you could reproduce with
> > > > > > >
> > > > > > > echo 'file osd_client +p' > /sys/kernel/debug/dynamic_debug/control
> > > > > > >
> > > > > > > on the kernel client node and collect the same set of logs on the OSD
> > > > > > > side.
> > > > > > >
> > > > > > > Looking...
> > > > > >
> > > > > > Nothing so far. The split detection logic appears to work.
> > > > > > There were 11 other requests dropped by the OSDs due to splits and they
> > > > > > were all handled correctly:
> > > > > >
> > > > > > Split in epoch 47:
> > > > > >
> > > > > > orig osd2 osd_op(client.4418.0:5 1.f 1.b1e464af (undecoded)
> > > > > > ondisk+write+known_if_redirected e35)
> > > > > > retry osd0 osd_op(client.4418.0:5 1.2f 1.b1e464af (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e68)
> > > > > >
> > > > > > orig osd0 osd_op(client.4418.0:13 1.10 1.fec32590 (undecoded)
> > > > > > ondisk+write+known_if_redirected e35)
> > > > > > retry osd0 osd_op(client.4418.0:13 1.10 1.fec32590 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e68)
> > > > > >
> > > > > > Split in epoch 84:
> > > > > >
> > > > > > orig osd5 osd_op(client.4418.0:194 1.20 1.260bc0e0 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd0 osd_op(client.4418.0:194 1.60 1.260bc0e0 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd3 osd_op(client.4418.0:195 1.33 1.9b583df3 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd1 osd_op(client.4418.0:195 1.73 1.9b583df3 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd1 osd_op(client.4418.0:197 1.1a 1.a986dc1a (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd1 osd_op(client.4418.0:197 1.1a 1.a986dc1a (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd2 osd_op(client.4418.0:198 1.14 1.b93e34d4 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd4 osd_op(client.4418.0:198 1.54 1.b93e34d4 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd1 osd_op(client.4418.0:199 1.1a 1.92ca525a (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd11 osd_op(client.4418.0:199 1.5a 1.92ca525a (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd0 osd_op(client.4418.0:202 1.3b 1.6388bdbb (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd0 osd_op(client.4418.0:202 1.3b 1.6388bdbb (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd3 osd_op(client.4418.0:203 1.1e 1.fe5731e (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd3 osd_op(client.4418.0:203 1.1e 1.fe5731e (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd7 osd_op(client.4418.0:204 1.32 1.4dbcd0b2 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry ???
> > > > > >
> > > > > > orig osd0 osd_op(client.4418.0:205 1.10 1.42c29750 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd0 osd_op(client.4418.0:205 1.50 1.42c29750 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd3 osd_op(client.4418.0:207 1.33 1.918809b3 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd3 osd_op(client.4418.0:207 1.33 1.918809b3 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > This shows that both cases, staying in the old PG and moving to the new
> > > > > > PG, are handled correctly, whether the request stays on the same OSD or
> > > > > > moves to a new one.
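
(Side note, mostly to check my own understanding of the two outcomes above:
with pg_num going from 64 to 128, :205 (1.42c29750) moves from 1.10 to 1.50
while :204 (1.4dbcd0b2) keeps mapping to 1.32. The scratch program below is
only my re-typing of the ceph_stable_mod() logic from
include/linux/ceph/osdmap.h, fed with the hashes from the list above -- an
illustration, not code from the tree.)

    #include <stdio.h>

    /* same logic as ceph_stable_mod() in include/linux/ceph/osdmap.h */
    static unsigned int stable_mod(unsigned int x, unsigned int b,
                                   unsigned int bmask)
    {
            if ((x & bmask) < b)
                    return x & bmask;
            else
                    return x & (bmask >> 1);
    }

    int main(void)
    {
            /* raw hashes taken from the osd_op lines above */
            unsigned int h204 = 0x4dbcd0b2;     /* client.4418.0:204 */
            unsigned int h205 = 0x42c29750;     /* client.4418.0:205 */

            /* pg_num 64 (mask 63) before the split, 128 (mask 127) after */
            printf(":204  1.%x -> 1.%x\n",
                   stable_mod(h204, 64, 63), stable_mod(h204, 128, 127));
            printf(":205  1.%x -> 1.%x\n",
                   stable_mod(h205, 64, 63), stable_mod(h205, 128, 127));
            return 0;
    }

This prints ":204  1.32 -> 1.32" and ":205  1.10 -> 1.50", matching the list.
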
> > > > > > The request in question (tid 204) is on the right list, as indicated by
> > > > > > e103 in the osdc output, but it looks like something went wrong inside
> > > > > > calc_target().
> > > > > >
> > > > > > What is the exact kernel version you are running? I know you said
> > > > > > 4.14, but you also mentioned some customizations...
> > > > >
> > > > > Hi Ilya,
> > > > >
> > > > > Thanks for your help with analysing the issue. New logs are provided at
> > > > > https://drive.google.com/open?id=1dJ1-eGClDWf18yPrIQMRtLoLJ5xsN3jA .
> > > > > I'm also tracing calc_target(), and it seems that for those stuck
> > > > > requests calc_target() decides that no action (CALC_TARGET_NO_ACTION)
> > > > > is needed.
> > > > >
> > > > > REQUESTS 2 homeless 0
> > > > > 199 osd0 1.6388bdbb 1.3b [0,1,7]/0 [0,1,7]/0
> > > > > e106 10000000016.00000002 0x400024 1 write
> > > > > 202 osd0 1.42c29750 1.50 [0,2,5]/0 [0,2,5]/0
> > > > > e106 10000000017.00000002 0x400024 1 write
> > > > >
> > > > > <7>[11617.144891] libceph: applying incremental map 106 len 212
> > > > > <7>[11617.144893] libceph: scan_requests req ffff8808260614a0 tid 199
> > > > > <7>[11617.144898] libceph: calc_target t ffff8808260614e0 -> ct_res 0 osd 0
> > > > > ...
> > > > > <7>[11646.137530] libceph: req ffff8808260614a0 tid 199 on osd0 is laggy
> > > >
> > > > Right, that is what I expected. Looking...
> > > >
> > > > > The exact kernel version is linux-4.14.24, and most of the
> > > > > customizations are made in the block layer, file system, and SCSI
> > > > > driver. However, some patches related to the net/ceph and fs/ceph
> > > > > files in upstream linux-4.14.y have been backported, so I'm also
> > > > > checking whether those backports cause this issue.
> > > >
> > > > Can you pastebin or attach your osdmap.c and osd_client.c files?
> > >
> > > Hi Ilya,
> > >
> > > - osdmap https://pastebin.com/Aj5h0eLC
> > > - osd_client https://pastebin.com/FZNepUVe
> >
> > I think I see the problem.
> >
> > 1) creating the OSD session doesn't open the TCP socket, the messenger
> > opens it in try_write()
> >
> > 2) con->peer_features remains 0 until the TCP socket is opened and
> > either CEPH_MSGR_TAG_SEQ or CEPH_MSGR_TAG_READY is received
> >
> > 3) calc_target() uses con->peer_features to check for RESEND_ON_SPLIT
> > and resends only if it's set
> >
> > What happened is there was no osd0 session and tid 199 had to
> > create it before submitting itself to the messenger. However, before
> > the messenger got to tid 199, some other request with a pre-split epoch
> > had reached some other OSD and triggered osdmap incrementals. While
> > processing the split incremental, calc_target() for tid 199 returned
> > NO_ACTION:
> >
> >     if (unpaused || legacy_change || force_resend ||
> >         (split && con && CEPH_HAVE_FEATURE(con->peer_features,
> >                                            RESEND_ON_SPLIT)))
> >             ct_res = CALC_TARGET_NEED_RESEND;
> >     else
> >             ct_res = CALC_TARGET_NO_ACTION;
> >
> > This isn't logged, but I'm pretty sure that split was 1, con wasn't
> > NULL, and con->peer_features was 0.

Hi Ilya,

Yes, you're right! I just added more logs to print out split and
con->peer_features, and it's exactly as you suspected:

<7>[11347.891038] libceph: scan_requests req ffff880817fc5300 tid 215
<7>[11347.891043] libceph: calc_target t ffff880817fc5340 -> ct_res 0 osd 2 split 1 peer_features 0

> > I would have noticed this earlier, but I was misled by osdc output.
> > It shows outdated up/acting sets, which led me to assume that split
> > was 0 (because if split is 1, up/acting sets are updated _before_ the
> > con->peer_features check). However, in this case the up/acting sets
> > remained the same for a few epochs after the split, so they are not
> > actually outdated wrt the split.
> >
> > Checking con->peer_features in calc_target() is fundamentally racy...
>
> Objecter had the same bug; it was fixed by looking at the per-OSD
> features in the osd_xinfo array instead of con->peer_features:
>
> https://github.com/ceph/ceph/pull/23850
>
> I'll see if we can do the same thing here.

Again, thanks for your patience and kind help with this issue. And I'm
willing to re-verify it once a fix comes out :)

Thanks,
- Jerry

> Thanks,
>
>                 Ilya
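
P.S. Just to make sure I follow the direction for the kernel side: would a
minimal change along these lines be the idea (a rough, untested sketch on my
end, assuming we no longer need to worry about OSDs old enough to lack
RESEND_ON_SPLIT), or do you plan to mirror the Objecter approach and consult
per-OSD features from the osdmap instead of con->peer_features?

        /* sketch only: drop the racy con->peer_features check and
         * unconditionally resend when the mapping changed because of
         * a split */
        if (unpaused || legacy_change || force_resend || split)
                ct_res = CALC_TARGET_NEED_RESEND;
        else
                ct_res = CALC_TARGET_NO_ACTION;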