On Thu, 8 Aug 2019 at 20:17, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Thu, Aug 8, 2019 at 2:05 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >
> > On Thu, Aug 8, 2019 at 10:19 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > >
> > > On Thu, 8 Aug 2019 at 15:52, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Aug 8, 2019 at 9:04 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, 8 Aug 2019 at 01:38, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Aug 7, 2019 at 10:15 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 6, 2019 at 1:11 PM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > After simplifying the environment, the issue can be reproduced much
> > > > > > > > more easily. The IPs and infrastructure are as follows:
> > > > > > > >
> > > > > > > > CephFS Kernel Client (172.16.10.7, linux-4.14)
> > > > > > > > Node1 (172.16.10.17)
> > > > > > > > - OSD0~2
> > > > > > > > Node2 (172.16.10.18)
> > > > > > > > - OSD3~7
> > > > > > > > Node3 (172.16.10.19)
> > > > > > > > - OSD8~12
> > > > > > > >
> > > > > > > > The client mounts the CephFS right after Node1 is set up, and every
> > > > > > > > time a new node joins the cluster, some files are added to the CephFS.
> > > > > > > > The issue always happens after Node3 joins the cluster: the stuck
> > > > > > > > write op seems to be sent to the old PG acting set although the osdmap
> > > > > > > > epoch on the client is already the same as the one on the cluster.
> > > > > > > >
> > > > > > > > osdmap epoch 69
> > > > > > > > ##############
> > > > > > > > Tue Aug 6 16:04:25 CST 2019
> > > > > > > > epoch 69 barrier 30 flags 0x188000
> > > > > > > > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 64 pg_num_mask 63
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > osd0 172.16.10.17:6802 100% (exists, up) 100%
> > > > > > > > osd1 172.16.10.17:6806 100% (exists, up) 100%
> > > > > > > > osd2 172.16.10.17:6810 100% (exists, up) 100%
> > > > > > > > osd3 172.16.10.18:6801 100% (exists, up) 100%
> > > > > > > > osd4 172.16.10.18:6805 100% (exists, up) 100%
> > > > > > > > osd5 172.16.10.18:6809 100% (exists, up) 100%
> > > > > > > > osd6 172.16.10.18:6813 100% (exists, up) 100%
> > > > > > > > osd7 172.16.10.18:6817 100% (exists, up) 100%
> > > > > > > > osd8 (unknown sockaddr family 0) 0% (exists) 100%
> > > > > > > >
> > > > > > > > (IO on Cephfs ...)
> > > > > > > >
> > > > > > > > osdmap epoch 103
> > > > > > > > ###############
> > > > > > > > Tue Aug 6 16:04:26 CST 2019
> > > > > > > > epoch 103 barrier 30 flags 0x188000
> > > > > > > > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> > > > > > > > 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > osd0 172.16.10.17:6802 100% (exists, up) 100%
> > > > > > > > osd1 172.16.10.17:6806 100% (exists, up) 100%
> > > > > > > > osd2 172.16.10.17:6810 100% (exists, up) 100%
> > > > > > > > osd3 172.16.10.18:6801 100% (exists, up) 100%
> > > > > > > > osd4 172.16.10.18:6805 100% (exists, up) 100%
> > > > > > > > osd5 172.16.10.18:6809 100% (exists, up) 100%
> > > > > > > > osd6 172.16.10.18:6813 100% (exists, up) 100%
> > > > > > > > osd7 172.16.10.18:6817 100% (exists, up) 100%
> > > > > > > > osd8 172.16.10.19:6801 100% (exists, up) 100%
> > > > > > > > osd9 172.16.10.19:6805 100% (exists, up) 100%
> > > > > > > > osd10 172.16.10.19:6809 100% (exists, up) 100%
> > > > > > > > osd11 172.16.10.19:6813 100% (exists, up) 100%
> > > > > > > > osd12 172.16.10.19:6817 100% (exists, up) 100%
> > > > > > > >
> > > > > > > > (IO on Cephfs ...)
> > > > > > > >
> > > > > > > > osdmap epoch 103
> > > > > > > > ###############
> > > > > > > > Tue Aug 6 16:04:38 CST 2019
> > > > > > > > epoch 103 barrier 30 flags 0x188000
> > > > > > > > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> > > > > > > > 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > osd0 172.16.10.17:6802 100% (exists, up) 100%
> > > > > > > > osd1 172.16.10.17:6806 100% (exists, up) 100%
> > > > > > > > osd2 172.16.10.17:6810 100% (exists, up) 100%
> > > > > > > > osd3 172.16.10.18:6801 100% (exists, up) 100%
> > > > > > > > osd4 172.16.10.18:6805 100% (exists, up) 100%
> > > > > > > > osd5 172.16.10.18:6809 100% (exists, up) 100%
> > > > > > > > osd6 172.16.10.18:6813 100% (exists, up) 100%
> > > > > > > > osd7 172.16.10.18:6817 100% (exists, up) 100%
> > > > > > > > osd8 172.16.10.19:6801 100% (exists, up) 100%
> > > > > > > > osd9 172.16.10.19:6805 100% (exists, up) 100%
> > > > > > > > osd10 172.16.10.19:6809 100% (exists, up) 100%
> > > > > > > > osd11 172.16.10.19:6813 100% (exists, up) 100%
> > > > > > > > osd12 172.16.10.19:6817 100% (exists, up) 100%
> > > > > > > > REQUESTS 13 homeless 0
> > > > > > > > 389 osd2 1.23964a4b 1.4b [2,4,12]/2 [2,4,12]/2
> > > > > > > > e103 10000000028.00000006 0x400024 1 write
> > > > > > > > 365 osd5 1.cde1721f 1.1f [5,10,2]/5 [5,10,2]/5
> > > > > > > > e103 10000000017.00000007 0x400024 1 write
> > > > > > > > 371 osd5 1.9d081620 1.20 [5,12,2]/5 [5,12,2]/5
> > > > > > > > e103 10000000025.00000007 0x400024 1 write
> > > > > > > > 375 osd5 1.8b5def1f 1.1f [5,10,2]/5 [5,10,2]/5
> > > > > > > > e103 1000000001f.00000006 0x400024 1 write
> > > > > > > > 204 osd7 1.4dbcd0b2 1.32 [7,5,0]/7 [7,5,0]/7
> > > > > > > > e103 10000000017.00000001 0x400024 1 write
> > > > > > > > 373 osd7 1.8f57faf5 1.75 [7,11,2]/7 [7,11,2]/7
> > > > > > > > e103 10000000027.00000007 0x400024 1 write
> > > > > > > > 369 osd8 1.cec2d5dd 1.5d [8,2,7]/8 [8,2,7]/8
> > > > > > > > e103 10000000020.00000007 0x400024 1 write
> > > > > > > > 378 osd8 1.3853fefc 1.7c [8,3,2]/8 [8,3,2]/8
> > > > > > > > e103 1000000001c.00000006 0x400024 1 write
> > > > > > > > 384 osd8 1.342be187 1.7 [8,6,2]/8 [8,6,2]/8
> > > > > > > > e103 1000000001b.00000006 0x400024 1 write
> > > > > > > > 390 osd11 1.1ac10bad 1.2d [11,5,2]/11 [11,5,2]/11
> > > > > > > > e103 10000000028.00000007 0x400024 1 write
> > > > > > > > 364 osd12 1.345417ca 1.4a [12,7,2]/12 [12,7,2]/12
> > > > > > > > e103 10000000017.00000006 0x400024 1 write
> > > > > > > > 374 osd12 1.50114f4a 1.4a [12,7,2]/12 [12,7,2]/12
> > > > > > > > e103 10000000026.00000007 0x400024 1 write
> > > > > > > > 381 osd12 1.d670203f 1.3f [12,2,4]/12 [12,2,4]/12
> > > > > > > > e103 10000000021.00000006 0x400024 1 write
> > > > > > > >
> > > > > > > > (IO stop ...)
> > > > > > > >
> > > > > > > > osdmap epoch 103
> > > > > > > > ###############
> > > > > > > > Tue Aug 6 16:04:39 CST 2019
> > > > > > > > epoch 103 barrier 30 flags 0x188000
> > > > > > > > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> > > > > > > > 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > > > > > > > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > > > > > > > osd0 172.16.10.17:6802 100% (exists, up) 100%
> > > > > > > > osd1 172.16.10.17:6806 100% (exists, up) 100%
> > > > > > > > osd2 172.16.10.17:6810 100% (exists, up) 100%
> > > > > > > > osd3 172.16.10.18:6801 100% (exists, up) 100%
> > > > > > > > osd4 172.16.10.18:6805 100% (exists, up) 100%
> > > > > > > > osd5 172.16.10.18:6809 100% (exists, up) 100%
> > > > > > > > osd6 172.16.10.18:6813 100% (exists, up) 100%
> > > > > > > > osd7 172.16.10.18:6817 100% (exists, up) 100%
> > > > > > > > osd8 172.16.10.19:6801 100% (exists, up) 100%
> > > > > > > > osd9 172.16.10.19:6805 100% (exists, up) 100%
> > > > > > > > osd10 172.16.10.19:6809 100% (exists, up) 100%
> > > > > > > > osd11 172.16.10.19:6813 100% (exists, up) 100%
> > > > > > > > osd12 172.16.10.19:6817 100% (exists, up) 100%
> > > > > > > > REQUESTS 1 homeless 0
> > > > > > > > 204 osd7 1.4dbcd0b2 1.32 [7,5,0]/7 [7,5,0]/7
> > > > > > > > e103 10000000017.00000001 0x400024 1 write
> > > > > > > >
> > > > > > > > Strangely, the acting set of pg 1.32 shown on ceph is [7,1,9] instead
> > > > > > > > of [7,5,0].
> > > > > > > >
> > > > > > > > [root@Jerry-x85-n2 ceph]# ceph pg dump | grep ^1.32
> > > > > > > > dumped all
> > > > > > > > 1.32 1 0 0 0 0 4194304
> > > > > > > > 3 3 active+clean 2019-08-06 16:03:33.978990 68'3 103:98
> > > > > > > > [7,1,9] 7 [7,1,9]
> > > > > > > >
> > > > > > > > [root@Jerry-x85-n2 ceph]# grep -rn "replicas change" ceph-osd.7.log | grep 1.32
> > > > > > > > ceph-osd.7.log:1844:2019-08-06 15:59:53.276 7fa390256700 10 osd.7
> > > > > > > > pg_epoch: 66 pg[1.32( empty local-lis/les=64/65 n=0 ec=47/16 lis/c
> > > > > > > > 64/64 les/c/f 65/65/0 66/66/64) [7,5,0] r=0 lpr=66 pi=[64,66)/1
> > > > > > > > crt=0'0 mlcod 0'0 unknown mbc={}] [7,5,2] -> [7,5,0], replicas changed
> > > > > > > > ceph-osd.7.log:15330:2019-08-06 16:02:38.769 7fa390256700 10 osd.7
> > > > > > > > pg_epoch: 84 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=66/67 n=2
> > > > > > > > ec=47/16 lis/c 66/66 les/c/f 67/67/0 66/84/64) [7,5,0] r=0 lpr=84
> > > > > > > > pi=[66,84)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] ->
> > > > > > > > [7,5,0], replicas changed
> > > > > > > > ceph-osd.7.log:25741:2019-08-06 16:02:53.618 7fa390256700 10 osd.7
> > > > > > > > pg_epoch: 90 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=84/85 n=1
> > > > > > > > ec=47/16 lis/c 84/84 les/c/f 85/85/0 90/90/64) [7,1,8] r=0 lpr=90
> > > > > > > > pi=[84,90)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] ->
> > > > > > > > [7,1,8], replicas changed
> > > > > > > > ceph-osd.7.log:37917:2019-08-06 16:03:17.932 7fa390256700 10 osd.7
> > > > > > > > pg_epoch: 100 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=90/91 n=1
> > > > > > > > ec=47/16 lis/c 90/90 les/c/f 91/97/0 100/100/64) [7,1,9] r=0 lpr=100
> > > > > > > > pi=[90,100)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,1,8] ->
> > > > > > > > [7,1,9], replicas changed
> > > > > > > >
> > > > > > > > Related logs with debug_osd=10 and debug_ms=1 are provided at
> > > > > > > > https://drive.google.com/open?id=1gYksDbCecisWtP05HEoSxevDK8sywKv6 .
> > > > > > > > Currently, I am tracing the code to figure out the root cause. Any
> > > > > > > > ideas or insights would be appreciated, thanks!
> > > > > > >
> > > > > > > Hi Jerry,
> > > > > > >
> > > > > > > So the original request was dropped by osd.7 because the PG got split:
> > > > > > >
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 dequeue_op
> > > > > > > 0x561adf90f500 prio 127 cost 4194304 latency 0.015565
> > > > > > > osd_op(client.4418.0:204 1.32 1.4dbcd0b2 (undecoded)
> > > > > > > ondisk+write+known_if_redirected e69) v8 pg pg[1.32( v 68'3 (0'0,68'3]
> > > > > > > local-lis/les=100/101 n=1 ec=47/16 lis/c 100/100 les/c/f 101/102/0
> > > > > > > 100/100/64) [7,1,9] r=0 lpr=100 crt=68'3 lcod 68'2 mlcod 0'0
> > > > > > > active+clean]
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 client.4418 has old
> > > > > > > map 69 < 103
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 send_incremental_map
> > > > > > > 69 -> 103 to 0x561ae4cfb600 172.16.10.7:0/3068540720
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 80 220 bytes
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 72 240 bytes
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 71 240 bytes
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 70 240 bytes
> > > > > > > 2019-08-06 16:04:25.747 7fa390256700 7 osd.7 pg_epoch: 103 pg[1.32( v
> > > > > > > 68'3 (0'0,68'3] local-lis/les=100/101 n=1 ec=47/16 lis/c 100/100
> > > > > > > les/c/f 101/102/0 100/100/64) [7,1,9] r=0 lpr=100 crt=68'3 lcod 68'2
> > > > > > > mlcod 0'0 active+clean] can_discard_op pg split in 84, dropping
> > > > > > > 2019-08-06 16:04:25.748 7fa390256700 10 osd.7 103 dequeue_op
> > > > > > > 0x561adf90f500 finish
> > > > > > >
> > > > > > > The request was never resent -- this appears to be a kernel client bug.
> > > > > > >
> > > > > > > It would be great if you could reproduce with
> > > > > > >
> > > > > > > echo 'file osd_client +p' > /sys/kernel/debug/dynamic_debug/control
> > > > > > >
> > > > > > > on the kernel client node and collect the same set of logs on the OSD
> > > > > > > side.
> > > > > > >
> > > > > > > Looking...
> > > > > >
> > > > > > Nothing so far. The split detection logic appears to work.
> > > > > > There were 11 other requests dropped by the OSDs due to splits and they
> > > > > > were all handled correctly:
> > > > > >
> > > > > > Split in epoch 47:
> > > > > >
> > > > > > orig osd2 osd_op(client.4418.0:5 1.f 1.b1e464af (undecoded)
> > > > > > ondisk+write+known_if_redirected e35)
> > > > > > retry osd0 osd_op(client.4418.0:5 1.2f 1.b1e464af (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e68)
> > > > > >
> > > > > > orig osd0 osd_op(client.4418.0:13 1.10 1.fec32590 (undecoded)
> > > > > > ondisk+write+known_if_redirected e35)
> > > > > > retry osd0 osd_op(client.4418.0:13 1.10 1.fec32590 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e68)
> > > > > >
> > > > > > Split in epoch 84:
> > > > > >
> > > > > > orig osd5 osd_op(client.4418.0:194 1.20 1.260bc0e0 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd0 osd_op(client.4418.0:194 1.60 1.260bc0e0 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd3 osd_op(client.4418.0:195 1.33 1.9b583df3 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd1 osd_op(client.4418.0:195 1.73 1.9b583df3 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd1 osd_op(client.4418.0:197 1.1a 1.a986dc1a (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd1 osd_op(client.4418.0:197 1.1a 1.a986dc1a (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd2 osd_op(client.4418.0:198 1.14 1.b93e34d4 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd4 osd_op(client.4418.0:198 1.54 1.b93e34d4 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd1 osd_op(client.4418.0:199 1.1a 1.92ca525a (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd11 osd_op(client.4418.0:199 1.5a 1.92ca525a (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd0 osd_op(client.4418.0:202 1.3b 1.6388bdbb (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd0 osd_op(client.4418.0:202 1.3b 1.6388bdbb (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd3 osd_op(client.4418.0:203 1.1e 1.fe5731e (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd3 osd_op(client.4418.0:203 1.1e 1.fe5731e (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd7 osd_op(client.4418.0:204 1.32 1.4dbcd0b2 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry ???
> > > > > >
> > > > > > orig osd0 osd_op(client.4418.0:205 1.10 1.42c29750 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd0 osd_op(client.4418.0:205 1.50 1.42c29750 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > orig osd3 osd_op(client.4418.0:207 1.33 1.918809b3 (undecoded)
> > > > > > ondisk+write+known_if_redirected e69)
> > > > > > retry osd3 osd_op(client.4418.0:207 1.33 1.918809b3 (undecoded)
> > > > > > ondisk+retry+write+known_if_redirected e103)
> > > > > >
> > > > > > This shows that both cases, staying in the old PG and moving to the new
> > > > > > PG, are handled correctly, whether the request stays on the same OSD or
> > > > > > moves to a new one.
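
(Side note, mostly to check my own understanding of the two outcomes above:
with pg_num going from 64 to 128, :205 (1.42c29750) moves from 1.10 to 1.50
while :204 (1.4dbcd0b2) keeps mapping to 1.32. The scratch program below is
only my re-typing of the ceph_stable_mod() logic from
include/linux/ceph/osdmap.h, fed with the hashes from the list above -- an
illustration, not code from the tree.)

    #include <stdio.h>

    /* same logic as ceph_stable_mod() in include/linux/ceph/osdmap.h */
    static unsigned int stable_mod(unsigned int x, unsigned int b,
                                   unsigned int bmask)
    {
            if ((x & bmask) < b)
                    return x & bmask;
            else
                    return x & (bmask >> 1);
    }

    int main(void)
    {
            /* raw hashes taken from the osd_op lines above */
            unsigned int h204 = 0x4dbcd0b2;     /* client.4418.0:204 */
            unsigned int h205 = 0x42c29750;     /* client.4418.0:205 */

            /* pg_num 64 (mask 63) before the split, 128 (mask 127) after */
            printf(":204  1.%x -> 1.%x\n",
                   stable_mod(h204, 64, 63), stable_mod(h204, 128, 127));
            printf(":205  1.%x -> 1.%x\n",
                   stable_mod(h205, 64, 63), stable_mod(h205, 128, 127));
            return 0;
    }

This prints ":204  1.32 -> 1.32" and ":205  1.10 -> 1.50", matching the list.
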
> > > > > > The request in question (tid 204) is on the right list, as indicated by
> > > > > > e103 in the osdc output, but it looks like something went wrong inside
> > > > > > calc_target().
> > > > > >
> > > > > > What is the exact kernel version you are running? I know you said
> > > > > > 4.14, but you also mentioned some customizations...
> > > > >
> > > > > Hi Ilya,
> > > > >
> > > > > Thanks for your help with analysing the issue. New logs are provided at
> > > > > https://drive.google.com/open?id=1dJ1-eGClDWf18yPrIQMRtLoLJ5xsN3jA .
> > > > > I'm also tracing calc_target(), and it seems that for those stuck
> > > > > requests calc_target() decides that no action (CALC_TARGET_NO_ACTION)
> > > > > is needed.
> > > > >
> > > > > REQUESTS 2 homeless 0
> > > > > 199 osd0 1.6388bdbb 1.3b [0,1,7]/0 [0,1,7]/0
> > > > > e106 10000000016.00000002 0x400024 1 write
> > > > > 202 osd0 1.42c29750 1.50 [0,2,5]/0 [0,2,5]/0
> > > > > e106 10000000017.00000002 0x400024 1 write
> > > > >
> > > > > <7>[11617.144891] libceph: applying incremental map 106 len 212
> > > > > <7>[11617.144893] libceph: scan_requests req ffff8808260614a0 tid 199
> > > > > <7>[11617.144898] libceph: calc_target t ffff8808260614e0 -> ct_res 0 osd 0
> > > > > ...
> > > > > <7>[11646.137530] libceph: req ffff8808260614a0 tid 199 on osd0 is laggy
> > > >
> > > > Right, that is what I expected. Looking...
> > > >
> > > > > The exact kernel version is linux-4.14.24, and most of the
> > > > > customizations are made in the block layer, file system, and SCSI
> > > > > driver. However, some patches related to the net/ceph and fs/ceph
> > > > > files in upstream linux-4.14.y have been backported, so I'm also
> > > > > checking whether those backports cause this issue.
> > > >
> > > > Can you pastebin or attach your osdmap.c and osd_client.c files?
> > >
> > > Hi Ilya,
> > >
> > > - osdmap https://pastebin.com/Aj5h0eLC
> > > - osd_client https://pastebin.com/FZNepUVe
> >
> > I think I see the problem.
> >
> > 1) creating the OSD session doesn't open the TCP socket, the messenger
> > opens it in try_write()
> >
> > 2) con->peer_features remains 0 until the TCP socket is opened and
> > either CEPH_MSGR_TAG_SEQ or CEPH_MSGR_TAG_READY is received
> >
> > 3) calc_target() uses con->peer_features to check for RESEND_ON_SPLIT
> > and resends only if it's set
> >
> > What happened is there was no osd0 session and tid 199 had to
> > create it before submitting itself to the messenger. However, before
> > the messenger got to tid 199, some other request with a pre-split epoch
> > had reached some other OSD and triggered osdmap incrementals. While
> > processing the split incremental, calc_target() for tid 199 returned
> > NO_ACTION:
> >
> >     if (unpaused || legacy_change || force_resend ||
> >         (split && con && CEPH_HAVE_FEATURE(con->peer_features,
> >                                            RESEND_ON_SPLIT)))
> >             ct_res = CALC_TARGET_NEED_RESEND;
> >     else
> >             ct_res = CALC_TARGET_NO_ACTION;
> >
> > This isn't logged, but I'm pretty sure that split was 1, con wasn't
> > NULL, and con->peer_features was 0.

Hi Ilya,

Yes, you're right! I just added more logs to print out split and
con->peer_features, and it's exactly as you suspected:

<7>[11347.891038] libceph: scan_requests req ffff880817fc5300 tid 215
<7>[11347.891043] libceph: calc_target t ffff880817fc5340 -> ct_res 0 osd 2 split 1 peer_features 0

> > I would have noticed this earlier, but I was misled by osdc output.
> > It shows outdated up/acting sets, which led me to assume that split
> > was 0 (because if split is 1, up/acting sets are updated _before_ the
> > con->peer_features check). However, in this case the up/acting sets
> > remained the same for a few epochs after the split, so they are not
> > actually outdated wrt the split.
> >
> > Checking con->peer_features in calc_target() is fundamentally racy...
>
> Objecter had the same bug; it was fixed by looking at the per-OSD
> features in the osd_xinfo array instead of con->peer_features:
>
> https://github.com/ceph/ceph/pull/23850
>
> I'll see if we can do the same thing here.

Again, thanks for your patience and kind help with this issue. And I'm
willing to re-verify it once a fix comes out :)

Thanks,
- Jerry

> Thanks,
>
>                 Ilya
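
P.S. Just to make sure I follow the direction for the kernel side: would a
minimal change along these lines be the idea (a rough, untested sketch on my
end, assuming we no longer need to worry about OSDs old enough to lack
RESEND_ON_SPLIT), or do you plan to mirror the Objecter approach and consult
per-OSD features from the osdmap instead of con->peer_features?

        /* sketch only: drop the racy con->peer_features check and
         * unconditionally resend when the mapping changed because of
         * a split */
        if (unpaused || legacy_change || force_resend || split)
                ct_res = CALC_TARGET_NEED_RESEND;
        else
                ct_res = CALC_TARGET_NO_ACTION;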