Re: cephfs kernel client umount gets stuck forever

On Tue, Aug 6, 2019 at 1:11 PM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
>
> Hi,
>
> After simplifying the environment, the issue can be reproduced much
> more easily.  The IPs and infrastructure are briefly described below:
>
> CephFS Kernel Client (172.16.10.7, linux-4.14)
> Node1 (172.16.10.17)
>   - OSD0~2
> Node2 (172.16.10.18)
>   - OSD3~7
> Node3 (172.16.10.19)
>   - OSD8~12
>
> The client mounts the CephFS right after Node1 is set up, and every
> time a new node joins the cluster, some files are added to the CephFS.
> The issue always happens after Node3 joins the cluster: the stuck
> write op appears to be sent to the old pg acting set even though the
> osdmap epoch on the client is already the same as the one on ceph.
>
> osdmap epoch 69
> ##############
> Tue Aug  6 16:04:25 CST 2019
> epoch 69 barrier 30 flags 0x188000
> pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 64 pg_num_mask 63
> flags 0x1 lfor 0 read_tier -1 write_tier -1
> pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> flags 0x1 lfor 0 read_tier -1 write_tier -1
> osd0    172.16.10.17:6802       100%    (exists, up)    100%
> osd1    172.16.10.17:6806       100%    (exists, up)    100%
> osd2    172.16.10.17:6810       100%    (exists, up)    100%
> osd3    172.16.10.18:6801       100%    (exists, up)    100%
> osd4    172.16.10.18:6805       100%    (exists, up)    100%
> osd5    172.16.10.18:6809       100%    (exists, up)    100%
> osd6    172.16.10.18:6813       100%    (exists, up)    100%
> osd7    172.16.10.18:6817       100%    (exists, up)    100%
> osd8    (unknown sockaddr family 0)       0%    (exists)        100%
>
> (IO on Cephfs ...)
>
> osdmap epoch 103
> ###############
> Tue Aug  6 16:04:26 CST 2019
> epoch 103 barrier 30 flags 0x188000
> pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> flags 0x1 lfor 0 read_tier -1 write_tier -1
> osd0    172.16.10.17:6802       100%    (exists, up)    100%
> osd1    172.16.10.17:6806       100%    (exists, up)    100%
> osd2    172.16.10.17:6810       100%    (exists, up)    100%
> osd3    172.16.10.18:6801       100%    (exists, up)    100%
> osd4    172.16.10.18:6805       100%    (exists, up)    100%
> osd5    172.16.10.18:6809       100%    (exists, up)    100%
> osd6    172.16.10.18:6813       100%    (exists, up)    100%
> osd7    172.16.10.18:6817       100%    (exists, up)    100%
> osd8    172.16.10.19:6801       100%    (exists, up)    100%
> osd9    172.16.10.19:6805       100%    (exists, up)    100%
> osd10   172.16.10.19:6809       100%    (exists, up)    100%
> osd11   172.16.10.19:6813       100%    (exists, up)    100%
> osd12   172.16.10.19:6817       100%    (exists, up)    100%
>
> (IO on Cephfs ...)
>
> osdmap epoch 103
> ###############
> Tue Aug  6 16:04:38 CST 2019
> epoch 103 barrier 30 flags 0x188000
> pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> flags 0x1 lfor 0 read_tier -1 write_tier -1
> osd0    172.16.10.17:6802       100%    (exists, up)    100%
> osd1    172.16.10.17:6806       100%    (exists, up)    100%
> osd2    172.16.10.17:6810       100%    (exists, up)    100%
> osd3    172.16.10.18:6801       100%    (exists, up)    100%
> osd4    172.16.10.18:6805       100%    (exists, up)    100%
> osd5    172.16.10.18:6809       100%    (exists, up)    100%
> osd6    172.16.10.18:6813       100%    (exists, up)    100%
> osd7    172.16.10.18:6817       100%    (exists, up)    100%
> osd8    172.16.10.19:6801       100%    (exists, up)    100%
> osd9    172.16.10.19:6805       100%    (exists, up)    100%
> osd10   172.16.10.19:6809       100%    (exists, up)    100%
> osd11   172.16.10.19:6813       100%    (exists, up)    100%
> osd12   172.16.10.19:6817       100%    (exists, up)    100%
> REQUESTS 13 homeless 0
> 389     osd2    1.23964a4b      1.4b    [2,4,12]/2      [2,4,12]/2
>  e103    10000000028.00000006    0x400024        1       write
> 365     osd5    1.cde1721f      1.1f    [5,10,2]/5      [5,10,2]/5
>  e103    10000000017.00000007    0x400024        1       write
> 371     osd5    1.9d081620      1.20    [5,12,2]/5      [5,12,2]/5
>  e103    10000000025.00000007    0x400024        1       write
> 375     osd5    1.8b5def1f      1.1f    [5,10,2]/5      [5,10,2]/5
>  e103    1000000001f.00000006    0x400024        1       write
> 204     osd7    1.4dbcd0b2      1.32    [7,5,0]/7       [7,5,0]/7
>  e103    10000000017.00000001    0x400024        1       write
> 373     osd7    1.8f57faf5      1.75    [7,11,2]/7      [7,11,2]/7
>  e103    10000000027.00000007    0x400024        1       write
> 369     osd8    1.cec2d5dd      1.5d    [8,2,7]/8       [8,2,7]/8
>  e103    10000000020.00000007    0x400024        1       write
> 378     osd8    1.3853fefc      1.7c    [8,3,2]/8       [8,3,2]/8
>  e103    1000000001c.00000006    0x400024        1       write
> 384     osd8    1.342be187      1.7     [8,6,2]/8       [8,6,2]/8
>  e103    1000000001b.00000006    0x400024        1       write
> 390     osd11   1.1ac10bad      1.2d    [11,5,2]/11     [11,5,2]/11
>  e103    10000000028.00000007    0x400024        1       write
> 364     osd12   1.345417ca      1.4a    [12,7,2]/12     [12,7,2]/12
>  e103    10000000017.00000006    0x400024        1       write
> 374     osd12   1.50114f4a      1.4a    [12,7,2]/12     [12,7,2]/12
>  e103    10000000026.00000007    0x400024        1       write
> 381     osd12   1.d670203f      1.3f    [12,2,4]/12     [12,2,4]/12
>  e103    10000000021.00000006    0x400024        1       write
>
> (IO stop ...)
>
> osdmap epoch 103
> ###############
> Tue Aug  6 16:04:39 CST 2019
> epoch 103 barrier 30 flags 0x188000
> pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> flags 0x1 lfor 0 read_tier -1 write_tier -1
> osd0    172.16.10.17:6802       100%    (exists, up)    100%
> osd1    172.16.10.17:6806       100%    (exists, up)    100%
> osd2    172.16.10.17:6810       100%    (exists, up)    100%
> osd3    172.16.10.18:6801       100%    (exists, up)    100%
> osd4    172.16.10.18:6805       100%    (exists, up)    100%
> osd5    172.16.10.18:6809       100%    (exists, up)    100%
> osd6    172.16.10.18:6813       100%    (exists, up)    100%
> osd7    172.16.10.18:6817       100%    (exists, up)    100%
> osd8    172.16.10.19:6801       100%    (exists, up)    100%
> osd9    172.16.10.19:6805       100%    (exists, up)    100%
> osd10   172.16.10.19:6809       100%    (exists, up)    100%
> osd11   172.16.10.19:6813       100%    (exists, up)    100%
> osd12   172.16.10.19:6817       100%    (exists, up)    100%
> REQUESTS 1 homeless 0
> 204     osd7    1.4dbcd0b2      1.32    [7,5,0]/7       [7,5,0]/7
>  e103    10000000017.00000001    0x400024        1       write
>
>
> Strangely, the acting set of pg 1.32 shown on ceph is [7,1,9] instead
> of [7,5,0].
>
> [root@Jerry-x85-n2 ceph]# ceph pg dump | grep ^1.32
> dumped all
> 1.32          1                  0        0         0       0  4194304
>   3        3 active+clean 2019-08-06 16:03:33.978990    68'3   103:98
> [7,1,9]          7  [7,1,9]
>
> [root@Jerry-x85-n2 ceph]# grep -rn "replicas change" ceph-osd.7.log | grep 1.32
> ceph-osd.7.log:1844:2019-08-06 15:59:53.276 7fa390256700 10 osd.7
> pg_epoch: 66 pg[1.32( empty local-lis/les=64/65 n=0 ec=47/16 lis/c
> 64/64 les/c/f 65/65/0 66/66/64) [7,5,0] r=0 lpr=66 pi=[64,66)/1
> crt=0'0 mlcod 0'0 unknown mbc={}] [7,5,2] -> [7,5,0], replicas changed
> ceph-osd.7.log:15330:2019-08-06 16:02:38.769 7fa390256700 10 osd.7
> pg_epoch: 84 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=66/67 n=2
> ec=47/16 lis/c 66/66 les/c/f 67/67/0 66/84/64) [7,5,0] r=0 lpr=84
> pi=[66,84)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] ->
> [7,5,0], replicas changed
> ceph-osd.7.log:25741:2019-08-06 16:02:53.618 7fa390256700 10 osd.7
> pg_epoch: 90 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=84/85 n=1
> ec=47/16 lis/c 84/84 les/c/f 85/85/0 90/90/64) [7,1,8] r=0 lpr=90
> pi=[84,90)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] ->
> [7,1,8], replicas changed
> ceph-osd.7.log:37917:2019-08-06 16:03:17.932 7fa390256700 10 osd.7
> pg_epoch: 100 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=90/91 n=1
> ec=47/16 lis/c 90/90 les/c/f 91/97/0 100/100/64) [7,1,9] r=0 lpr=100
> pi=[90,100)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,1,8] ->
> [7,1,9], replicas changed
>
> Related logs with debug_osd=10 and debug_ms=1 are provided at
> https://drive.google.com/open?id=1gYksDbCecisWtP05HEoSxevDK8sywKv6 .
> Currently, I am tracing the code to figure out the root cause.  Any
> ideas or insights would be appreciated, thanks!

Hi Jerry,

So the original request was dropped by osd.7 because the PG got split:

2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 dequeue_op
0x561adf90f500 prio 127 cost 4194304 latency 0.015565
osd_op(client.4418.0:204 1.32 1.4dbcd0b2 (undecoded)
ondisk+write+known_if_redirected e69) v8 pg pg[1.32( v 68'3 (0'0,68'3]
local-lis/les=100/101 n=1 ec=47/16 lis/c 100/100 les/c/f 101/102/0
100/100/64) [7,1,9] r=0 lpr=100 crt=68'3 lcod 68'2 mlcod 0'0
active+clean]
2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 client.4418 has old
map 69 < 103
2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 send_incremental_map
69 -> 103 to 0x561ae4cfb600 172.16.10.7:0/3068540720
2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 80 220 bytes
2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 72 240 bytes
2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 71 240 bytes
2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 70 240 bytes
2019-08-06 16:04:25.747 7fa390256700  7 osd.7 pg_epoch: 103 pg[1.32( v
68'3 (0'0,68'3] local-lis/les=100/101 n=1 ec=47/16 lis/c 100/100
les/c/f 101/102/0 100/100/64) [7,1,9] r=0 lpr=100 crt=68'3 lcod 68'2
mlcod 0'0 active+clean] can_discard_op pg split in 84, dropping
2019-08-06 16:04:25.748 7fa390256700 10 osd.7 103 dequeue_op
0x561adf90f500 finish

The request was never resent -- this appears to be a kernel client bug.
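
To put some numbers on that: pool 1's pg_num went from 64 (pg_num_mask
63) at epoch 69 to 128 (pg_num_mask 127) by epoch 103, so 1.32 itself
was split along the way (in epoch 84, per the log above).  A quick
back-of-the-envelope check of the placement math with the values from
this thread -- shell arithmetic only, not the actual client code, and
the plain masking is exact here only because pg_num is a power of two:

# raw hash of the stuck object, from "1.4dbcd0b2" in the osdc dump above
printf '1.%x\n' $(( 0x4dbcd0b2 & 63 ))    # old map, pg_num 64  -> 1.32
printf '1.%x\n' $(( 0x4dbcd0b2 & 127 ))   # new map, pg_num 128 -> 1.32
printf '1.%x\n' $(( 0x32 + 64 ))          # 1.32's split child  -> 1.72

The object happens to land in 1.32 under both maps, but since the PG
split after the client's epoch 69, osd.7 can't assume that and drops
the op, expecting the client to re-target and resend once it has the
newer map -- which is the resend that never happened.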

It would be great if you could reproduce with

echo 'file osd_client.c +p' > /sys/kernel/debug/dynamic_debug/control

enabled on the kernel client node, and gather the same set of logs on the OSD side.
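
In case it helps, a rough capture sequence on the client -- assuming a
kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted, and with
the file names below being just examples:

echo 'file osd_client.c +p' > /sys/kernel/debug/dynamic_debug/control
# reproduce the hang; the libceph debug output goes to the kernel ring buffer
dmesg > client-osd_client.log
# in-flight requests and the client's osdmap view, like the dumps above
cat /sys/kernel/debug/ceph/*/osdc > client-osdc.txt
echo 'file osd_client.c -p' > /sys/kernel/debug/dynamic_debug/control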

Looking...

Thanks,

                Ilya


