Re: cephfs kernel client umount gets stuck forever


 



On Wed, Aug 7, 2019 at 10:15 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Tue, Aug 6, 2019 at 1:11 PM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > After simplifying the environment, the issue can be reproduced much
> > more easily.  The IPs and infrastructure are briefly described below:
> >
> > CephFS Kernel Client (172.16.10.7, linux-4.14)
> > Node1 (172.16.10.17)
> >   - OSD0~2
> > Node2 (172.16.10.18)
> >   - OSD3~7
> > Node3 (172.16.10.19)
> >   - OSD8~12
> >
> > The client mounts the CephFS right after Node1 is set up, and every
> > time a new node joins the cluster some files are added to the CephFS.
> > The issue always happens after Node3 joins the cluster: the stuck
> > write op seems to be sent to the old pg acting set even though the
> > osdmap epoch on the client is already the same as the one on ceph.
> >
> > osdmap epoch 69
> > ##############
> > Tue Aug  6 16:04:25 CST 2019
> > epoch 69 barrier 30 flags 0x188000
> > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 64 pg_num_mask 63
> > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > osd0    172.16.10.17:6802       100%    (exists, up)    100%
> > osd1    172.16.10.17:6806       100%    (exists, up)    100%
> > osd2    172.16.10.17:6810       100%    (exists, up)    100%
> > osd3    172.16.10.18:6801       100%    (exists, up)    100%
> > osd4    172.16.10.18:6805       100%    (exists, up)    100%
> > osd5    172.16.10.18:6809       100%    (exists, up)    100%
> > osd6    172.16.10.18:6813       100%    (exists, up)    100%
> > osd7    172.16.10.18:6817       100%    (exists, up)    100%
> > osd8    (unknown sockaddr family 0)       0%    (exists)        100%
> >
> > (IO on Cephfs ...)
> >
> > osdmap epoch 103
> > ###############
> > Tue Aug  6 16:04:26 CST 2019
> > epoch 103 barrier 30 flags 0x188000
> > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> > 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > osd0    172.16.10.17:6802       100%    (exists, up)    100%
> > osd1    172.16.10.17:6806       100%    (exists, up)    100%
> > osd2    172.16.10.17:6810       100%    (exists, up)    100%
> > osd3    172.16.10.18:6801       100%    (exists, up)    100%
> > osd4    172.16.10.18:6805       100%    (exists, up)    100%
> > osd5    172.16.10.18:6809       100%    (exists, up)    100%
> > osd6    172.16.10.18:6813       100%    (exists, up)    100%
> > osd7    172.16.10.18:6817       100%    (exists, up)    100%
> > osd8    172.16.10.19:6801       100%    (exists, up)    100%
> > osd9    172.16.10.19:6805       100%    (exists, up)    100%
> > osd10   172.16.10.19:6809       100%    (exists, up)    100%
> > osd11   172.16.10.19:6813       100%    (exists, up)    100%
> > osd12   172.16.10.19:6817       100%    (exists, up)    100%
> >
> > (IO on Cephfs ...)
> >
> > osdmap epoch 103
> > ###############
> > Tue Aug  6 16:04:38 CST 2019
> > epoch 103 barrier 30 flags 0x188000
> > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> > 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > osd0    172.16.10.17:6802       100%    (exists, up)    100%
> > osd1    172.16.10.17:6806       100%    (exists, up)    100%
> > osd2    172.16.10.17:6810       100%    (exists, up)    100%
> > osd3    172.16.10.18:6801       100%    (exists, up)    100%
> > osd4    172.16.10.18:6805       100%    (exists, up)    100%
> > osd5    172.16.10.18:6809       100%    (exists, up)    100%
> > osd6    172.16.10.18:6813       100%    (exists, up)    100%
> > osd7    172.16.10.18:6817       100%    (exists, up)    100%
> > osd8    172.16.10.19:6801       100%    (exists, up)    100%
> > osd9    172.16.10.19:6805       100%    (exists, up)    100%
> > osd10   172.16.10.19:6809       100%    (exists, up)    100%
> > osd11   172.16.10.19:6813       100%    (exists, up)    100%
> > osd12   172.16.10.19:6817       100%    (exists, up)    100%
> > REQUESTS 13 homeless 0
> > 389     osd2    1.23964a4b      1.4b    [2,4,12]/2      [2,4,12]/2
> >  e103    10000000028.00000006    0x400024        1       write
> > 365     osd5    1.cde1721f      1.1f    [5,10,2]/5      [5,10,2]/5
> >  e103    10000000017.00000007    0x400024        1       write
> > 371     osd5    1.9d081620      1.20    [5,12,2]/5      [5,12,2]/5
> >  e103    10000000025.00000007    0x400024        1       write
> > 375     osd5    1.8b5def1f      1.1f    [5,10,2]/5      [5,10,2]/5
> >  e103    1000000001f.00000006    0x400024        1       write
> > 204     osd7    1.4dbcd0b2      1.32    [7,5,0]/7       [7,5,0]/7
> >  e103    10000000017.00000001    0x400024        1       write
> > 373     osd7    1.8f57faf5      1.75    [7,11,2]/7      [7,11,2]/7
> >  e103    10000000027.00000007    0x400024        1       write
> > 369     osd8    1.cec2d5dd      1.5d    [8,2,7]/8       [8,2,7]/8
> >  e103    10000000020.00000007    0x400024        1       write
> > 378     osd8    1.3853fefc      1.7c    [8,3,2]/8       [8,3,2]/8
> >  e103    1000000001c.00000006    0x400024        1       write
> > 384     osd8    1.342be187      1.7     [8,6,2]/8       [8,6,2]/8
> >  e103    1000000001b.00000006    0x400024        1       write
> > 390     osd11   1.1ac10bad      1.2d    [11,5,2]/11     [11,5,2]/11
> >  e103    10000000028.00000007    0x400024        1       write
> > 364     osd12   1.345417ca      1.4a    [12,7,2]/12     [12,7,2]/12
> >  e103    10000000017.00000006    0x400024        1       write
> > 374     osd12   1.50114f4a      1.4a    [12,7,2]/12     [12,7,2]/12
> >  e103    10000000026.00000007    0x400024        1       write
> > 381     osd12   1.d670203f      1.3f    [12,2,4]/12     [12,2,4]/12
> >  e103    10000000021.00000006    0x400024        1       write
> >
> > (IO stop ...)
> >
> > osdmap epoch 103
> > ###############
> > Tue Aug  6 16:04:39 CST 2019
> > epoch 103 barrier 30 flags 0x188000
> > pool 1 'cephfs_data' type 1 size 3 min_size 2 pg_num 128 pg_num_mask
> > 127 flags 0x1 lfor 0 read_tier -1 write_tier -1
> > pool 2 'cephfs_md' type 1 size 3 min_size 2 pg_num 32 pg_num_mask 31
> > flags 0x1 lfor 0 read_tier -1 write_tier -1
> > osd0    172.16.10.17:6802       100%    (exists, up)    100%
> > osd1    172.16.10.17:6806       100%    (exists, up)    100%
> > osd2    172.16.10.17:6810       100%    (exists, up)    100%
> > osd3    172.16.10.18:6801       100%    (exists, up)    100%
> > osd4    172.16.10.18:6805       100%    (exists, up)    100%
> > osd5    172.16.10.18:6809       100%    (exists, up)    100%
> > osd6    172.16.10.18:6813       100%    (exists, up)    100%
> > osd7    172.16.10.18:6817       100%    (exists, up)    100%
> > osd8    172.16.10.19:6801       100%    (exists, up)    100%
> > osd9    172.16.10.19:6805       100%    (exists, up)    100%
> > osd10   172.16.10.19:6809       100%    (exists, up)    100%
> > osd11   172.16.10.19:6813       100%    (exists, up)    100%
> > osd12   172.16.10.19:6817       100%    (exists, up)    100%
> > REQUESTS 1 homeless 0
> > 204     osd7    1.4dbcd0b2      1.32    [7,5,0]/7       [7,5,0]/7
> >  e103    10000000017.00000001    0x400024        1       write
> >
> >
> > Strangely, the acting set of pg 1.32 shown on ceph is [7,1,9] instead
> > of [7,5,0].
> >
> > [root@Jerry-x85-n2 ceph]# ceph pg dump | grep ^1.32
> > dumped all
> > 1.32          1                  0        0         0       0  4194304
> >   3        3 active+clean 2019-08-06 16:03:33.978990    68'3   103:98
> > [7,1,9]          7  [7,1,9]
> >
> > [root@Jerry-x85-n2 ceph]# grep -rn "replicas change" ceph-osd.7.log | grep 1.32
> > ceph-osd.7.log:1844:2019-08-06 15:59:53.276 7fa390256700 10 osd.7
> > pg_epoch: 66 pg[1.32( empty local-lis/les=64/65 n=0 ec=47/16 lis/c
> > 64/64 les/c/f 65/65/0 66/66/64) [7,5,0] r=0 lpr=66 pi=[64,66)/1
> > crt=0'0 mlcod 0'0 unknown mbc={}] [7,5,2] -> [7,5,0], replicas changed
> > ceph-osd.7.log:15330:2019-08-06 16:02:38.769 7fa390256700 10 osd.7
> > pg_epoch: 84 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=66/67 n=2
> > ec=47/16 lis/c 66/66 les/c/f 67/67/0 66/84/64) [7,5,0] r=0 lpr=84
> > pi=[66,84)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] ->
> > [7,5,0], replicas changed
> > ceph-osd.7.log:25741:2019-08-06 16:02:53.618 7fa390256700 10 osd.7
> > pg_epoch: 90 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=84/85 n=1
> > ec=47/16 lis/c 84/84 les/c/f 85/85/0 90/90/64) [7,1,8] r=0 lpr=90
> > pi=[84,90)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,5,0] ->
> > [7,1,8], replicas changed
> > ceph-osd.7.log:37917:2019-08-06 16:03:17.932 7fa390256700 10 osd.7
> > pg_epoch: 100 pg[1.32( v 68'3 (0'0,68'3] local-lis/les=90/91 n=1
> > ec=47/16 lis/c 90/90 les/c/f 91/97/0 100/100/64) [7,1,9] r=0 lpr=100
> > pi=[90,100)/1 crt=68'3 lcod 68'2 mlcod 0'0 unknown mbc={}] [7,1,8] ->
> > [7,1,9], replicas changed
> >
> > Related logs with debug_osd=10 and debug_ms=1 are provided at
> > https://drive.google.com/open?id=1gYksDbCecisWtP05HEoSxevDK8sywKv6 .
> > Currently, I am tracing the code to figure out the root cause.  Any
> > ideas or insights would be appreciated, thanks!
>
> Hi Jerry,
>
> So the original request was dropped by osd.7 because the PG got split:
>
> 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 dequeue_op
> 0x561adf90f500 prio 127 cost 4194304 latency 0.015565
> osd_op(client.4418.0:204 1.32 1.4dbcd0b2 (undecoded)
> ondisk+write+known_if_redirected e69) v8 pg pg[1.32( v 68'3 (0'0,68'3]
> local-lis/les=100/101 n=1 ec=47/16 lis/c 100/100 les/c/f 101/102/0
> 100/100/64) [7,1,9] r=0 lpr=100 crt=68'3 lcod 68'2 mlcod 0'0
> active+clean]
> 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 client.4418 has old
> map 69 < 103
> 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 send_incremental_map
> 69 -> 103 to 0x561ae4cfb600 172.16.10.7:0/3068540720
> 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 80 220 bytes
> 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 72 240 bytes
> 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 71 240 bytes
> 2019-08-06 16:04:25.747 7fa390256700 10 osd.7 103 add_map_inc_bl 70 240 bytes
> 2019-08-06 16:04:25.747 7fa390256700  7 osd.7 pg_epoch: 103 pg[1.32( v
> 68'3 (0'0,68'3] local-lis/les=100/101 n=1 ec=47/16 lis/c 100/100
> les/c/f 101/102/0 100/100/64) [7,1,9] r=0 lpr=100 crt=68'3 lcod 68'2
> mlcod 0'0 active+clean] can_discard_op pg split in 84, dropping
> 2019-08-06 16:04:25.748 7fa390256700 10 osd.7 103 dequeue_op
> 0x561adf90f500 finish
>
> The request was never resent -- this appears to be a kernel client bug.
>
> It would be great if you could reproduce with
>
> echo 'file osd_client +p' > /sys/kernel/debug/dynamic_debug/control
>
> on the kernel client node, and capture the same set of logs on the
> OSD side.
>
> Looking...

Nothing so far.  The split detection logic appears to work.
There were 11 other requests dropped by the OSDs due to splits, and they
were all handled correctly (a few of them are sanity-checked with a short
sketch after the list):

Split in epoch 47:

orig  osd2 osd_op(client.4418.0:5 1.f 1.b1e464af (undecoded)
ondisk+write+known_if_redirected e35)
retry osd0 osd_op(client.4418.0:5 1.2f 1.b1e464af (undecoded)
ondisk+retry+write+known_if_redirected e68)

orig  osd0 osd_op(client.4418.0:13 1.10 1.fec32590 (undecoded)
ondisk+write+known_if_redirected e35)
retry osd0 osd_op(client.4418.0:13 1.10 1.fec32590 (undecoded)
ondisk+retry+write+known_if_redirected e68)

Split in epoch 84:

orig  osd5 osd_op(client.4418.0:194 1.20 1.260bc0e0 (undecoded)
ondisk+write+known_if_redirected e69)
retry osd0 osd_op(client.4418.0:194 1.60 1.260bc0e0 (undecoded)
ondisk+retry+write+known_if_redirected e103)

orig  osd3 osd_op(client.4418.0:195 1.33 1.9b583df3 (undecoded)
ondisk+write+known_if_redirected e69)
retry osd1 osd_op(client.4418.0:195 1.73 1.9b583df3 (undecoded)
ondisk+retry+write+known_if_redirected e103)

orig  osd1 osd_op(client.4418.0:197 1.1a 1.a986dc1a (undecoded)
ondisk+write+known_if_redirected e69)
retry osd1 osd_op(client.4418.0:197 1.1a 1.a986dc1a (undecoded)
ondisk+retry+write+known_if_redirected e103)

orig  osd2 osd_op(client.4418.0:198 1.14 1.b93e34d4 (undecoded)
ondisk+write+known_if_redirected e69)
retry osd4 osd_op(client.4418.0:198 1.54 1.b93e34d4 (undecoded)
ondisk+retry+write+known_if_redirected e103)

orig  osd1 osd_op(client.4418.0:199 1.1a 1.92ca525a (undecoded)
ondisk+write+known_if_redirected e69)
retry osd11 osd_op(client.4418.0:199 1.5a 1.92ca525a (undecoded)
ondisk+retry+write+known_if_redirected e103)

orig  osd0 osd_op(client.4418.0:202 1.3b 1.6388bdbb (undecoded)
ondisk+write+known_if_redirected e69)
retry osd0 osd_op(client.4418.0:202 1.3b 1.6388bdbb (undecoded)
ondisk+retry+write+known_if_redirected e103)

orig  osd3 osd_op(client.4418.0:203 1.1e 1.fe5731e (undecoded)
ondisk+write+known_if_redirected e69)
retry osd3 osd_op(client.4418.0:203 1.1e 1.fe5731e (undecoded)
ondisk+retry+write+known_if_redirected e103)

orig  osd7 osd_op(client.4418.0:204 1.32 1.4dbcd0b2 (undecoded)
ondisk+write+known_if_redirected e69)
retry ???

orig  osd0 osd_op(client.4418.0:205 1.10 1.42c29750 (undecoded)
ondisk+write+known_if_redirected e69)
retry osd0 osd_op(client.4418.0:205 1.50 1.42c29750 (undecoded)
ondisk+retry+write+known_if_redirected e103)

orig  osd3 osd_op(client.4418.0:207 1.33 1.918809b3 (undecoded)
ondisk+write+known_if_redirected e69)
retry osd3 osd_op(client.4418.0:207 1.33 1.918809b3 (undecoded)
ondisk+retry+write+known_if_redirected e103)
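
For reference, those old -> new PG ids follow directly from the
stable-mod placement used on both the client and the OSDs.  Below is a
minimal user-space sketch, assuming the ceph_stable_mod() formula from
the shared rados headers; the object hashes and pg_num values are taken
from the osdc dump above.  It reproduces the epoch 84 transitions,
including tid 204 staying in the parent pg 1.32:

#include <stdio.h>

/* Same formula as ceph_stable_mod() in the shared rados headers. */
static unsigned int stable_mod(unsigned int x, unsigned int b,
			       unsigned int bmask)
{
	if ((x & bmask) < b)
		return x & bmask;
	else
		return x & (bmask >> 1);
}

int main(void)
{
	/* raw object hashes from the osdc dump above */
	const unsigned int hashes[] = { 0x260bc0e0, 0x9b583df3, 0x4dbcd0b2 };
	const char *tids[] = { "194", "195", "204" };

	for (int i = 0; i < 3; i++) {
		/* pool 1: pg_num 64 (mask 63) before the epoch 84 split,
		 * pg_num 128 (mask 127) after it */
		unsigned int old_seed = stable_mod(hashes[i], 64, 63);
		unsigned int new_seed = stable_mod(hashes[i], 128, 127);

		printf("tid %s: 1.%x -> 1.%x\n", tids[i], old_seed, new_seed);
	}
	return 0;
}

This prints 1.20 -> 1.60 for tid 194, 1.33 -> 1.73 for tid 195, and
1.32 -> 1.32 for tid 204, matching the pairs listed above.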

This shows that both cases, staying in the old PG and moving to the
new PG, are handled correctly, as are staying on the same OSD and
moving to a new OSD.  The request in question (tid 204) is on the
right list, as indicated by e103 in the osdc output, but it looks like
something went wrong inside calc_target().
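
To make the expected behaviour concrete, here is a rough sketch of the
decision the client has to reach for each in-flight request when a new
map arrives (illustrative only, not the libceph code; the struct and
function names are made up): even when the object stays in the parent
PG and the acting primary does not change, a split of the parent has to
force a resend, because the OSD drops ops sent with a pre-split map
epoch.

#include <stdbool.h>
#include <stdio.h>

struct req_target {
	unsigned int pg_seed;   /* e.g. 0x32 for pg 1.32 */
	int          primary;   /* acting primary, e.g. osd7 */
};

/* Illustrative only: not the libceph implementation. */
static bool need_resend(const struct req_target *old_t,
			const struct req_target *new_t,
			bool parent_pg_split)
{
	if (old_t->pg_seed != new_t->pg_seed)
		return true;    /* object moved to a child PG */
	if (parent_pg_split)
		return true;    /* stayed in the parent, but the op sent
				 * with the pre-split epoch was dropped */
	if (old_t->primary != new_t->primary)
		return true;    /* acting primary changed */
	return false;
}

int main(void)
{
	/* tid 204: pg 1.32, primary osd7 both before and after epoch 84,
	 * but pg 1.32 was split (pg_num 64 -> 128), so a resend is due. */
	struct req_target before = { 0x32, 7 };
	struct req_target after  = { 0x32, 7 };

	printf("resend tid 204: %s\n",
	       need_resend(&before, &after, true) ? "yes" : "no");
	return 0;
}

For tid 204 that split case is exactly what should have triggered a
resend, which is why calc_target() is the prime suspect.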

What is the exact kernel version you are running?  I know you said
4.14, but you also mentioned some customizations...

Thanks,

                Ilya


