Re: cephfs kernel client instability

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



An update on our cephfs kernel client troubles.  After doing some heavier testing with a newer kernel 4.19.13, it seems like it also gets into a bad state when it can't connect to monitors (all back end processes are on 12.2.8):

Jan 15 08:49:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session established
Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 session lost, hunting for new mon
Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session established
Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session lost, hunting for new mon
Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session established
Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session lost, hunting for new mon
Jan 15 08:49:06 mon5 kernel: libceph: mon2 10.128.150.12:6789 session established
Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 session lost, hunting for new mon
... repeating forever ...

# uname -r
4.19.13

and on the mon node (10.128.150.10) at log level 20, I see that it is building/encoding a lot of maps (10.128.36.18 is the client in question):
2019-01-15 08:49:01.355017 7fffee40c700 10 mon.cephmon00@0(leader) e40 _ms_dispatch new session 0x555562dc6c00 MonSession(client.36800361 10.128.36.18:0/2911716500 is open , features 0x27018fb86aa42ada (jewel)) features 0x27018fb86aa42ada
2019-01-15 08:49:01.355021 7fffee40c700 20 mon.cephmon00@0(leader) e40  caps
2019-01-15 08:49:01.355026 7fffee40c700 10 mon.cephmon00@0(leader).auth v58457 preprocess_query auth(proto 0 34 bytes epoch 0) from client.36800361 10.128.36.18:0/2911716500
<snip>
2019-01-15 08:49:01.355817 7fffee40c700 10 mon.cephmon00@0(leader).osd e1254390 check_osdmap_sub 0x555565373340 next 1254102 (onetime)
2019-01-15 08:49:01.355819 7fffee40c700  5 mon.cephmon00@0(leader).osd e1254390 send_incremental [1254102..1254390] to client.36800361 10.128.36.18:0/2911716500
2019-01-15 08:49:01.355821 7fffee40c700 10 mon.cephmon00@0(leader).osd e1254390 build_incremental [1254102..1254141] with features 27018fb86aa42ada
2019-01-15 08:49:01.364859 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 reencode_incremental_map 1254141 with features 504412504116439552
2019-01-15 08:49:01.372131 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 build_incremental    inc 1254141 1237271 bytes
2019-01-15 08:49:01.372180 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 reencode_incremental_map 1254140 with features 504412504116439552
2019-01-15 08:49:01.372187 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 build_incremental    inc 1254140 260 bytes
2019-01-15 08:49:01.380981 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 reencode_incremental_map 1254139 with features 504412504116439552
2019-01-15 08:49:01.387983 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 build_incremental    inc 1254139 1237351 bytes
2019-01-15 08:49:01.388043 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 reencode_incremental_map 1254138 with features 504412504116439552
2019-01-15 08:49:01.388049 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 build_incremental    inc 1254138 232 bytes
2019-01-15 08:49:01.396781 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 reencode_incremental_map 1254137 with features 504412504116439552
<snip> ... a lot more of similar messages
2019-01-15 08:49:04.210936 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 reencode_incremental_map 1254382 with features 504412504116439552
2019-01-15 08:49:04.211032 7fffee40c700 20 mon.cephmon00@0(leader).osd e1254390 build_incremental    inc 1254382 232 bytes
2019-01-15 08:49:04.211066 7fffee40c700 10 mon.cephmon00@0(leader) e40 ms_handle_reset 0x55556450f800 10.128.36.18:0/2911716500
2019-01-15 08:49:04.211070 7fffee40c700 10 mon.cephmon00@0(leader) e40 reset/close on session client.36800361 10.128.36.18:0/2911716500
2019-01-15 08:49:04.211073 7fffee40c700 10 mon.cephmon00@0(leader) e40 remove_session 0x555562dc6c00 client.36800361 10.128.36.18:0/2911716500 features 0x27018fb86aa42ada

looks like the client disconnects (either for waiting too long, or for some protocol error?).  Any hints on why so many maps need to be reencoded (to jewel), or how to improve this behavior would be much appreciated.  We would really be interested in using the kernel client instead of fuse, but this seems to be a stumbling block.

Thanks,

Andras


On 1/3/19 6:49 AM, Andras Pataki wrote:
I wonder if anyone could offer any insight on the issue below, regarding the CentOS 7.6 kernel cephfs client connecting to a Luminous cluster.  I have since tried a much newer 4.19.13 kernel, which did not show the same issue (but unfortunately for various reasons unrelated to ceph, we can't go to such a new kernel).

Am I reading it right that somehow the monitor thinks this kernel is old and needs to prepare special maps in some older format for it, and that takes too long and the kernel just gives up, or perhaps has some other communication protocol error?  It seems like one of these mon communication sessions only lasts half a second.  Then it reconnects to another mon, and gets the same result, etc.  Any way around this?

Andras


On 12/26/18 7:55 PM, Andras Pataki wrote:
We've been using ceph-fuse with a pretty good stability record (against the Luminous 12.2.8 back end).  Unfortunately ceph-fuse has extremely poor small file performance (understandably), so we've been testing the kernel client.  The latest RedHat kernel 3.10.0-957.1.3.el7.x86_64 seems to work pretty well, as long as the cluster is running in a completely clean state.  However, it seems that as soon as there is something happening to the cluster, the kernel client crashes pretty badly.

Today's example: I've reweighted some OSDs to balance the disk usage a bit (set nobackfill, reweight the OSDs, check the new hypothetical space usage, then unset nobackfill).   As soon as the reweighting procedure started, the kernel client went into an infinite loop trying to unsuccessfully connect to mons:

Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 session lost, hunting for new mon
Dec 26 19:28:53 mon5 kernel: libceph: mon2 10.128.150.12:6789 session established
Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 session lost, hunting for new mon
Dec 26 19:28:58 mon5 kernel: libceph: mon1 10.128.150.11:6789 session established
Dec 26 19:28:59 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Dec 26 19:28:59 mon5 kernel: libceph: mon1 10.128.150.11:6789 session lost, hunting for new mon
Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 session established
Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 session lost, hunting for new mon
Dec 26 19:28:59 mon5 kernel: libceph: mon0 10.128.150.10:6789 session established
Dec 26 19:29:00 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:29:00 mon5 kernel: libceph: mon0 10.128.150.10:6789 session lost, hunting for new mon
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session established
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session lost, hunting for new mon
Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session established
Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session lost, hunting for new mon
Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session established
Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 session lost, hunting for new mon
Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session established
Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session lost, hunting for new mon
Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 session established
Dec 26 19:29:02 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:29:02 mon5 kernel: libceph: mon2 10.128.150.12:6789 session lost, hunting for new mon
Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 session established
Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 session lost, hunting for new mon
... etc ...

seemingly never recovering.  The cluster is healthy, all other clients are successfully doing I/O:
[root@cephmon00 ceph]# ceph -s
  cluster:
    id:     d7b33135-0940-4e48-8aa6-1d2026597c2f
    health: HEALTH_WARN
            noout flag(s) set
            1 backfillfull osd(s)
            4 pool(s) backfillfull
            239119058/12419244975 objects misplaced (1.925%)
 
  services:
    mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
    mgr: cephmon00(active)
    mds: cephfs-1/1/1 up  {0=cephmds00=up:active}, 1 up:standby
    osd: 3534 osds: 3534 up, 3534 in; 5040 remapped pgs
         flags noout
 
  data:
    pools:   5 pools, 50688 pgs
    objects: 2.51G objects, 7.77PiB
    usage:   19.1PiB used, 5.90PiB / 25.0PiB avail
    pgs:     239119058/12419244975 objects misplaced (1.925%)
             45639 active+clean
             2914  active+remapped+backfilling
             2126  active+remapped+backfill_wait
             9     active+clean+scrubbing+deep
 
  io:
    client:   10.3MiB/s rd, 2.23GiB/s wr, 32op/s rd, 3.77kop/s wr
    recovery: 114GiB/s, 33.12kobjects/s
 
The client machine in question is otherwise healthy also (not out of memory, etc.).  I've checked the osd blacklist, nothing on that.  Nothing suspicious in the mon logs regarding this client either.  Setting one of the mons to log level 20, the relevant section when this client tries to connect:

2018-12-26 19:51:46.203264 7fffeb406700 10 mon.cephmon00@0(leader) e40 ms_verify_authorizer 10.128.36.18:0/3882984371 client protocol 0
2018-12-26 19:51:46.206645 7fffee40c700 10 mon.cephmon00@0(leader) e40 _ms_dispatch new session 0x5555722bb980 MonSession(client.36398604 10.128.36.18:0/3882984371 is open , features 0x27018fb86aa42ada (jewel)) features 0x27018fb86aa42ada
2018-12-26 19:51:46.206651 7fffee40c700 20 mon.cephmon00@0(leader) e40  caps
2018-12-26 19:51:46.206657 7fffee40c700 10 mon.cephmon00@0(leader).auth v57602 preprocess_query auth(proto 0 34 bytes epoch 0) from client.36398604 10.128.36.18:0/3882984371
2018-12-26 19:51:46.206660 7fffee40c700 10 mon.cephmon00@0(leader).auth v57602 prep_auth() blob_size=34
2018-12-26 19:51:46.206692 7fffee40c700  2 mon.cephmon00@0(leader) e40 send_reply 0x555562fee5a0 0x55556f398b40 auth_reply(proto 2 0 (0) Success) v1
2018-12-26 19:51:46.206867 7fffee40c700 20 mon.cephmon00@0(leader) e40 _ms_dispatch existing session 0x5555722bb980 for client.36398604 10.128.36.18:0/3882984371
2018-12-26 19:51:46.206870 7fffee40c700 20 mon.cephmon00@0(leader) e40  caps
2018-12-26 19:51:46.206877 7fffee40c700 10 mon.cephmon00@0(leader).auth v57602 preprocess_query auth(proto 2 128 bytes epoch 0) from client.36398604 10.128.36.18:0/3882984371
2018-12-26 19:51:46.206882 7fffee40c700 10 mon.cephmon00@0(leader).auth v57602 prep_auth() blob_size=128
2018-12-26 19:51:46.206979 7fffee40c700  2 mon.cephmon00@0(leader) e40 send_reply 0x555562fee5a0 0x55571f9c4d80 auth_reply(proto 2 0 (0) Success) v1
2018-12-26 19:51:46.207178 7fffee40c700 20 mon.cephmon00@0(leader) e40 _ms_dispatch existing session 0x5555722bb980 for client.36398604 10.128.36.18:0/3882984371
2018-12-26 19:51:46.207182 7fffee40c700 20 mon.cephmon00@0(leader) e40  caps allow *
2018-12-26 19:51:46.207184 7fffee40c700 20 is_capable service=mon command= read on cap allow *
2018-12-26 19:51:46.207185 7fffee40c700 20  allow so far , doing grant allow *
2018-12-26 19:51:46.207186 7fffee40c700 20  allow all
2018-12-26 19:51:46.207188 7fffee40c700 10 mon.cephmon00@0(leader) e40 handle_subscribe mon_subscribe({mdsmap=72103+,monmap=41+,osdmap=1207891}) v2
2018-12-26 19:51:46.207194 7fffee40c700 10 mon.cephmon00@0(leader) e40 handle_subscribe: MDS sub 'mdsmap'
2018-12-26 19:51:46.207195 7fffee40c700 20 is_capable service=mds command= read on cap allow *
2018-12-26 19:51:46.207197 7fffee40c700 20  allow so far , doing grant allow *
2018-12-26 19:51:46.207198 7fffee40c700 20  allow all
2018-12-26 19:51:46.207200 7fffee40c700 20 mon.cephmon00@0(leader).mds e72102 check_sub: mdsmap
2018-12-26 19:51:46.207204 7fffee40c700 10 mon.cephmon00@0(leader).monmap v40 check_sub monmap next 41 have 40
2018-12-26 19:51:46.207207 7fffee40c700 20 is_capable service=osd command= read on cap allow *
2018-12-26 19:51:46.207208 7fffee40c700 20  allow so far , doing grant allow *
2018-12-26 19:51:46.207209 7fffee40c700 20  allow all
2018-12-26 19:51:46.207211 7fffee40c700 10 mon.cephmon00@0(leader).osd e1208017 check_osdmap_sub 0x55592f513d00 next 1207891 (onetime)
2018-12-26 19:51:46.207213 7fffee40c700  5 mon.cephmon00@0(leader).osd e1208017 send_incremental [1207891..1208017] to client.36398604 10.128.36.18:0/3882984371
2018-12-26 19:51:46.207217 7fffee40c700 10 mon.cephmon00@0(leader).osd e1208017 build_incremental [1207891..1207930] with features 27018fb86aa42ada
2018-12-26 19:51:46.220019 7fffee40c700 20 mon.cephmon00@0(leader).osd e1208017 reencode_incremental_map 1207930 with features 504412504116439552
2018-12-26 19:51:46.230217 7fffee40c700 20 mon.cephmon00@0(leader).osd e1208017 build_incremental    inc 1207930 1146701 bytes
2018-12-26 19:51:46.230349 7fffee40c700 20 mon.cephmon00@0(leader).osd e1208017 reencode_incremental_map 1207929 with features 504412504116439552
2018-12-26 19:51:46.232523 7fffee40c700 20 mon.cephmon00@0(leader).osd e1208017 build_incremental    inc 1207929 175613 bytes
... a lot more of reencode_incremental stuff ...
2018-12-26 19:51:46.745394 7fffee40c700 10 mon.cephmon00@0(leader) e40 ms_handle_reset 0x5555637cf800 10.128.36.18:0/3882984371
2018-12-26 19:51:46.745395 7ffff0c11700 10 mon.cephmon00@0(leader).log v79246823 encode_full log v 79246823
2018-12-26 19:51:46.745469 7ffff0c11700 10 mon.cephmon00@0(leader).log v79246823 encode_pending v79246824
2018-12-26 19:51:46.745763 7fffee40c700 10 mon.cephmon00@0(leader) e40 reset/close on session client.36398604 10.128.36.18:0/3882984371
2018-12-26 19:51:46.745769 7fffee40c700 10 mon.cephmon00@0(leader) e40 remove_session 0x5555722bb980 client.36398604 10.128.36.18:0/3882984371 features 0x27018fb86aa42ada

Any pointers to what to do here?

Andras

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux