MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

Hi, list people. I was asking a few of these questions on IRC, too, but figured a wider audience might spot something I'm missing.

I'm running a four-node cluster with cephfs and the kernel-mode driver as the primary access method. Each node has 72 * 10TB OSDs, for a total of 288 OSDs, and about 256GB of memory. These systems are dedicated to the ceph service and run no other workloads. Every machine participates as MON, MGR, and MDS. Data is stored with 2x replication. The data are files of varying sizes, most under 5MB. Files are named with their SHA256 hash and are divided into subdirectories based on the first few octets of the hash (example: files/a1/a13f/a13f25....). The current set of files occupies about 100TB (200TB accounting for replication).
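For reference, the sharding scheme looks roughly like this sketch (Python; the function name and the fixed two-level split are just my illustration of the layout described above, not our actual ingest code). If every prefix ends up populated, a two-level split like this is on the order of 256 * 256 = 65,536 leaf directories, which I suspect is part of why rejoin has so much directory metadata to page in.

import hashlib
from pathlib import Path

def shard_path(root: str, payload: bytes) -> Path:
    """Illustrative only: name a blob by its SHA256 and shard it under the
    first one and two octets of the hash, e.g. files/a1/a13f/a13f25..."""
    digest = hashlib.sha256(payload).hexdigest()
    return Path(root) / digest[:2] / digest[:4] / digest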

Early this week, we started seeing network issues that were causing OSDs to become unavailable for short periods of time: long enough to get logged by syslog, but not long enough to trigger a persistent warning or error state in ceph status. Conditions continued to degrade until two of the four nodes fell off the network, and the cluster started trying to migrate data en masse. After the network stabilized a short while later, the OSDs were all shown as online and OK, ceph seemed to recover cleanly, and it stopped trying to migrate data. In the process of getting the network stable, though, the two nodes that had fallen off the network had to be rebooted.

When all four nodes were back online and talking to each other, I noticed that the MDS was in "up:rejoin". After a period of time, it would eat all of the available memory and swap on whichever system was active. It would eventually either get killed off by the system due to memory usage, or become so slow that the monitors would drop it and pick another MDS as active. This cycle would repeat.

I added more swap to one system (160GB of swap total) and brought down the MDS service on the other three nodes, forcing the rejoin to happen on the node with the added swap. I also turned up debugging to see what it was actually doing. This was then allowed to run for about 14 hours overnight. When I arrived this morning, the system was still up but severely lagged: nearly all swap had been used, and the system had difficulty responding to commands. Out of options, I killed the process and watched as it tried to shut down cleanly, hoping to preserve as much of the work it had done as possible. I restarted it; it seemed to do more in replay, then re-entered rejoin, which is still running and giving no hint of finishing anytime soon.

The rejoin traffic I'm seeing in the MDS log looks like this:

2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.ino(0x100000108aa) verify_diri_backtrace
2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.dir(0x100000108aa) _fetched header 274 bytes 2323 keys for [dir 0x100000108aa /files-by-sha256/1c/1cc4/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x561738166a00]
2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.dir(0x100000108aa) _fetched version 59738838
2018-08-15 11:39:21.726 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:21.727 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:21.898 7f9c78534700 10 mds.beacon.ta-g17 handle_mds_beacon up:rejoin seq 377 rtt 1.400594
2018-08-15 11:39:24.564 7f9c752a5700 10 mds.beacon.ta-g17 _send up:rejoin seq 378
2018-08-15 11:39:25.503 7f9c78534700 10 mds.beacon.ta-g17 handle_mds_beacon up:rejoin seq 378 rtt 0.907796
2018-08-15 11:39:26.565 7f9c7229f700 10 mds.0.cache.dir(0x100000108aa) auth_unpin by 0x561738166a00 on [dir 0x100000108aa /files-by-sha256/1c/1cc4/ [2,head] auth v=59738838 cv=59738838/59738838 state=1073741825|complete f(v0 m2018-08-14 07:52:06.764154 2323=2323+0) n(v0 rc2018-08-14 07:52:06.764154 b3161079403 2323=2323+0) hs=2323+0,ss=0+0 | child=1 waiter=1 authpin=0 0x561738166a00] count now 0 + 0
2018-08-15 11:39:26.706 7f9c73aa2700  7 mds.0.13676 mds has 1 queued contexts
2018-08-15 11:39:26.706 7f9c73aa2700 10 mds.0.13676 0x5617cd27a790
2018-08-15 11:39:26.706 7f9c73aa2700 10 mds.0.13676  finish 0x5617cd27a790
2018-08-15 11:39:26.723 7f9c7229f700 10 MDSIOContextBase::complete: 21C_IO_Dir_OMAP_Fetched
2018-08-15 11:39:26.723 7f9c7229f700 10 mds.0.cache.ino(0x100000020f7) verify_diri_backtrace
2018-08-15 11:39:26.738 7f9c7229f700 10 mds.0.cache.dir(0x100000020f7) _fetched header 274 bytes 1899 keys for [dir 0x100000020f7 /files-by-sha256/a7/a723/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x5617351cbc00]
2018-08-15 11:39:26.792 7f9c7229f700 10 mds.0.cache.dir(0x100000020f7) _fetched version 59752211
2018-08-15 11:39:26.792 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:26.811 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:27.908 7f9c7229f700 10 mds.0.cache.dir(0x100000020f7) auth_unpin by 0x5617351cbc00 on [dir 0x100000020f7 /files-by-sha256/a7/a723/ [2,head] auth v=59752211 cv=59752211/59752211 state=1073741825|complete f(v0 m2018-08-14 08:14:21.893249 1899=1899+0) n(v0 rc2018-08-14 08:14:21.893249 b2658734443 1899=1899+0) hs=1899+0,ss=0+0 | child=1 waiter=1 authpin=0 0x5617351cbc00] count now 0 + 0
2018-08-15 11:39:27.962 7f9c7229f700 10 MDSIOContextBase::complete: 21C_IO_Dir_OMAP_Fetched

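To gauge progress, something like the following can tally the per-minute directory fetch rate (a rough sketch: it assumes the debug 10 log format shown above, and the log path is just an example):

#!/usr/bin/env python3
# Count "_fetched version" lines (one per directory fragment loaded during
# rejoin) per minute of MDS log, to see whether the fetch rate is holding
# steady or collapsing as memory pressure grows.
import re
import sys
from collections import Counter

FETCH = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}).* _fetched version ')

per_minute = Counter()
with open(sys.argv[1], errors='replace') as log:  # e.g. /var/log/ceph/ceph-mds.ta-g17.log
    for line in log:
        m = FETCH.match(line)
        if m:
            per_minute[m.group(1)] += 1

for minute, count in sorted(per_minute.items()):
    print(minute, count, "dirfrags")
print("total:", sum(per_minute.values()), "dirfrags fetched")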

I am at the point where I'd prefer to get this filesystem up sooner rather than later. There was likely some data in transit to the filesystem when the outage occurred (possibly as many as a few thousand files being created), but I'm willing to lose that data and let it get re-created by our processes when we detect it is missing.
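(For what it's worth, the "detect it missing" side is conceptually just a check like the sketch below; the mount point, manifest path, and re-queue step are placeholders, not our real pipeline.)

from pathlib import Path

ROOT = Path("/mnt/cephfs/files")                         # placeholder cephfs mount
MANIFEST = Path("/var/lib/ingest/expected-hashes.txt")   # placeholder list of expected SHA256 names

def missing_hashes():
    """Yield expected hashes whose sharded path is absent from the filesystem."""
    for line in MANIFEST.read_text().splitlines():
        h = line.strip()
        if h and not (ROOT / h[:2] / h[:4] / h).exists():
            yield h

for h in missing_hashes():
    print("re-queue", h)  # the real pipeline would re-create/re-upload these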

Is there anything I can do to make this more efficient, or to help get the process completed so the MDS comes back online?

jonathan
--
Jonathan Woytek
http://www.dryrose.com
KB3HOZ
PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
