Re: CephFS msg length greater than osd_max_write_size


 



Thanks for the reply! We will be more proactive about evicting clients in the future rather than waiting.
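
For anyone who finds this thread in the archive later, the kind of helper we're planning to script is roughly the sketch below. It only wraps the documented 'ceph tell mds.0 client ls' / 'client evict' commands; the single rank 0 and the JSON field names are assumptions based on our Nautilus output, so double-check them on your own cluster before using anything like this.

#!/usr/bin/env python3
"""Rough sketch of an eviction helper (Nautilus, single MDS rank 0 assumed).
It only wraps 'ceph tell mds.0 client ls' and 'ceph tell mds.0 client evict';
adjust the rank and the field names for your cluster."""
import json
import subprocess
import sys

def ceph(*args):
    # Run a ceph CLI command and return its stdout as text.
    return subprocess.check_output(("ceph",) + args, text=True)

def list_sessions(rank="0"):
    # 'client ls' prints the session list as JSON.
    return json.loads(ceph("tell", "mds.%s" % rank, "client", "ls"))

def evict(session_id, rank="0"):
    # Evicting a session also blacklists the client by default.
    ceph("tell", "mds.%s" % rank, "client", "evict", "id=%s" % session_id)

if __name__ == "__main__":
    for s in list_sessions():
        meta = s.get("client_metadata", {})
        print(s["id"], s.get("inst", "?"), meta.get("hostname", "?"))
    if len(sys.argv) > 1:      # e.g. ./evict.py 4305
        evict(sys.argv[1])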


One follow-up, however: it seems that the filesystem going read-only only raised a WARNING state, which didn't immediately catch our eye amid some other rebalancing operations. Is there a reason this isn't a HEALTH_ERR condition, since it represents a significant service degradation?
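
In the meantime we're looking at catching it ourselves from the health output, something like the sketch below. I haven't verified the exact health check code for the read-only condition on our version, so this just string-matches the check summaries and treats a match as critical:

#!/usr/bin/env python3
"""Sketch of an external check that escalates a read-only CephFS to critical
even though the cluster itself only reports HEALTH_WARN.  It matches on the
summary text because I haven't confirmed the exact check code name."""
import json
import subprocess
import sys

health = json.loads(subprocess.check_output(
    ["ceph", "health", "detail", "--format", "json"], text=True))

for name, check in health.get("checks", {}).items():
    message = check.get("summary", {}).get("message", "")
    if "read only" in message.lower() or "read-only" in message.lower():
        print("CRITICAL: %s: %s" % (name, message))
        sys.exit(2)      # Nagios-style critical exit code

print("OK: overall status %s" % health.get("status"))
sys.exit(0)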

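Also, for the archive, this is how we read the recovery procedure described below (blacklist the client with the stuck requests, then restart the MDS). The client address and the systemd unit name are placeholders for our environment, not values anyone should copy verbatim:

#!/usr/bin/env python3
"""Sketch of the recovery steps as we understood them: blacklist the client
whose requests are stuck, then restart the active MDS so it comes back
without the oversized SessionMap write.  The address and the unit name are
placeholders; this assumes it is run on the MDS host."""
import subprocess

STUCK_CLIENT_ADDR = "10.0.0.42:0/123456789"   # placeholder, from the "inst" field of 'client ls'
MDS_UNIT = "ceph-mds@objmds00.service"        # placeholder systemd unit for our active MDS

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Blacklist the client in the OSD map so it can no longer talk to the cluster.
run("ceph", "osd", "blacklist", "add", STUCK_CLIENT_ADDR)

# 2. Restart the active MDS; once it is back up the filesystem should no
#    longer be forced read-only.
run("systemctl", "restart", MDS_UNIT)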

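One more note for whoever greps the archive for these OSD log lines: the 94371840-byte limit in the log is just the osd_max_write_size default of 90 (MiB) expressed in bytes, and the rejected SessionMap writes were only slightly over it, which fits the picture of one session accumulating a very large item. Quick arithmetic:

#!/usr/bin/env python3
"""Quick arithmetic on the numbers in the OSD log: the limit is the default
osd_max_write_size (90, interpreted as MiB) in bytes, and the rejected MDS
SessionMap messages were just over it."""

limit_bytes = 90 * 1024 * 1024            # 94371840, matches the OSD log
rejected = [95146005, 95903764]           # "msg data len" values from the log

print("limit: %d bytes" % limit_bytes)
for n in rejected:
    print("rejected write: %d bytes (%+d vs the limit)" % (n, n - limit_bytes))

# To see the value actually in effect, the admin socket on an OSD host has
# 'ceph daemon osd.<id> config get osd_max_write_size'.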
Thanks!
Ryan


> On May 22, 2019, at 4:20 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> 
> On Tue, May 21, 2019 at 6:10 AM Ryan Leimenstoll
> <rleimens@xxxxxxxxxxxxxx> wrote:
>> 
>> Hi all,
>> 
>> We recently encountered an issue where our CephFS filesystem unexpectedly was set to read-only. When we look at some of the logs from the daemons I can see the following:
>> 
>> On the MDS:
>> ...
>> 2019-05-18 16:34:24.341 7fb3bd610700 -1 mds.0.89098 unhandled write error (90) Message too long, force readonly...
>> 2019-05-18 16:34:24.341 7fb3bd610700  1 mds.0.cache force file system read-only
>> 2019-05-18 16:34:24.341 7fb3bd610700  0 log_channel(cluster) log [WRN] : force file system read-only
>> 2019-05-18 16:34:41.289 7fb3c0616700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>> 2019-05-18 16:34:41.289 7fb3c0616700  0 mds.beacon.objmds00 Skipping beacon heartbeat to monitors (last acked 4.00101s ago); MDS internal heartbeat is not healthy!
>> ...
>> 
>> On one of the OSDs it was most likely targeting:
>> ...
>> 2019-05-18 16:34:24.140 7f8134e6c700 -1 osd.602 pg_epoch: 682796 pg[49.20b( v 682796'15706523 (682693'15703449,682796'15706523] local-lis/les=673041/673042 n=10524 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682796'15706523 lcod 682796'15706522 mlcod 682796'15706522 active+clean] do_op msg data len 95146005 > osd_max_write_size 94371840 on osd_op(mds.0.89098:48609421 49.20b 49:d0630e4c:::mds0_sessionmap:head [omap-set-header,omap-set-vals] snapc 0=[] ondisk+write+known_if_redirected+full_force e682796) v8
>> 2019-05-18 17:10:33.695 7f813466b700  0 log_channel(cluster) log [DBG] : 49.31c scrub starts
>> 2019-05-18 17:10:34.980 7f813466b700  0 log_channel(cluster) log [DBG] : 49.31c scrub ok
>> 2019-05-18 22:17:37.320 7f8134e6c700 -1 osd.602 pg_epoch: 683434 pg[49.20b( v 682861'15706526 (682693'15703449,682861'15706526] local-lis/les=673041/673042 n=10525 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682861'15706526 lcod 682859'15706525 mlcod 682859'15706525 active+clean] do_op msg data len 95903764 > osd_max_write_size 94371840 on osd_op(mds.0.91565:357877 49.20b 49:d0630e4c:::mds0_sessionmap:head [omap-set-header,omap-set-vals,omap-rm-keys] snapc 0=[] ondisk+write+known_if_redirected+full_force e683434) v8
>> …
>> 
>> During this time there were some health concerns with the cluster. Significantly, since the error above seems to be related to the SessionMap, we had a client that had a few blocked requests for over 35948 secs (it’s a member of a compute cluster so we let the node drain/finish jobs before rebooting). We have also had some issues with certain OSDs running older hardware staying up/responding timely to heartbeats after upgrading to Nautilus, although that seems to be an iowait/load issue that we are actively working to mitigate separately.
>> 
> 
> This prevents the MDS from trimming the completed requests recorded in the
> session, which results in a very large session item. To recover, blacklist the
> client that has the blocked requests, then restart the MDS.
> 
>> We are running Nautilus 14.2.1 on RHEL7.6. There is only one MDS Rank, with an active/standby setup between two MDS nodes. MDS clients are mounted using the RHEL7.6 kernel driver.
>> 
>> My read here would be that the MDS is sending too large a message to the OSD; however, my understanding was that the MDS should be using osd_max_write_size to determine the size of that message [0]. Is this maybe a bug in how this is calculated on the MDS side?
>> 
>> 
>> Thanks!
>> Ryan Leimenstoll
>> rleimens@xxxxxxxxxxxxxx
>> University of Maryland Institute for Advanced Computer Studies
>> 
>> 
>> 
>> [0] https://www.spinics.net/lists/ceph-devel/msg11951.html

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



