On Tue, Sep 10, 2019 at 9:46 AM Marc Schöchlin <ms@xxxxxxxxxx> wrote:
>
> Hello Mike,
>
> as described I set all the settings.
>
> Unfortunately it also crashed with these settings :-(
>
> Regards
> Marc
>
> [Tue Sep 10 12:25:56 2019] Btrfs loaded, crc32c=crc32c-intel
> [Tue Sep 10 12:25:57 2019] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
> [Tue Sep 10 12:25:59 2019] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
> [Tue Sep 10 12:25:59 2019] systemd[1]: Detected virtualization xen.
> [Tue Sep 10 12:25:59 2019] systemd[1]: Detected architecture x86-64.
> [Tue Sep 10 12:25:59 2019] systemd[1]: Set hostname to <int-nfs-001>.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Started ntp-systemd-netif.path.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Created slice System Slice.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Listening on udev Kernel Socket.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Created slice system-serial\x2dgetty.slice.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Listening on Journal Socket.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Mounting POSIX Message Queue File System...
> [Tue Sep 10 12:26:01 2019] RPC: Registered named UNIX socket transport module.
> [Tue Sep 10 12:26:01 2019] RPC: Registered udp transport module.
> [Tue Sep 10 12:26:01 2019] RPC: Registered tcp transport module.
> [Tue Sep 10 12:26:01 2019] RPC: Registered tcp NFSv4.1 backchannel transport module.
> [Tue Sep 10 12:26:01 2019] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
> [Tue Sep 10 12:26:01 2019] Loading iSCSI transport class v2.0-870.
> [Tue Sep 10 12:26:01 2019] iscsi: registered transport (tcp)
> [Tue Sep 10 12:26:01 2019] systemd-journald[497]: Received request to flush runtime journal from PID 1
> [Tue Sep 10 12:26:01 2019] Installing knfsd (copyright (C) 1996 okir@xxxxxxxxxxxx).
> [Tue Sep 10 12:26:01 2019] iscsi: registered transport (iser)
> [Tue Sep 10 12:26:01 2019] systemd-journald[497]: File /var/log/journal/cef15a6d1b80c9fbcb31a3a65aec21ad/system.journal corrupted or uncleanly shut down, renaming and replacing.
> [Tue Sep 10 12:26:04 2019] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
> [Tue Sep 10 12:26:05 2019] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.659:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=902 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=904 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=904 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=904 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-cgns" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-mounting" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-nesting" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.723:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=905 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.723:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=905 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] new mount options do not match the existing superblock, will be ignored
> [Tue Sep 10 12:26:09 2019] SGI XFS with ACLs, security attributes, realtime, no debug enabled
> [Tue Sep 10 12:26:09 2019] XFS (nbd0): Mounting V5 Filesystem
> [Tue Sep 10 12:26:11 2019] XFS (nbd0): Starting recovery (logdev: internal)
> [Tue Sep 10 12:26:12 2019] XFS (nbd0): Ending recovery (logdev: internal)
> [Tue Sep 10 12:26:12 2019] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> [Tue Sep 10 12:26:12 2019] NFSD: starting 90-second grace period (net f00000f8)
> [Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
> [Tue Sep 10 14:45:04 2019] block nbd0: shutting down sockets
> [Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
> [Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
> [Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 89228968 flags 3
> [Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 121872688 flags 3
> [Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 123189160 flags 3
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967168 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967280 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 8 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967168 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967280 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 8 flags 0
> [Tue Sep 10 14:46:51 2019] INFO: task xfsaild/nbd0:1919 blocked for more than 120 seconds.
> [Tue Sep 10 14:46:51 2019] Not tainted 5.0.0-27-generic #28~18.04.1-Ubuntu
> [Tue Sep 10 14:46:51 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Tue Sep 10 14:46:51 2019] xfsaild/nbd0 D 0 1919 2 0x80000000
> [Tue Sep 10 14:46:51 2019] Call Trace:
> [Tue Sep 10 14:46:51 2019] __schedule+0x2bd/0x850
> [Tue Sep 10 14:46:51 2019] ? try_to_del_timer_sync+0x53/0x80
> [Tue Sep 10 14:46:51 2019] schedule+0x2c/0x70
> [Tue Sep 10 14:46:51 2019] xfs_log_force+0x15f/0x2e0 [xfs]
> [Tue Sep 10 14:46:51 2019] ? wake_up_q+0x80/0x80
> [Tue Sep 10 14:46:51 2019] xfsaild+0x17b/0x800 [xfs]
> [Tue Sep 10 14:46:51 2019] ? __schedule+0x2c5/0x850
> [Tue Sep 10 14:46:51 2019] kthread+0x121/0x140
> [Tue Sep 10 14:46:51 2019] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> [Tue Sep 10 14:46:51 2019] ? kthread+0x121/0x140
> [Tue Sep 10 14:46:51 2019] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> [Tue Sep 10 14:46:51 2019] ? kthread_park+0x90/0x90
> [Tue Sep 10 14:46:51 2019] ret_from_fork+0x35/0x40

Perhaps try it w/ ext4 instead of XFS?

> On 10.09.19 at 12:51, Marc Schöchlin wrote:
> > Hello Mike,
> >
> > On 03.09.19 at 04:41, Mike Christie wrote:
> >> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
> >>> Hello Mike,
> >>>
> >>> I am having a quick look at this on vacation because my coworker reports daily and continuous crashes ;-)
> >>> Any updates here (I am aware that this is not very easy to fix)?
> >> I am still working on it. It basically requires rbd-nbd to be written so that it preallocates its memory used for IO, and when it can't, like when doing network IO, it requires adding an interface to tell the kernel not to use allocation flags that can cause disk IO back onto the device.
> >>
> >> There are some workarounds like adding more memory and setting the vm values. For the latter, it seems if you set:
> >>
> >> vm.dirty_background_ratio = 0 then it looks like it avoids the problem because the kernel will immediately start to write dirty pages from the background worker threads, so we do not end up later needing to write out pages from the rbd-nbd thread to free up memory.
> > Sigh, I set this yesterday on my system ("sysctl vm.dirty_background_ratio=0") and got an additional crash this night :-(
> >
> > I now restarted the system and invoked all of the following commands mentioned in your last mail:
> >
> > sysctl vm.dirty_background_ratio=0
> > sysctl vm.dirty_ratio=0
> > sysctl vm.vfs_cache_pressure=0
> >
> > Let's see if that helps....
> >
> > Regards
> >
> > Marc
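For reference, a minimal sketch of making those three settings persistent across reboots, assuming a stock Ubuntu 18.04 host like the one in the log above and a root shell; the drop-in path /etc/sysctl.d/90-rbd-nbd-workaround.conf is only an example name:

    # Write the settings to a sysctl drop-in (example file name), reload, verify.
    printf '%s\n' 'vm.dirty_background_ratio = 0' \
                  'vm.dirty_ratio = 0' \
                  'vm.vfs_cache_pressure = 0' \
        > /etc/sysctl.d/90-rbd-nbd-workaround.conf
    sysctl --system
    sysctl vm.dirty_background_ratio vm.dirty_ratio vm.vfs_cache_pressure

    # With dirty_background_ratio=0 the kernel should flush dirty pages
    # immediately, so Dirty in /proc/meminfo stays near zero under write load.
    grep -E '^(Dirty|Writeback):' /proc/meminfo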
> >
> > On 03.09.19 at 04:41, Mike Christie wrote:
> >> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
> >>> Hello Mike,
> >>>
> >>> I am having a quick look at this on vacation because my coworker reports daily and continuous crashes ;-)
> >>> Any updates here (I am aware that this is not very easy to fix)?
> >> I am still working on it. It basically requires rbd-nbd to be written so that it preallocates its memory used for IO, and when it can't, like when doing network IO, it requires adding an interface to tell the kernel not to use allocation flags that can cause disk IO back onto the device.
> >>
> >> There are some workarounds like adding more memory and setting the vm values. For the latter, it seems if you set:
> >>
> >> vm.dirty_background_ratio = 0 then it looks like it avoids the problem because the kernel will immediately start to write dirty pages from the background worker threads, so we do not end up later needing to write out pages from the rbd-nbd thread to free up memory.
> >>
> >> or
> >>
> >> vm.dirty_ratio = 0 then it looks like it avoids the problem because the kernel will just write out the data right away, similar to above, but it is normally going to be written out from the thread that you are running your test from.
> >>
> >> and this seems optional and can result in other problems:
> >>
> >> vm.vfs_cache_pressure = 0 then for at least XFS it looks like we avoid one of the immediate problems where allocations would always cause the inode caches to be reclaimed and that memory to be written out to the device. For EXT4, I did not see a similar issue.
> >>
> >>> I think the severity of this problem <https://tracker.ceph.com/issues/40822> (currently "minor") is not appropriate given the consequences of this problem.
> >>>
> >>> This reproducible problem can cause:
> >>>
> >>> * random service outage
> >>> * data corruption
> >>> * long recovery procedures on huge filesystems
> >>>
> >>> Is it adequate to increase the severity to major or critical?
> >>>
> >>> What might be the reason that rbd-nbd runs very reliably on my Xen servers as a storage repository?
> >>> (see https://github.com/vico-research-and-consulting/RBDSR/tree/v2.0 - hundreds of devices, high workload)
> >>>
> >>> Regards
> >>> Marc
> >>>
> >>> On 15.08.19 at 20:07, Marc Schöchlin wrote:
> >>>> Hello Mike,
> >>>>
> >>>> On 15.08.19 at 19:57, Mike Christie wrote:
> >>>>>> Don't waste your time. I found a way to replicate it now.
> >>>>>>
> >>>>> Just a quick update.
> >>>>>
> >>>>> Looks like we are trying to allocate memory in the IO path in a way that can swing back on us, so we can end up locking up. You are probably not hitting this with krbd in your setup because normally it's preallocating structs, using flags like GFP_NOIO, etc. For rbd-nbd, we cannot preallocate some structs and cannot control the allocation flags for some operations initiated from userspace, so it's possible to hit this on every IO. I can replicate this now in a second just doing a cp -r.
> >>>>>
> >>>>> It's not going to be a simple fix. We have had a similar issue for storage daemons like iscsid and multipathd since they were created. It's less likely to hit with them because you only hit the paths where they cannot control memory allocation behavior during recovery.
> >>>>>
> >>>>> I am looking into some things now.
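As a rough illustration of that "cp -r" reproducer, a sketch that drives buffered writes and watches for the writeback pressure described above; the mount point /mnt/rbd and the /usr source tree are hypothetical placeholders and should be adjusted to the actual rbd-nbd mapping:

    # Hypothetical mount point of the rbd-nbd backed filesystem; adjust as needed.
    MNT=/mnt/rbd

    # Sustained buffered writes, similar to the cp -r reproducer mentioned above.
    cp -r /usr "$MNT/usr-copy" &

    # Watch dirty/writeback pages and the rbd-nbd process state; a lockup can show
    # up as growing Dirty/Writeback counters and rbd-nbd stuck in D state, followed
    # by the nbd0 timeouts and hung-task warnings seen in the log above.
    watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo; ps -o pid,stat,wchan:30,cmd -C rbd-nbd'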
> >>>> Great to hear that the problem is now identified.
> >>>>
> >>>> As described I'm on vacation - if you need anything after the 8.9. we can probably invest some time to test upcoming fixes.
> >>>>
> >>>> Regards
> >>>> Marc
> >>>>
> >>>>
> >>> --
> >>> GPG encryption available: 0x670DCBEC/pool.sks-keyservers.net
> >>>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com