On Tue, Sep 10, 2019 at 9:46 AM Marc Schöchlin <ms@xxxxxxxxxx> wrote:
>
> Hello Mike,
>
> as described I set all the settings.
>
> Unfortunately it also crashed with these settings :-(
>
> Regards
> Marc
>
> [Tue Sep 10 12:25:56 2019] Btrfs loaded, crc32c=crc32c-intel
> [Tue Sep 10 12:25:57 2019] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
> [Tue Sep 10 12:25:59 2019] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
> [Tue Sep 10 12:25:59 2019] systemd[1]: Detected virtualization xen.
> [Tue Sep 10 12:25:59 2019] systemd[1]: Detected architecture x86-64.
> [Tue Sep 10 12:25:59 2019] systemd[1]: Set hostname to <int-nfs-001>.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Started ntp-systemd-netif.path.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Created slice System Slice.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Listening on udev Kernel Socket.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Created slice system-serial\x2dgetty.slice.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Listening on Journal Socket.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Mounting POSIX Message Queue File System...
> [Tue Sep 10 12:26:01 2019] RPC: Registered named UNIX socket transport module.
> [Tue Sep 10 12:26:01 2019] RPC: Registered udp transport module.
> [Tue Sep 10 12:26:01 2019] RPC: Registered tcp transport module.
> [Tue Sep 10 12:26:01 2019] RPC: Registered tcp NFSv4.1 backchannel transport module.
> [Tue Sep 10 12:26:01 2019] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
> [Tue Sep 10 12:26:01 2019] Loading iSCSI transport class v2.0-870.
> [Tue Sep 10 12:26:01 2019] iscsi: registered transport (tcp)
> [Tue Sep 10 12:26:01 2019] systemd-journald[497]: Received request to flush runtime journal from PID 1
> [Tue Sep 10 12:26:01 2019] Installing knfsd (copyright (C) 1996 okir@xxxxxxxxxxxx).
> [Tue Sep 10 12:26:01 2019] iscsi: registered transport (iser)
> [Tue Sep 10 12:26:01 2019] systemd-journald[497]: File /var/log/journal/cef15a6d1b80c9fbcb31a3a65aec21ad/system.journal corrupted or uncleanly shut down, renaming and replacing.
> [Tue Sep 10 12:26:04 2019] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
> [Tue Sep 10 12:26:05 2019] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.659:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=902 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=904 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=904 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=904 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-cgns" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-mounting" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-nesting" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.723:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=905 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.723:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=905 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] new mount options do not match the existing superblock, will be ignored
> [Tue Sep 10 12:26:09 2019] SGI XFS with ACLs, security attributes, realtime, no debug enabled
> [Tue Sep 10 12:26:09 2019] XFS (nbd0): Mounting V5 Filesystem
> [Tue Sep 10 12:26:11 2019] XFS (nbd0): Starting recovery (logdev: internal)
> [Tue Sep 10 12:26:12 2019] XFS (nbd0): Ending recovery (logdev: internal)
> [Tue Sep 10 12:26:12 2019] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> [Tue Sep 10 12:26:12 2019] NFSD: starting 90-second grace period (net f00000f8)
> [Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
> [Tue Sep 10 14:45:04 2019] block nbd0: shutting down sockets
> [Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
> [Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
> [Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 89228968 flags 3
> [Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 121872688 flags 3
> [Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 123189160 flags 3
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967168 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967280 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 8 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967168 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967280 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
> [Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 8 flags 0
> [Tue Sep 10 14:46:51 2019] INFO: task xfsaild/nbd0:1919 blocked for more than 120 seconds.
> [Tue Sep 10 14:46:51 2019] Not tainted 5.0.0-27-generic #28~18.04.1-Ubuntu
> [Tue Sep 10 14:46:51 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Tue Sep 10 14:46:51 2019] xfsaild/nbd0 D 0 1919 2 0x80000000
> [Tue Sep 10 14:46:51 2019] Call Trace:
> [Tue Sep 10 14:46:51 2019] __schedule+0x2bd/0x850
> [Tue Sep 10 14:46:51 2019] ? try_to_del_timer_sync+0x53/0x80
> [Tue Sep 10 14:46:51 2019] schedule+0x2c/0x70
> [Tue Sep 10 14:46:51 2019] xfs_log_force+0x15f/0x2e0 [xfs]
> [Tue Sep 10 14:46:51 2019] ? wake_up_q+0x80/0x80
> [Tue Sep 10 14:46:51 2019] xfsaild+0x17b/0x800 [xfs]
> [Tue Sep 10 14:46:51 2019] ? __schedule+0x2c5/0x850
> [Tue Sep 10 14:46:51 2019] kthread+0x121/0x140
> [Tue Sep 10 14:46:51 2019] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> [Tue Sep 10 14:46:51 2019] ? kthread+0x121/0x140
> [Tue Sep 10 14:46:51 2019] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> [Tue Sep 10 14:46:51 2019] ? kthread_park+0x90/0x90
> [Tue Sep 10 14:46:51 2019] ret_from_fork+0x35/0x40

Perhaps try it w/ ext4 instead of XFS?

> On 10.09.19 at 12:51, Marc Schöchlin wrote:
> > Hello Mike,
> >
> > On 03.09.19 at 04:41, Mike Christie wrote:
> >> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
> >>> Hello Mike,
> >>>
> >>> I am having a quick look at this on vacation because my coworker reports daily and continuous crashes ;-)
> >>> Any updates here (I am aware that this is not very easy to fix)?
> >> I am still working on it. It basically requires rbd-nbd to be written so that it preallocates its memory used for IO, and when it can't, like when doing network IO, it requires adding an interface to tell the kernel not to use allocation flags that can cause disk IO back onto the device.
> >>
> >> There are some workarounds like adding more memory and setting the vm values. For the latter, it seems if you set:
> >>
> >> vm.dirty_background_ratio = 0 then it looks like it avoids the problem because the kernel will immediately start to write dirty pages from the background worker threads, so we do not end up later needing to write out pages from the rbd-nbd thread to free up memory.
> > Sigh, I set this yesterday on my system ("sysctl vm.dirty_background_ratio=0") and got an additional crash this night :-(
> >
> > I now restarted the system and invoked all of the following commands mentioned in your last mail:
> >
> > sysctl vm.dirty_background_ratio=0
> > sysctl vm.dirty_ratio=0
> > sysctl vm.vfs_cache_pressure=0
> >
> > Let's see if that helps....
> >
> > Regards
> >
> > Marc
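For reference, a minimal sketch of making those three settings persistent across reboots, assuming a stock Ubuntu 18.04 host like the one in the log above and a root shell; the drop-in path /etc/sysctl.d/90-rbd-nbd-workaround.conf is only an example name:

    # Write the settings to a sysctl drop-in (example file name), reload, verify.
    printf '%s\n' 'vm.dirty_background_ratio = 0' \
                  'vm.dirty_ratio = 0' \
                  'vm.vfs_cache_pressure = 0' \
        > /etc/sysctl.d/90-rbd-nbd-workaround.conf
    sysctl --system
    sysctl vm.dirty_background_ratio vm.dirty_ratio vm.vfs_cache_pressure

    # With dirty_background_ratio=0 the kernel should flush dirty pages
    # immediately, so Dirty in /proc/meminfo stays near zero under write load.
    grep -E '^(Dirty|Writeback):' /proc/meminfo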
> >
> > On 03.09.19 at 04:41, Mike Christie wrote:
> >> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
> >>> Hello Mike,
> >>>
> >>> I am having a quick look at this on vacation because my coworker reports daily and continuous crashes ;-)
> >>> Any updates here (I am aware that this is not very easy to fix)?
> >> I am still working on it. It basically requires rbd-nbd to be written so that it preallocates its memory used for IO, and when it can't, like when doing network IO, it requires adding an interface to tell the kernel not to use allocation flags that can cause disk IO back onto the device.
> >>
> >> There are some workarounds like adding more memory and setting the vm values. For the latter, it seems if you set:
> >>
> >> vm.dirty_background_ratio = 0 then it looks like it avoids the problem because the kernel will immediately start to write dirty pages from the background worker threads, so we do not end up later needing to write out pages from the rbd-nbd thread to free up memory.
> >>
> >> or
> >>
> >> vm.dirty_ratio = 0 then it looks like it avoids the problem because the kernel will just write out the data right away, similar to above, but it is normally going to be written out from the thread that you are running your test from.
> >>
> >> and this seems optional and can result in other problems:
> >>
> >> vm.vfs_cache_pressure = 0 then for at least XFS it looks like we avoid one of the immediate problems where allocations would always cause the inode caches to be reclaimed and that memory to be written out to the device. For EXT4, I did not see a similar issue.
> >>
> >>> I think the severity of this problem <https://tracker.ceph.com/issues/40822> (currently "minor") is not appropriate given the consequences of this problem.
> >>>
> >>> This reproducible problem can cause:
> >>>
> >>> * random service outage
> >>> * data corruption
> >>> * long recovery procedures on huge filesystems
> >>>
> >>> Is it adequate to increase the severity to major or critical?
> >>>
> >>> What might be the reason that rbd-nbd runs very reliably on my Xen servers as a storage repository?
> >>> (see https://github.com/vico-research-and-consulting/RBDSR/tree/v2.0 - hundreds of devices, high workload)
> >>>
> >>> Regards
> >>> Marc
> >>>
> >>> On 15.08.19 at 20:07, Marc Schöchlin wrote:
> >>>> Hello Mike,
> >>>>
> >>>> On 15.08.19 at 19:57, Mike Christie wrote:
> >>>>>> Don't waste your time. I found a way to replicate it now.
> >>>>>>
> >>>>> Just a quick update.
> >>>>>
> >>>>> Looks like we are trying to allocate memory in the IO path in a way that can swing back on us, so we can end up locking up. You are probably not hitting this with krbd in your setup because normally it's preallocating structs, using flags like GFP_NOIO, etc. For rbd-nbd, we cannot preallocate some structs and cannot control the allocation flags for some operations initiated from userspace, so it's possible to hit this on every IO. I can replicate this now in a second just doing a cp -r.
> >>>>>
> >>>>> It's not going to be a simple fix. We have had a similar issue for storage daemons like iscsid and multipathd since they were created. It's less likely to hit with them because you only hit the paths where they cannot control memory allocation behavior during recovery.
> >>>>>
> >>>>> I am looking into some things now.
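As a rough illustration of that "cp -r" reproducer, a sketch that drives buffered writes and watches for the writeback pressure described above; the mount point /mnt/rbd and the /usr source tree are hypothetical placeholders and should be adjusted to the actual rbd-nbd mapping:

    # Hypothetical mount point of the rbd-nbd backed filesystem; adjust as needed.
    MNT=/mnt/rbd

    # Sustained buffered writes, similar to the cp -r reproducer mentioned above.
    cp -r /usr "$MNT/usr-copy" &

    # Watch dirty/writeback pages and the rbd-nbd process state; a lockup can show
    # up as growing Dirty/Writeback counters and rbd-nbd stuck in D state, followed
    # by the nbd0 timeouts and hung-task warnings seen in the log above.
    watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo; ps -o pid,stat,wchan:30,cmd -C rbd-nbd'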
> >>>> Great to hear that the problem is now identified.
> >>>>
> >>>> As described I'm on vacation - if you need anything after the 8.9. we can probably invest some time to test upcoming fixes.
> >>>>
> >>>> Regards
> >>>> Marc
> >>>>
> >>>>
> >>> --
> >>> GPG encryption available: 0x670DCBEC/pool.sks-keyservers.net
> >>>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com