Hello Mike,

as described, I set all the settings. Unfortunately it also crashed with these settings :-(

Regards
Marc

[Tue Sep 10 12:25:56 2019] Btrfs loaded, crc32c=crc32c-intel
[Tue Sep 10 12:25:57 2019] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[Tue Sep 10 12:25:59 2019] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
[Tue Sep 10 12:25:59 2019] systemd[1]: Detected virtualization xen.
[Tue Sep 10 12:25:59 2019] systemd[1]: Detected architecture x86-64.
[Tue Sep 10 12:25:59 2019] systemd[1]: Set hostname to <int-nfs-001>.
[Tue Sep 10 12:26:01 2019] systemd[1]: Started ntp-systemd-netif.path.
[Tue Sep 10 12:26:01 2019] systemd[1]: Created slice System Slice.
[Tue Sep 10 12:26:01 2019] systemd[1]: Listening on udev Kernel Socket.
[Tue Sep 10 12:26:01 2019] systemd[1]: Created slice system-serial\x2dgetty.slice.
[Tue Sep 10 12:26:01 2019] systemd[1]: Listening on Journal Socket.
[Tue Sep 10 12:26:01 2019] systemd[1]: Mounting POSIX Message Queue File System...
[Tue Sep 10 12:26:01 2019] RPC: Registered named UNIX socket transport module.
[Tue Sep 10 12:26:01 2019] RPC: Registered udp transport module.
[Tue Sep 10 12:26:01 2019] RPC: Registered tcp transport module.
[Tue Sep 10 12:26:01 2019] RPC: Registered tcp NFSv4.1 backchannel transport module.
[Tue Sep 10 12:26:01 2019] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
[Tue Sep 10 12:26:01 2019] Loading iSCSI transport class v2.0-870.
[Tue Sep 10 12:26:01 2019] iscsi: registered transport (tcp)
[Tue Sep 10 12:26:01 2019] systemd-journald[497]: Received request to flush runtime journal from PID 1
[Tue Sep 10 12:26:01 2019] Installing knfsd (copyright (C) 1996 okir@xxxxxxxxxxxx).
[Tue Sep 10 12:26:01 2019] iscsi: registered transport (iser)
[Tue Sep 10 12:26:01 2019] systemd-journald[497]: File /var/log/journal/cef15a6d1b80c9fbcb31a3a65aec21ad/system.journal corrupted or uncleanly shut down, renaming and replacing.
[Tue Sep 10 12:26:04 2019] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
[Tue Sep 10 12:26:05 2019] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.659:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=902 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=904 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=904 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=904 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-cgns" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-mounting" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-nesting" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.723:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=905 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.723:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=905 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] new mount options do not match the existing superblock, will be ignored
[Tue Sep 10 12:26:09 2019] SGI XFS with ACLs, security attributes, realtime, no debug enabled
[Tue Sep 10 12:26:09 2019] XFS (nbd0): Mounting V5 Filesystem
[Tue Sep 10 12:26:11 2019] XFS (nbd0): Starting recovery (logdev: internal)
[Tue Sep 10 12:26:12 2019] XFS (nbd0): Ending recovery (logdev: internal)
[Tue Sep 10 12:26:12 2019] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[Tue Sep 10 12:26:12 2019] NFSD: starting 90-second grace period (net f00000f8)
[Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
[Tue Sep 10 14:45:04 2019] block nbd0: shutting down sockets
[Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
[Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
[Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 89228968 flags 3
[Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 121872688 flags 3
[Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 123189160 flags 3
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967168 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967280 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 8 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967168 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967280 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 8 flags 0
[Tue Sep 10 14:46:51 2019] INFO: task xfsaild/nbd0:1919 blocked for more than 120 seconds.
[Tue Sep 10 14:46:51 2019] Not tainted 5.0.0-27-generic #28~18.04.1-Ubuntu
[Tue Sep 10 14:46:51 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 10 14:46:51 2019] xfsaild/nbd0 D 0 1919 2 0x80000000
[Tue Sep 10 14:46:51 2019] Call Trace:
[Tue Sep 10 14:46:51 2019]  __schedule+0x2bd/0x850
[Tue Sep 10 14:46:51 2019]  ? try_to_del_timer_sync+0x53/0x80
[Tue Sep 10 14:46:51 2019]  schedule+0x2c/0x70
[Tue Sep 10 14:46:51 2019]  xfs_log_force+0x15f/0x2e0 [xfs]
[Tue Sep 10 14:46:51 2019]  ? wake_up_q+0x80/0x80
[Tue Sep 10 14:46:51 2019]  xfsaild+0x17b/0x800 [xfs]
[Tue Sep 10 14:46:51 2019]  ? __schedule+0x2c5/0x850
[Tue Sep 10 14:46:51 2019]  kthread+0x121/0x140
[Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[Tue Sep 10 14:46:51 2019]  ? kthread+0x121/0x140
[Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[Tue Sep 10 14:46:51 2019]  ? kthread_park+0x90/0x90
[Tue Sep 10 14:46:51 2019]  ret_from_fork+0x35/0x40

On 10.09.19 12:51, Marc Schöchlin wrote:
> Hello Mike,
>
> On 03.09.19 04:41, Mike Christie wrote:
>> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
>>> Hello Mike,
>>>
>>> I am having a quick look at this on vacation because my coworker
>>> reports daily and continuous crashes ;-)
>>> Any updates here (I am aware that this is not very easy to fix)?
>> I am still working on it. It basically requires rbd-nbd to be rewritten so
>> that it preallocates the memory it uses for IO, and where it can't, like when
>> doing network IO, it requires adding an interface to tell the kernel
>> not to use allocation flags that can cause disk IO back onto the device.
>>
>> There are some workarounds like adding more memory and setting the vm
>> values. For the latter, it seems that if you set:
>>
>> vm.dirty_background_ratio = 0, then it looks like it avoids the problem
>> because the kernel will immediately start to write dirty pages from the
>> background worker threads, so we do not end up later needing to write
>> out pages from the rbd-nbd thread to free up memory.
> Sigh, I set this yesterday on my system ("sysctl vm.dirty_background_ratio=0") and got an additional crash this night :-(
>
> I have now restarted the system and invoked all of the following commands mentioned in your last mail:
>
> sysctl vm.dirty_background_ratio=0
> sysctl vm.dirty_ratio=0
> sysctl vm.vfs_cache_pressure=0
>
> Let's see if that helps....
>
> Regards
>
> Marc
>
>
> On 03.09.19 04:41, Mike Christie wrote:
>> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
>>> Hello Mike,
>>>
>>> I am having a quick look at this on vacation because my coworker
>>> reports daily and continuous crashes ;-)
>>> Any updates here (I am aware that this is not very easy to fix)?
>> I am still working on it. It basically requires rbd-nbd to be rewritten so
>> that it preallocates the memory it uses for IO, and where it can't, like when
>> doing network IO, it requires adding an interface to tell the kernel
>> not to use allocation flags that can cause disk IO back onto the device.
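As a side note for anyone applying the workaround: the three sysctl commands above only change the running kernel. A rough way to also keep them across reboots is a sysctl.d drop-in along these lines (just a sketch - the file name is arbitrary and the values are simply the ones suggested in this thread):

# write the suggested workaround values into a sysctl.d drop-in (example file name)
sudo tee /etc/sysctl.d/99-rbd-nbd-workaround.conf <<'EOF'
vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
vm.vfs_cache_pressure = 0
EOF

# reload all sysctl drop-ins and verify the values that are now active
sudo sysctl --system
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.vfs_cache_pressure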
>>
>> There are some workarounds like adding more memory and setting the vm
>> values. For the latter, it seems that if you set:
>>
>> vm.dirty_background_ratio = 0, then it looks like it avoids the problem
>> because the kernel will immediately start to write dirty pages from the
>> background worker threads, so we do not end up later needing to write
>> out pages from the rbd-nbd thread to free up memory.
>>
>> or
>>
>> vm.dirty_ratio = 0, then it looks like it avoids the problem because the
>> kernel will just write out the data right away, similar to above, but
>> it is normally going to be written out from the thread that you are
>> running your test from.
>>
>> and this seems optional and can result in other problems:
>>
>> vm.vfs_cache_pressure = 0, then for at least XFS it looks like we avoid
>> one of the immediate problems where allocations would always cause the
>> inode caches to be reclaimed and that memory to be written out to the
>> device. For EXT4, I did not see a similar issue.
>>
>>> I think the severity of this problem
>>> <https://tracker.ceph.com/issues/40822> (currently "minor") does not
>>> match the consequences of this problem.
>>>
>>> This reproducible problem can cause:
>>>
>>> * random service outages
>>> * data corruption
>>> * long recovery procedures on huge filesystems
>>>
>>> Is it adequate to increase the severity to major or critical?
>>>
>>> What might be the reason that rbd-nbd runs very reliably on my Xen
>>> servers as a storage repository?
>>> (see https://github.com/vico-research-and-consulting/RBDSR/tree/v2.0 -
>>> hundreds of devices, high workload)
>>>
>>> Regards
>>> Marc
>>>
>>> On 15.08.19 20:07, Marc Schöchlin wrote:
>>>> Hello Mike,
>>>>
>>>> On 15.08.19 19:57, Mike Christie wrote:
>>>>>> Don't waste your time. I found a way to replicate it now.
>>>>>>
>>>>> Just a quick update.
>>>>>
>>>>> Looks like we are trying to allocate memory in the IO path in a way that
>>>>> can swing back on us, so we can end up locking up. You are probably not
>>>>> hitting this with krbd in your setup because normally it's preallocating
>>>>> structs, using flags like GFP_NOIO, etc. For rbd-nbd, we cannot
>>>>> preallocate some structs and cannot control the allocation flags for
>>>>> some operations initiated from userspace, so it's possible to hit this
>>>>> on every IO. I can replicate this now in a second just by doing a cp -r.
>>>>>
>>>>> It's not going to be a simple fix. We have had a similar issue for
>>>>> storage daemons like iscsid and multipathd since they were created. It's
>>>>> less likely to hit with them because you only hit the paths where they
>>>>> cannot control memory allocation behavior during recovery.
>>>>>
>>>>> I am looking into some things now.
>>>> Great to hear that the problem is now identified.
>>>>
>>>> As described, I'm on vacation - if you need anything after September 8 we can probably invest some time to test upcoming fixes.
>>>>
>>>> Regards
>>>> Marc
>>>>
>>>>
>>> --
>>> GPG encryption available: 0x670DCBEC/pool.sks-keyservers.net
>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
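P.S.: For reference, a rough way to watch the writeback pressure that leads to this hang while reproducing it with a cp -r workload, as Mike described. This is purely a sketch - /dev/nbd0 comes from the logs above, the mount point and source directory are only examples:

# generate writeback against the rbd-nbd backed XFS mount (paths are examples)
cp -r /usr/share /srv/nbd-test/copytest &

# once per second, show the dirty/writeback page counters and the
# in-flight request counts on nbd0 while the copy is running
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo; cat /sys/block/nbd0/inflight'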