Hello Mike,

Sigh, I set this yesterday on my system ("sysctl vm.dirty_background_ratio=0")
and got an additional crash this night :-(

I have now restarted the system and invoked all of the following commands
mentioned in your last mail:

sysctl vm.dirty_background_ratio=0
sysctl vm.dirty_ratio=0
sysctl vm.vfs_cache_pressure=0

Let's see if that helps...
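For reference, this is roughly how I am persisting the three settings across
reboots - a sketch assuming a distribution that reads /etc/sysctl.d/ at boot
(the file name below is my own choice):

# as root: drop the settings into a sysctl snippet so they survive a reboot
cat > /etc/sysctl.d/90-rbd-nbd-workaround.conf <<'EOF'
# workaround for rbd-nbd lockups under memory pressure (see this thread)
vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
# optional and possibly problematic, see the note quoted below
vm.vfs_cache_pressure = 0
EOF

# reload all sysctl configuration files
sysctl --system

If --system is not available in your procps version,
"sysctl -p /etc/sysctl.d/90-rbd-nbd-workaround.conf" should load just this file.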
Regards
Marc

On 03.09.19 at 04:41, Mike Christie wrote:
> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
>> Hello Mike,
>>
>> I am having a quick look at this on vacation because my coworker
>> reports daily and continuous crashes ;-)
>> Any updates here (I am aware that this is not very easy to fix)?
> I am still working on it. It basically requires rbd-nbd to be rewritten
> so that it preallocates the memory used for IO, and when it can't, like
> when doing network IO, it requires adding an interface to tell the
> kernel not to use allocation flags that can cause disk IO back onto the
> device.
>
> There are some workarounds like adding more memory and setting the vm
> values. For the latter, it seems if you set:
>
> vm.dirty_background_ratio = 0, then it looks like it avoids the problem
> because the kernel will immediately start to write dirty pages from the
> background worker threads, so we do not end up later needing to write
> out pages from the rbd-nbd thread to free up memory.
>
> or
>
> vm.dirty_ratio = 0, then it looks like it avoids the problem because
> the kernel will just write out the data right away, similar to above,
> but it is normally going to be written out from the thread that you are
> running your test from.
>
> And this seems optional and can result in other problems:
>
> vm.vfs_cache_pressure = 0, then for at least XFS it looks like we avoid
> one of the immediate problems where allocations would always cause the
> inode caches to be reclaimed and that memory to be written out to the
> device. For EXT4, I did not see a similar issue.
>
>> I think the severity of this problem
>> <https://tracker.ceph.com/issues/40822> (currently "minor") does not
>> match the consequences of the problem.
>>
>> This reproducible problem can cause:
>>
>> * random service outages
>> * data corruption
>> * long recovery procedures on huge filesystems
>>
>> Is it adequate to increase the severity to major or critical?
>>
>> What might be the reason that rbd-nbd runs very reliably on my Xen
>> servers as a storage repository?
>> (see https://github.com/vico-research-and-consulting/RBDSR/tree/v2.0 -
>> hundreds of devices, high workload)
>>
>> Regards
>> Marc
>>
>> On 15.08.19 at 20:07, Marc Schöchlin wrote:
>>> Hello Mike,
>>>
>>> On 15.08.19 at 19:57, Mike Christie wrote:
>>>>> Don't waste your time. I found a way to replicate it now.
>>>>>
>>>> Just a quick update.
>>>>
>>>> Looks like we are trying to allocate memory in the IO path in a way
>>>> that can swing back on us, so we can end up locking up. You are
>>>> probably not hitting this with krbd in your setup because normally
>>>> it is preallocating structs, using flags like GFP_NOIO, etc. For
>>>> rbd-nbd, we cannot preallocate some structs and cannot control the
>>>> allocation flags for some operations initiated from userspace, so it
>>>> is possible to hit this on every IO. I can replicate this now in a
>>>> second just by doing a cp -r.
>>>>
>>>> It's not going to be a simple fix. We have had a similar issue for
>>>> storage daemons like iscsid and multipathd since they were created.
>>>> It's less likely to hit with them because you only hit the paths
>>>> where they cannot control memory allocation behavior during
>>>> recovery.
>>>>
>>>> I am looking into some things now.
>>> Great to hear that the problem is now identified.
>>>
>>> As described, I'm on vacation - if you need anything after September
>>> 8th we can probably invest some time to test upcoming fixes.
>>>
>>> Regards
>>> Marc
>>>
>>>
>> --
>> GPG encryption available: 0x670DCBEC/pool.sks-keyservers.net

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com