Hello Mike,

Sigh, I set this yesterday on my system ("sysctl vm.dirty_background_ratio=0")
and got an additional crash this night :-(

I have now restarted the system and invoked all of the following commands
mentioned in your last mail:

sysctl vm.dirty_background_ratio=0
sysctl vm.dirty_ratio=0
sysctl vm.vfs_cache_pressure=0

Let's see if that helps...
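For reference, this is roughly how I am persisting the three settings across
reboots - a sketch assuming a distribution that reads /etc/sysctl.d/ at boot
(the file name below is my own choice):

# as root: drop the settings into a sysctl snippet so they survive a reboot
cat > /etc/sysctl.d/90-rbd-nbd-workaround.conf <<'EOF'
# workaround for rbd-nbd lockups under memory pressure (see this thread)
vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
# optional and possibly problematic, see the note quoted below
vm.vfs_cache_pressure = 0
EOF

# reload all sysctl configuration files
sysctl --system

If --system is not available in your procps version,
"sysctl -p /etc/sysctl.d/90-rbd-nbd-workaround.conf" should load just this file.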
Regards
Marc

On 03.09.19 at 04:41, Mike Christie wrote:
> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
>> Hello Mike,
>>
>> I am having a quick look at this on vacation because my coworker
>> reports daily and continuous crashes ;-)
>> Any updates here (I am aware that this is not very easy to fix)?
> I am still working on it. It basically requires rbd-nbd to be rewritten
> so that it preallocates the memory used for IO, and when it can't, like
> when doing network IO, it requires adding an interface to tell the
> kernel not to use allocation flags that can cause disk IO back onto the
> device.
>
> There are some workarounds like adding more memory and setting the vm
> values. For the latter, it seems if you set:
>
> vm.dirty_background_ratio = 0, then it looks like it avoids the problem
> because the kernel will immediately start to write dirty pages from the
> background worker threads, so we do not end up later needing to write
> out pages from the rbd-nbd thread to free up memory.
>
> or
>
> vm.dirty_ratio = 0, then it looks like it avoids the problem because
> the kernel will just write out the data right away, similar to above,
> but it is normally going to be written out from the thread that you are
> running your test from.
>
> And this seems optional and can result in other problems:
>
> vm.vfs_cache_pressure = 0, then for at least XFS it looks like we avoid
> one of the immediate problems where allocations would always cause the
> inode caches to be reclaimed and that memory to be written out to the
> device. For EXT4, I did not see a similar issue.
>
>> I think the severity of this problem
>> <https://tracker.ceph.com/issues/40822> (currently "minor") does not
>> match the consequences of the problem.
>>
>> This reproducible problem can cause:
>>
>> * random service outages
>> * data corruption
>> * long recovery procedures on huge filesystems
>>
>> Is it adequate to increase the severity to major or critical?
>>
>> What might be the reason that rbd-nbd runs very reliably on my Xen
>> servers as a storage repository?
>> (see https://github.com/vico-research-and-consulting/RBDSR/tree/v2.0 -
>> hundreds of devices, high workload)
>>
>> Regards
>> Marc
>>
>> On 15.08.19 at 20:07, Marc Schöchlin wrote:
>>> Hello Mike,
>>>
>>> On 15.08.19 at 19:57, Mike Christie wrote:
>>>>> Don't waste your time. I found a way to replicate it now.
>>>>>
>>>> Just a quick update.
>>>>
>>>> Looks like we are trying to allocate memory in the IO path in a way
>>>> that can swing back on us, so we can end up locking up. You are
>>>> probably not hitting this with krbd in your setup because normally
>>>> it is preallocating structs, using flags like GFP_NOIO, etc. For
>>>> rbd-nbd, we cannot preallocate some structs and cannot control the
>>>> allocation flags for some operations initiated from userspace, so it
>>>> is possible to hit this on every IO. I can replicate this now in a
>>>> second just by doing a cp -r.
>>>>
>>>> It's not going to be a simple fix. We have had a similar issue for
>>>> storage daemons like iscsid and multipathd since they were created.
>>>> It's less likely to hit with them because you only hit the paths
>>>> where they cannot control memory allocation behavior during
>>>> recovery.
>>>>
>>>> I am looking into some things now.
>>> Great to hear that the problem is now identified.
>>>
>>> As described, I'm on vacation - if you need anything after September
>>> 8th we can probably invest some time to test upcoming fixes.
>>>
>>> Regards
>>> Marc
>>>
>>>
>> --
>> GPG encryption available: 0x670DCBEC/pool.sks-keyservers.net

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com