Re: reproducible rbd-nbd crashes

On 08/13/2019 07:04 PM, Mike Christie wrote:
> On 07/31/2019 05:20 AM, Marc Schöchlin wrote:
>> Hello Jason,
>>
>> it seems that there is something wrong in the rbd-nbd implementation.
>> (added this information also at  https://tracker.ceph.com/issues/40822)
>>
>> The problem does not seem to be related to kernel releases, filesystem types, or the ceph and network setup.
>> Release 12.2.5 seems to work properly; at least releases >= 12.2.10 seem to have the described problem.
>>
>> Last night an 18-hour test run with the following procedure was successful:
>> -----
>> #!/bin/bash
>> set -x
>> while true; do
>>    date
>>    find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 2 gzip -v
>>    date
>>    find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 -n 2 gunzip -v
>> done
>> -----
>> Previous tests crashed reproducibly with "-P 1" (single-process gzip/gunzip) after anywhere from a few minutes to 45 minutes.
>>
>> Overview of my tests:
>>
>> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
>>   -> 18-hour test run was successful, no dmesg output
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
>>   -> parallel krbd device usage with 99% io usage worked without a problem while running the test
>> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
>>   -> parallel krbd device usage with 99% io usage worked without a problem while running the test
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
>>   -> failed after < 10 minutes
>>   -> system runs under very high load and is almost unusable; shutdown is impossible, a hard reset of the VM is necessary, and the exclusive lock has to be removed manually before remapping the device
> 
> Did you see Mykola's question on the tracker about this test? Did the
> system become unusable at 13:00?
> 
> Above you said it took less than 10 minutes, so we want to clarify if
> the test started at 12:39 and failed at 12:49 or if it started at 12:49
> and failed by 13:00.
> 
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
> 
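For reference, the manual exclusive lock removal mentioned for the no-timeout
test above would look roughly like the following. This is only a sketch: the
pool/image name is a placeholder, the lock fields have to be taken from the
actual "rbd lock ls" output, and the --timeout option is assumed to be
available in your rbd-nbd release.
-----
#!/bin/bash
# Placeholder image spec -- substitute the real pool/image from the failed test.
IMAGE=rbd/srv_ec

# Show the stale lock left behind by the dead rbd-nbd process
# (output lists locker, lock id and address).
rbd lock ls "$IMAGE"

# Remove the stale lock using the lock id and locker printed above.
rbd lock rm "$IMAGE" "<LOCK_ID>" "<LOCKER>"

# Re-map the image; --timeout sets the nbd device timeout in seconds.
rbd-nbd map --timeout 120 "$IMAGE"
-----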
> How many CPUs and how much memory does the VM have?
> 
> I'm not sure which of the tests above this covers, but for
> test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
> the command that probably triggered the timeout got stuck in safe_write
> or write_fd, because we see:
> 
> // Command completed and right after this log message we try to write
> the reply and data to the nbd.ko module.
> 
> 2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got:
> [4500000000000000 READ 24043755000~20000 0]
> 
> // We got stuck, two minutes go by, and the timeout fires. That kills
> the socket, so we get an error here, and after that rbd-nbd exits.
> 
> 2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500000000000000
> READ 24043755000~20000 0]: failed to write replay data: (32) Broken pipe
> 
> We could hit this in a couple ways:
> 
> 1. The block layer sends a command that is larger than the socket's send
> buffer limits. These are those values you sometimes set in sysctl.conf like:
> 
> net.core.rmem_max
> net.core.wmem_max
> net.core.rmem_default
> net.core.wmem_default
> net.core.optmem_max
> 
> There do not seem to be any checks in the code to make sure the IO sizes
> are aligned with those limits. I will send a patch, but that will not
> help you right now. The max IO size for nbd is 128k, so make sure your
> net values are large enough. If they were too small, increase the values
> in sysctl.conf and retry.

Not sure what I was thinking. I just checked the logs: we have done IO of
the same size as the request that got stuck and it completed fine, so the
socket sizes should be ok.

We still need to add code to make sure the IO sizes and the AF_UNIX socket
size limits match up.
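
If anyone wants to double-check their own setup, comparing the current limits
against the 128k nbd max IO size looks roughly like this (just a sketch; the
example value in the comment is illustrative, not a tuned recommendation):
-----
#!/bin/bash
# 128k (131072 bytes) is the max IO size nbd sends per request.
MAX_IO=131072
echo "nbd max IO size: $MAX_IO"

# Print the current socket buffer limits mentioned above.
for key in net.core.wmem_max net.core.wmem_default \
           net.core.rmem_max net.core.rmem_default net.core.optmem_max; do
    printf '%s = %s\n' "$key" "$(sysctl -n "$key")"
done

# If any value is below MAX_IO, raise it in /etc/sysctl.conf and reload
# with "sysctl -p", e.g.:
#   net.core.wmem_max = 8388608
-----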


> 
> 2. If memory is low on the system, we could be stuck trying to allocate
> memory in the kernel in that code path too.
> 
> rbd-nbd just uses more memory per device, so that could be why we do not
> see the problem with krbd.
> 
> 3. I wonder if we are hitting the PF_MEMALLOC bug Ilya hit with krbd.
> He removed that code from krbd. I will ping him about it.
> 
> 
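
To help narrow down point 2 above, something like this could run alongside the
gzip/gunzip test so memory pressure can be correlated with the crash time (a
sketch; the log path and interval are arbitrary):
-----
#!/bin/bash
# Log memory-pressure counters every 10 seconds while the test runs.
while true; do
    date
    grep -E 'MemFree|MemAvailable|Dirty:|Writeback:' /proc/meminfo
    sleep 10
done >> /var/log/nbd-meminfo.log
-----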

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



