On 07/31/2019 05:20 AM, Marc Schöchlin wrote:
> Hello Jason,
>
> it seems that there is something wrong in the rbd-nbd implementation.
> (I also added this information at https://tracker.ceph.com/issues/40822)
>
> The problem does not seem to be related to kernel releases, filesystem types or the ceph and network setup.
> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem to have the described problem.
>
> Last night an 18 hour test run with the following procedure was successful:
> -----
> #!/bin/bash
> set -x
> while true; do
>    date
>    find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 2 gzip -v
>    date
>    find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 -n 2 gunzip -v
> done
> -----
> Previous tests crashed in a reproducible manner with "-P 1" (single io gzip/gunzip) after a few minutes up to 45 minutes.
>
> Overview of my tests:
>
> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
>   -> 18 hour test run was successful, no dmesg output
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
>   -> parallel krbd device usage with 99% io usage worked without a problem while running the test
> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
>   -> parallel krbd device usage with 99% io usage worked without a problem while running the test
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
>   -> failed after < 10 minutes
>   -> system runs into high system load, system is almost unusable, unable to shut down the system, hard reset of vm necessary, manual exclusive lock removal is necessary before remapping the device

Did you see Mykola's question on the tracker about this test? Did the system become unusable at 13:00? Above you said it took less than 10 minutes, so we want to clarify whether the test started at 12:39 and failed at 12:49, or started at 12:49 and failed by 13:00.

> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created

How many CPUs and how much memory does the VM have?

I am not sure which of the tests above it covers, but for test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like the command that probably triggered the timeout got stuck in safe_write or write_fd, because we see:

// The command completed, and right after this log message we try to write the reply and data to the nbd.ko module.
2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got: [4500000000000000 READ 24043755000~20000 0]

// We got stuck here, two minutes go by, and the timeout fires. That kills the socket, so we get an error at this point, and after that rbd-nbd is going to exit.
2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500000000000000 READ 24043755000~20000 0]: failed to write replay data: (32) Broken pipe
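In case it helps others follow along: the lines above come from the client-side rbd-nbd log. Something like the following should be enough to capture the same writer_entry / "Broken pipe" sequence; the log path and debug levels here are my assumptions, so adjust them for your environment:

-----
# Sketch only: enable verbose client-side logging for rbd-nbd
# (log path and levels are assumed, not taken from Marc's setup).
cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
    log file = /var/log/ceph/ceph-client.$name.$pid.log
    debug rbd = 20
    debug ms = 1
EOF
# Remap the rbd-nbd device afterwards so the new settings take effect.
-----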
We could hit this stall in a couple of ways:

1. The block layer sends a command that is larger than the socket's send buffer limits. These are the values you sometimes set in sysctl.conf, like:

   net.core.rmem_max
   net.core.wmem_max
   net.core.rmem_default
   net.core.wmem_default
   net.core.optmem_max

There do not seem to be any checks in the code to make sure the command size stays within those limits. I will send a patch, but that will not help you right now. The max io size for nbd is 128k, so make sure your net.core values are large enough; if they were too small, increase them in sysctl.conf and retry (see the sketch after this list).

2. If memory is low on the system, we could also be stuck trying to allocate memory in the kernel in that code path. rbd-nbd just uses more memory per device, which could be why we do not see the problem with krbd.

3. I wonder if we are hitting the PF_MEMALLOC bug Ilya hit with krbd. He removed that code from krbd; I will ping him about it.
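To rule out case 1 quickly, checking the current limits and raising them well above the 128k nbd io size is probably enough. The 8 MB values below are just an arbitrary example on my side, not a tuned recommendation:

-----
# Sketch only: check the current socket buffer limits ...
sysctl net.core.rmem_max net.core.wmem_max net.core.rmem_default \
       net.core.wmem_default net.core.optmem_max

# ... and raise them (8388608 = 8 MB is an arbitrary example value).
cat >> /etc/sysctl.conf <<'EOF'
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 8388608
net.core.wmem_default = 8388608
EOF
sysctl -p

# Remap the rbd-nbd device and rerun the gzip/gunzip loop afterwards.
-----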