reproducible rbd-nbd crashes

Hello Mike,

see my inline comments.

On 14.08.19 at 02:09, Mike Christie wrote:
-----
Previous tests crashed reproducibly with "-P 1" (single-IO gzip/gunzip) after anywhere from a few minutes up to 45 minutes.

Overview of my tests:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
  -> 18-hour test run was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs at a very high load and is almost unusable; it cannot be shut down, a hard reset of the VM is necessary, and the exclusive lock must be removed manually before remapping the device

There is something new compared to yesterday: three days ago I downgraded a production system to client version 12.2.5.
Last night this machine also crashed. So it seems that rbd-nbd is broken in general, also with release 12.2.5 and potentially earlier.

The new (updated) list:

- FAILED: kernel 4.15, ceph 12.2.5, 2TB ec-volume, ext4 file system, 120s device timeout
  -> crashed in production while snapshot trimming was running on that pool
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs at a very high load and is almost unusable; it cannot be shut down, a hard reset of the VM is necessary, and the exclusive lock must be removed manually before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created


        
How many CPUs and how much memory does the VM have?

Characteristics of the crashed VM:

  • Ubuntu 18.04 with kernel 4.15, Ceph client 12.2.5
  • Services: NFS kernel server, nothing else
  • Crash behavior:
    • a daily task for snapshot creation/deletion started at 19:00
    • a daily database backup started at 19:00, which created
      • 120 IOPS write and 1 IOPS read
      • 22K sectors per second write, 0 sectors per second read
      • 97 Mbit inbound and 97 Mbit outbound network usage (NFS server)
    • we had slow requests at the time of the crash (see the admin socket sketch after this list)
    • the rbd-nbd process terminated 25 min later without a segfault
    • the NFS usage created a 5 min load average of 10 from the start, with 5K context switches/sec
    • memory usage (kernel + userspace) was 10% of the system
    • no swap usage
  • ceph.conf
    [client]
    rbd cache = true
    rbd cache size = 67108864
    rbd cache max dirty = 33554432
    rbd cache target dirty = 25165824
    rbd cache max dirty age = 3
    rbd readahead max bytes = 4194304
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
  • 4 CPUs
  • 6 GB RAM
  • Non-default sysctl settings
    vm.swappiness = 1
    fs.aio-max-nr = 262144
    fs.file-max = 1000000
    kernel.pid_max = 4194303
    vm.zone_reclaim_mode = 0
    kernel.randomize_va_space = 0
    kernel.panic = 0
    kernel.panic_on_oops = 0
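
Since the ceph.conf above enables a per-process admin socket, here is a minimal sketch of how the client's in-flight requests could be dumped the next time an IO gets stuck. The socket path glob and the use of the objecter_requests command are assumptions based on the config above, so adjust them as needed:

#!/usr/bin/env python3
# Sketch: dump in-flight librados requests and perf counters from the
# rbd-nbd admin socket while an IO appears stuck. Assumes the admin
# socket path pattern from the ceph.conf above and that the "ceph"
# CLI is installed on the client.
import glob
import subprocess
import time

SOCK_GLOB = "/var/run/ceph/*.asok"  # matches the "admin socket" setting above

def dump(sock):
    # "objecter_requests" lists in-flight OSD operations for this client,
    # "perf dump" prints the client-side performance counters.
    for cmd in ("objecter_requests", "perf dump"):
        out = subprocess.run(["ceph", "--admin-daemon", sock, cmd],
                             capture_output=True, text=True)
        print("--- %s %s ---" % (sock, cmd))
        print(out.stdout or out.stderr)

if __name__ == "__main__":
    while True:
        for sock in glob.glob(SOCK_GLOB):
            dump(sock)
        time.sleep(30)  # poll every 30 seconds while reproducing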

I'm not sure which of the tests above this covers, but for
test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
the command that probably triggered the timeout got stuck in safe_write
or write_fd, because we see:

// The command completed, and right after this log message we try to
// write the reply and data to the nbd.ko module.

2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got:
[4500000000000000 READ 24043755000~20000 0]

// We get stuck, 2 minutes go by, and the timeout fires. That kills the
// socket, so we get an error here, and after that rbd-nbd exits.

2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500000000000000
READ 24043755000~20000 0]: failed to write replay data: (32) Broken pipe

We could hit this in a couple of ways:

1. The block layer sends a command that is larger than the socket's send
buffer limits. These are the values you sometimes set in sysctl.conf, like:

net.core.rmem_max
net.core.wmem_max
net.core.rmem_default
net.core.wmem_default
net.core.optmem_max

See the attached file (sysctl_settings.txt.gz) for my current values.

There do not seem to be any checks in the code to make sure the IO size
is aligned with these limits. I will send a patch, but that will not help
you right now. The max IO size for nbd is 128k, so make sure your net
values are large enough. Increase the values in sysctl.conf and retry if
they were too small.

Not sure what I was thinking. I just checked the logs, and we have done IO
of the same size as the request that got stuck and it was fine, so the
socket buffer sizes should be OK.

We still need to add code to make sure IO sizes and the AF_UNIX socket
buffer size limits match up.
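
For anyone who wants to sanity-check point 1 on their own host, here is a minimal sketch that compares the default AF_UNIX socket buffer sizes against the 128k nbd max IO size mentioned above. The 128k figure comes from this thread; the rest is just an assumption about how to read the local defaults, since the actual nbd socket is created by rbd-nbd itself:

#!/usr/bin/env python3
# Sketch: compare this host's default AF_UNIX socket buffer sizes with
# the 128k max nbd IO size discussed above. This only inspects system
# defaults; it does not touch the socket rbd-nbd actually uses.
import socket

NBD_MAX_IO = 128 * 1024  # max IO size for nbd, per the discussion above

def read_sysctl(name):
    # e.g. net.core.wmem_default -> /proc/sys/net/core/wmem_default
    with open("/proc/sys/" + name.replace(".", "/")) as f:
        return int(f.read())

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
sndbuf = a.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
rcvbuf = a.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("SO_SNDBUF:", sndbuf, "SO_RCVBUF:", rcvbuf)

for name in ("net.core.wmem_default", "net.core.wmem_max",
             "net.core.rmem_default", "net.core.rmem_max"):
    print(name, "=", read_sysctl(name))

if sndbuf < NBD_MAX_IO:
    print("WARNING: default SO_SNDBUF is smaller than the 128k nbd max IO size")

a.close()
b.close()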


2. If memory is low on the system, we could be stuck trying to allocate
memory in the kernel in that code path too.
Memory was definitely not low; we only had 10% memory usage at the time of
the crash.

rbd-nbd just uses more memory per device, which could be why we do not see
a problem with krbd.

3. I wonder if we are hitting a PF_MEMALLOC bug like the one Ilya hit with
krbd. He removed that code from krbd. I will ping him about it.

Interesting. I have enabled core dumps for those processes; maybe we will find something interesting there...
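
For the record, this is roughly how I check that a core dump will actually be produced for the running rbd-nbd process; the pid is passed as an argument, and kernel.core_pattern of course depends on the distribution:

#!/usr/bin/env python3
# Sketch: verify that a given process (e.g. rbd-nbd) is allowed to dump
# core (RLIMIT_CORE) and show where the kernel will write the dump
# (kernel.core_pattern).
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else "self"

with open("/proc/%s/limits" % pid) as f:
    for line in f:
        if line.startswith("Max core file size"):
            print(line.rstrip())  # a soft limit of 0 means no dump is written

with open("/proc/sys/kernel/core_pattern") as f:
    print("kernel.core_pattern =", f.read().strip())

If the limit line shows 0, the dump is silently skipped even though everything else looks fine.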

Regards
Marc


Attachment: sysctl_settings.txt.gz
Description: application/gzip

