Re: reproducible rbd-nbd crashes

Hello Jason,

thanks for your response.
See my inline comments.

Am 31.07.19 um 14:43 schrieb Jason Dillaman:
On Wed, Jul 31, 2019 at 6:20 AM Marc Schöchlin <ms@xxxxxxxxxx> wrote:


The problem does not seem to be related to kernel release, filesystem type, or the ceph and network setup.
Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem to have the described problem.
 ...

It's basically just a log message tweak and some changes to how the
process is daemonized. If you could re-test w/ each release after
12.2.5 and pin-point where the issue starts occurring, we would have
something more to investigate.

Are there changes related to https://tracker.ceph.com/issues/23891?


You showed me the very small set of changes in rbd-nbd itself.
What about librbd, librados, ...?

What else can we do to find a detailed reason for the crash?
Do you think it would be useful to activate core-dump creation for that process?
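
Activating core dumps mostly amounts to removing the core size limit in the shell that maps the device and pointing the kernel at a writable dump directory. A minimal sketch — the /var/crash path, the core_pattern format, and the gdb step are assumptions, adjust for your distribution:

```shell
# 1. Remove the per-process core size limit in the shell that will
#    (re-)map the device, so rbd-nbd inherits it.
ulimit -c unlimited
ulimit -c   # should now print "unlimited"

# 2. As root, tell the kernel where to write cores
#    (%e = executable, %p = pid, %t = timestamp) -- shown, not executed here:
#    echo '/var/crash/core.%e.%p.%t' > /proc/sys/kernel/core_pattern

# 3. Re-map the image from this same shell and re-run the workload:
#    rbd-nbd map pool/image
# After the next crash, open the core with matching debug symbols installed:
#    gdb /usr/bin/rbd-nbd /var/crash/core.rbd-nbd.<pid>.<time>
```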

What's next? Is it a good idea to do a binary search between 12.2.5 and 12.2.12?
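
With six point releases between the last known-good and a known-bad build, a bisection needs only about three test runs instead of six. A minimal sketch of the bisection order, assuming the failure persists once introduced; the fake `is_bad` predicate (which pretends the regression landed in 12.2.10, matching the observation above) stands in for the real stress test:

```shell
#!/bin/bash
# Luminous point releases between known-good 12.2.5 and known-bad 12.2.12.
releases=(12.2.6 12.2.7 12.2.8 12.2.9 12.2.10 12.2.11)

# Hypothetical predicate: replace with "install release $1, run the xfs/ext4
# stress workload, return 0 if the rbd-nbd map disappears".
is_bad() {
    minor=${1##*.}              # last version component, e.g. "10"
    [ "$minor" -ge 10 ]
}

# Classic bisection: narrow [lo, hi] to the first release where is_bad holds.
first_bad() {
    lo=0; hi=$((${#releases[@]} - 1))
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if is_bad "${releases[mid]}"; then
            hi=$mid             # failure present: regression is at mid or earlier
        else
            lo=$((mid + 1))     # failure absent: regression is after mid
        fi
    done
    echo "${releases[lo]}"
}

first_bad   # prints the first release that reproduces the crash
```

With the fake predicate above this prints 12.2.10 after testing only 12.2.8, 12.2.10, and 12.2.9.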

Due to the absence of a coworker I had almost no capacity to run deeper tests on this problem.
But I can say that I reproduced the problem with release 12.2.12 as well.

The new (updated) list:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
  -> 18-hour test run was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs in a high system load, system is almost unusable, unable to shutdown the system, hard reset of vm necessary, manual exclusive lock removal is necessary before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created

Regards
Marc

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
