Re: RBD I/O errors with QEMU [luminous upgrade/osd change]

Just tried it, and there is not much more log output in ceph -w (see
below), nor from the qemu process.

[15:52:43] server4:~$  /usr/bin/qemu-system-x86_64 -name one-17031 -S
-machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -realtime mlock=off
-smp 6,sockets=6,cores=1,threads=1 -uuid
79845fca-9b26-4072-bcb3-7f5206c2a531 -no-user-config -nodefaults
-chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-17031.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file='rbd:one/one-29-17031-0:id=libvirt:key=DELETEME:auth_supported=cephx\;none:mon_host=server1\:6789\;server3\:6789\;server5\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none' -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/one//datastores/100/17031/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -vnc [::]:21131 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on 2>&1 | tee kvmlogwithdebug

-> no output

The qemu command line is copied from what opennebula usually spawns,
minus the networking part.
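
For reference, the debug settings were enabled roughly like this (a
sketch only; placing them in the [client] section of ceph.conf on the
VM host, and the log file line, are assumptions):

[client]
    # settings suggested in this thread
    debug rbd = 20
    debug objecter = 20
    # assumption: without a client log file, the librbd debug output may
    # only end up on stderr ($name/$pid are ceph metavariables)
    log file = /var/log/ceph/$name.$pid.log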


[15:41:54] server4:~# ceph -w
2017-09-10 15:44:32.873281 7f59f17fa700 10 client.?.objecter ms_handle_connect 0x7f59f4150e90
2017-09-10 15:44:32.873315 7f59f17fa700 10 client.?.objecter resend_mon_ops
2017-09-10 15:44:32.873327 7f59f17fa700 10 client.?.objecter ms_handle_connect 0x7f59f41544d0
2017-09-10 15:44:32.873329 7f59f17fa700 10 client.?.objecter resend_mon_ops
2017-09-10 15:44:32.876248 7f59f9a63700 10 client.1021613.objecter _maybe_request_map subscribing (onetime) to next osd map
2017-09-10 15:44:32.876710 7f59f17fa700 10 client.1021613.objecter ms_dispatch 0x7f59f4000fe0 osd_map(9059..9059 src has 8530..9059) v3
2017-09-10 15:44:32.876722 7f59f17fa700  3 client.1021613.objecter handle_osd_map got epochs [9059,9059] > 0
2017-09-10 15:44:32.876726 7f59f17fa700  3 client.1021613.objecter handle_osd_map decoding full epoch 9059
2017-09-10 15:44:32.877099 7f59f17fa700 20 client.1021613.objecter dump_active .. 0 homeless
2017-09-10 15:44:32.877423 7f59f17fa700 10 client.1021613.objecter ms_handle_connect 0x7f59dc00c9c0
  cluster:
    id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum server5,server3,server1
    mgr: 1(active), standbys: 2, 0
    osd: 50 osds: 49 up, 49 in

  data:
    pools:   2 pools, 1088 pgs
    objects: 500k objects, 1962 GB
    usage:   5914 GB used, 9757 GB / 15672 GB avail
    pgs:     1088 active+clean

  io:
    client:   18822 B/s rd, 799 kB/s wr, 6 op/s rd, 52 op/s wr


2017-09-10 15:44:37.876324 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:44:42.876437 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:44:45.223970 7f59f17fa700 10 client.1021613.objecter ms_dispatch 0x7f59f4000fe0 log(2 entries from seq 215046 at 2017-09-10 15:44:45.164162) v1
2017-09-10 15:44:47.876548 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:44:52.876668 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:44:57.876770 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:02.876888 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:07.877001 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:12.877120 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:17.877229 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:22.877349 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:27.877455 7f59f1ffb700 10 client.1021613.objecter tick

Jason Dillaman <jdillama@xxxxxxxxxx> writes:

> Sorry -- meant VM. Yes, librbd uses ceph.conf for configuration settings.
>
> On Sun, Sep 10, 2017 at 9:22 AM, Nico Schottelius
> <nico.schottelius@xxxxxxxxxxx> wrote:
>>
>> Hello Jason,
>>
>> I think there is a slight misunderstanding:
>> There is only one *VM* left that we did not start, not an OSD.
>>
>> Or does librbd also read ceph.conf, and will that cause qemu to
>> output debug messages?
>>
>> Best,
>>
>> Nico
>>
>> Jason Dillaman <jdillama@xxxxxxxxxx> writes:
>>
>>> I presume QEMU is using librbd instead of a mapped krbd block device,
>>> correct? If that is the case, can you add "debug-rbd=20" and "debug
>>> objecter=20" to your ceph.conf and boot up your last remaining broken
>>> OSD?
>>>
>>> On Sun, Sep 10, 2017 at 8:23 AM, Nico Schottelius
>>> <nico.schottelius@xxxxxxxxxxx> wrote:
>>>>
>>>> Good morning,
>>>>
>>>> yesterday we had an unpleasant surprise that I would like to discuss:
>>>>
>>>> Many (not all!) of our VMs suddenly died (the qemu process exited).
>>>> When trying to restart them, the qemu process reported I/O errors on
>>>> the disks and the OS was not able to boot (i.e. it stopped in the
>>>> initramfs).
>>>>
>>>> When we exported the image from rbd and loop-mounted it, however,
>>>> there were no I/O errors and the filesystem could be mounted cleanly [-1].
>>>>
>>>> We are running Devuan with kernel 3.16.0-4-amd64 and saw that some
>>>> problems are reported with kernels < 3.16.39, so we upgraded one host
>>>> (which serves as a VM host and also runs ceph OSDs) to Devuan ascii
>>>> with kernel 4.9.0-3-amd64.
>>>>
>>>> Trying to start the VM again on this host, however, resulted in the
>>>> same I/O problem.
>>>>
>>>> We then took the "stupid" approach of exporting an image and importing
>>>> it again under the same name [0]. Surprisingly, this solved our problem
>>>> reproducibly for all affected VMs and allowed us to go back online.
>>>>
>>>> We intentionally left one broken VM in our system (a test VM) so that we
>>>> have the chance of debugging further what happened and how we can
>>>> prevent it from happening again.
>>>>
>>>> As you might have guessed, there were some events prior to this:
>>>>
>>>> - Some weeks before, we upgraded our cluster from kraken to luminous
>>>>   (in the right order: mons first, then adding mgrs)
>>>>
>>>> - About a week ago we added the first HDD to our cluster and modified
>>>>   the crushmap so that the "one" pool (used by opennebula) still selects
>>>>   only SSDs
>>>>
>>>> - Roughly 3 hours prior to the event, we took one of the 5 hosts out of
>>>>   the ceph cluster, as we intended to replace its filestore-based OSDs
>>>>   with bluestore
>>>>
>>>> - A short time before the event we re-added an OSD, but did not "up" it
>>>>
>>>> To our understanding, none of these actions should have triggered this
>>>> behaviour; however, we are aware that the client libraries were also
>>>> updated with the upgrade to luminous, and that not all qemu processes
>>>> were restarted. [1]
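>>>>
>>>> One way to spot such never-restarted clients might be to look for
>>>> qemu processes that still map the pre-upgrade (now deleted) librbd;
>>>> an untested sketch:
>>>>
>>>> # list qemu processes whose mapped librbd was replaced on disk
>>>> for pid in $(pgrep -f qemu-system-x86_64); do
>>>>     grep -q 'librbd.*(deleted)' /proc/$pid/maps 2>/dev/null &&
>>>>         echo "PID $pid still maps the old librbd"
>>>> done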
>>>>
>>>> After this long story, I was wondering about the following things:
>>>>
>>>> - Why did this happen at all?
>>>>   And what is different after we reimported the image?
>>>>   Can it be related to disconnecting the image from its parent
>>>>   (opennebula creates clones prior to starting a VM)? See the sketch
>>>>   after these questions.
>>>>
>>>> - We have one broken VM left - is there a way to get it back running
>>>>   without doing the export/import dance?
>>>>
>>>> - Is http://tracker.ceph.com/issues/18807 related to our issue, and if
>>>>   so, how? How is the kernel involved in running VMs that use librbd?
>>>>   rbd showmapped does not show any mapped images, as qemu connects
>>>>   directly to ceph.
>>>>
>>>>   We tried upgrading one host to Devuan ascii, which uses 4.9.0-3-amd64,
>>>>   but that did not fix our problem.
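>>>>
>>>>   A sketch (untested; names are placeholders in the style of [0]) of
>>>>   how to check whether the remaining broken image is still a clone,
>>>>   and what else hangs off its parent:
>>>>
>>>>   rbd info one/$img | grep -i parent
>>>>   rbd children one/<parent-image>@<parent-snapshot>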
>>>>
>>>> We would appreciate any pointer!
>>>>
>>>> Best,
>>>>
>>>> Nico
>>>>
>>>>
>>>> [-1]
>>>> losetup -P /dev/loop0 /var/tmp/one-staging/monitoring1-disk.img
>>>> mkdir /tmp/monitoring1-mnt
>>>> mount /dev/loop0p1 /tmp/monitoring1-mnt/
>>>>
>>>>
>>>> [0]
>>>>
>>>> rbd export one/$img /var/tmp/one-staging/$img
>>>> rbd rm one/$img
>>>> rbd import /var/tmp/one-staging/$img one/$img
>>>> rm /var/tmp/one-staging/$img
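>>>>
>>>> A possibly less invasive alternative (which we have not tried) would
>>>> be to flatten the clone in place, detaching it from its parent
>>>> without the export/import round trip:
>>>>
>>>> rbd flatten one/$img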
>>>>
>>>> [1]
>>>> [14:05:34] server5:~# ceph features
>>>> {
>>>>     "mon": {
>>>>         "group": {
>>>>             "features": "0x1ffddff8eea4fffb",
>>>>             "release": "luminous",
>>>>             "num": 3
>>>>         }
>>>>     },
>>>>     "osd": {
>>>>         "group": {
>>>>             "features": "0x1ffddff8eea4fffb",
>>>>             "release": "luminous",
>>>>             "num": 49
>>>>         }
>>>>     },
>>>>     "client": {
>>>>         "group": {
>>>>             "features": "0xffddff8ee84fffb",
>>>>             "release": "kraken",
>>>>             "num": 1
>>>>         },
>>>>         "group": {
>>>>             "features": "0xffddff8eea4fffb",
>>>>             "release": "luminous",
>>>>             "num": 4
>>>>         },
>>>>         "group": {
>>>>             "features": "0x1ffddff8eea4fffb",
>>>>             "release": "luminous",
>>>>             "num": 61
>>>>         }
>>>>     }
>>>> }
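>>>>
>>>> To find out from which host the single remaining kraken-feature client
>>>> connects, something along these lines on a mon host might help
>>>> (untested sketch; the mon name is a placeholder, and the feature value
>>>> is the kraken one from the output above):
>>>>
>>>> ceph daemon mon.server5 sessions | grep 0xffddff8ee84fffb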
>>>>
>>>>
>>>> --
>>>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>>
>>
>> --
>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


