It only bites you if you have a hard failure of a VM (i.e. the RBD image
wasn't cleanly closed and the lock wasn't cleanly released). In that case,
the next librbd client to attempt to acquire the lock will notice the dead
lock owner and will attempt to blacklist it from the cluster to ensure it
cannot write to the image.

On Thu, May 10, 2018 at 10:08 AM, Jonathan Proulx <jon@xxxxxxxxxxxxx> wrote:
> On Thu, May 10, 2018 at 09:55:15AM -0700, Jason Dillaman wrote:
> :My immediate guess is that your caps are incorrect for your OpenStack
> :Ceph user. Please refer to step 6 from the Luminous upgrade guide to
> :ensure your RBD users have permission to blacklist dead peers [1]
> :
> :[1] http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
>
> Good spotting! Thanks for the fast reply. Next question is why this
> took so long to bite me; we've been on Luminous for 6 months. Not going
> to worry too much about that last question, though.
>
> Hopefully that was the problem (it definitely was a problem).
>
> Thanks,
> -Jon
>
> :On Thu, May 10, 2018 at 9:49 AM, Jonathan Proulx <jon@xxxxxxxxxxxxx> wrote:
> :> Hi All,
> :>
> :> Recently I saw a number of RBD-backed VMs in my OpenStack cloud fail
> :> to reboot after a hypervisor crash, with errors similar to:
> :>
> :> [    5.279393] blk_update_request: I/O error, dev vda, sector 2048
> :> [    5.281427] Buffer I/O error on dev vda1, logical block 0, lost async page write
> :> [    5.284114] Buffer I/O error on dev vda1, logical block 1, lost async page write
> :> [    5.286600] Buffer I/O error on dev vda1, logical block 2, lost async page write
> :> [    5.289022] Buffer I/O error on dev vda1, logical block 3, lost async page write
> :> [    5.291515] Buffer I/O error on dev vda1, logical block 4, lost async page write
> :> [    5.338981] blk_update_request: I/O error, dev vda, sector 3088
> :>
> :> for many blocks and sectors. I was able to export the RBD images and
> :> they seemed fine; also, 'rbd flatten' made them boot again with no
> :> errors.
> :>
> :> I found this puzzling and concerning, but given the crash and limited
> :> time I didn't really follow up.
> :>
> :> Today I intentionally rebooted a VM on a healthy hypervisor and had it
> :> land in the same condition, so now I'm really worried.
> :>
> :> Running:
> :> Ubuntu 16.04
> :> ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable) (on hypervisor)
> :> {
> :>     "mon": {
> :>         "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 3
> :>     },
> :>     "mgr": {
> :>         "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 3
> :>     },
> :>     "osd": {
> :>         "ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)": 102,
> :>         "ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable)": 10,
> :>         "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 62
> :>     }
> :> }
> :> libvirt-bin 1.3.1-1ubuntu10.21
> :> qemu-system 1:2.5+dfsg-5ubuntu10.24
> :> OpenStack Mitaka
> :>
> :> Has anyone seen anything like this, or have suggestions on where to look for more details?
> :>
> :> -Jon
> :
> :--
> :Jason

--
Jason
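
For anyone hitting the same symptom: the check and fix Jason points at boil
down to making sure your OpenStack RBD users are allowed to blacklist a dead
lock holder. A minimal sketch, assuming a client user named client.cinder and
the usual volumes/vms/images pools (substitute your own user and pool names):

    # Inspect the current caps for the OpenStack RBD user
    ceph auth get client.cinder

    # Per the Luminous upgrade notes, the mon cap needs either 'profile rbd'
    # or an explicit 'allow command "osd blacklist"' so a librbd client can
    # evict a dead exclusive-lock owner.
    ceph auth caps client.cinder \
        mon 'profile rbd' \
        osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'

    # Check whether an image is still locked by a crashed client
    # (image name is a placeholder):
    rbd lock ls volumes/<volume-image-name>

    # List clients that are currently blacklisted:
    ceph osd blacklist ls

With old Jewel-style caps (e.g. mon 'allow r' only), the new client cannot
issue the blacklist, the stale exclusive lock is never broken, and the
restarted VM's writes are rejected, which is what surfaces in the guest as
the buffer I/O errors quoted above.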