After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

Hi!

Yesterday one of our (non-priority) clusters failed when 3 OSDs went down together (EC 8+2 pool).
This is strange, as we had upgraded from 13.2.1 to 13.2.2 only one or two hours before.
They failed at exactly the same moment, rendering the cluster (CephFS) unusable.
We are using CentOS 7 with the latest updates and the upstream Ceph repo. No cache SSDs, no external journal / WAL / DB.

OSD 29 (no disk failure in dmesg):
2018-10-03 09:47:15.074 7fb8835ce1c0  0 set uid:gid to 167:167 (ceph:ceph)
2018-10-03 09:47:15.074 7fb8835ce1c0  0 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process ceph-osd, pid 20899
2018-10-03 09:47:15.074 7fb8835ce1c0  0 pidfile_write: ignore empty --pid-file
2018-10-03 09:47:15.100 7fb8835ce1c0  0 load: jerasure load: lrc load: isa
2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev create path /var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a20000 /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a20000 /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e0800000, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.101 7fb8835ce1c0  1 bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > kv_ratio 0.5
2018-10-03 09:47:15.101 7fb8835ce1c0  1 bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 meta 0 kv 1 data 0
2018-10-03 09:47:15.101 7fb8835ce1c0  1 bdev(0x561250a20000 /var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.358 7fb8835ce1c0  1 bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29
2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev create path /var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev(0x561250a20000 /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.359 7fb8835ce1c0  1 bdev(0x561250a20000 /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e0800000, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > kv_ratio 0.5
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 meta 0 kv 1 data 0
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev create path /var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80 /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80 /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e0800000, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-29/block size 932 GiB
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs mount
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file with link count 0: file(ino 519 size 0x31e2f42 mtime 2018-10-02 12:24:22.632397 bdev 1 allocated 3200000 extents [1:0x7008200000+100000,1:0x7009000000+100000,1:0x7009100000+100000,1:0x7009200000+100000,1:0x7009300000+100000,1:0x7009400000+100000,1:0x7009500000+100000,1:0x7009600000+100000,1:0x7009700000+100000,1:0x7009800000+100000,1:0x7009900000+100000,1:0x7009a00000+100000,1:0x7009b00000+100000,1:0x7009c00000+100000,1:0x7009d00000+100000,1:0x7009e00000+100000,1:0x7009f00000+100000,1:0x700a000000+100000,1:0x700a100000+100000,1:0x700a200000+100000,1:0x700a300000+100000,1:0x700a400000+100000,1:0x700a500000+100000,1:0x700a600000+100000,1:0x700a700000+100000,1:0x700a800000+100000,1:0x700a900000+100000,1:0x700aa00000+100000,1:0x700ab00000+100000,1:0x700ac00000+100000,1:0x700ad00000+100000,1:0x700ae00000+100000,1:0x700af00000+100000,1:0x700b000000+100000,1:0x700b100000+100000,1:0x700b200000+100000,1:0x700b300000+100000,1:0x700b400000+100000,1:0x700b500000+100000,1:0x700b600000+100000,1:0x700b700000+100000,1:0x700b800000+100000,1:0x700b900000+100000,1:0x700ba00000+100000,1:0x700bb00000+100000,1:0x700bc00000+100000,1:0x700bd00000+100000,1:0x700be00000+100000,1:0x700bf00000+100000,1:0x700c000000+100000])
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs mount failed to replay log: (5) Input/output error
2018-10-03 09:47:15.538 7fb8835ce1c0  1 stupidalloc 0x0x561250b8d030 shutdown
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluestore(/var/lib/ceph/osd/ceph-29) _open_db failed bluefs mount: (5) Input/output error
2018-10-03 09:47:15.538 7fb8835ce1c0  1 bdev(0x561250a20a80 /var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.616 7fb8835ce1c0  1 bdev(0x561250a20000 /var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.870 7fb8835ce1c0 -1 osd.29 0 OSD:init: unable to mount object store
2018-10-03 09:47:15.870 7fb8835ce1c0 -1  ** ERROR: osd init failed: (5) Input/output error
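
As a first diagnostic step I was planning to run a read-only fsck against the store; as far as I know fsck (without repair) does not modify anything, but please correct me if that is wrong:

# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-29

I would expect it to fail with the same replay error, but maybe the output helps.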

OSD 42:
The disk is found by LVM and the tmpfs is created, but the service dies immediately on start without writing a log...
This disk might actually have failed.
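
To get at least some output from it, I am going to start the daemon manually in the foreground with higher debug levels (assuming I have the options right):

# ceph-osd -f --id 42 --setuser ceph --setgroup ceph --debug-bluestore 20 --debug-bluefs 20

If that prints nothing either, I guess the disk itself is gone.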

OSD 46 (same error as OSD 29; the disk does not seem to have died, no trace in dmesg):
2018-10-03 10:02:25.221 7f4d54b611c0  0 set uid:gid to 167:167 (ceph:ceph)
2018-10-03 10:02:25.221 7f4d54b611c0  0 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process ceph-osd, pid 8993
2018-10-03 10:02:25.221 7f4d54b611c0  0 pidfile_write: ignore empty --pid-file
2018-10-03 10:02:25.247 7f4d54b611c0  0 load: jerasure load: lrc load: isa 
2018-10-03 10:02:25.248 7f4d54b611c0  1 bdev create path /var/lib/ceph/osd/ceph-46/block type kernel
2018-10-03 10:02:25.248 7f4d54b611c0  1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) open path /var/lib/ceph/osd/ceph-46/block
2018-10-03 10:02:25.248 7f4d54b611c0  1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) open size 1000198897664 (0xe8e0800000, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 10:02:25.249 7f4d54b611c0  1 bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes kv_min_ratio 1 > kv_ratio 0.5
2018-10-03 10:02:25.249 7f4d54b611c0  1 bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes cache_size 536870912 meta 0 kv 1 data 0
2018-10-03 10:02:25.249 7f4d54b611c0  1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) close
2018-10-03 10:02:25.503 7f4d54b611c0  1 bluestore(/var/lib/ceph/osd/ceph-46) _mount path /var/lib/ceph/osd/ceph-46
2018-10-03 10:02:25.504 7f4d54b611c0  1 bdev create path /var/lib/ceph/osd/ceph-46/block type kernel
2018-10-03 10:02:25.504 7f4d54b611c0  1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) open path /var/lib/ceph/osd/ceph-46/block
2018-10-03 10:02:25.504 7f4d54b611c0  1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) open size 1000198897664 (0xe8e0800000, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 10:02:25.505 7f4d54b611c0  1 bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes kv_min_ratio 1 > kv_ratio 0.5
2018-10-03 10:02:25.505 7f4d54b611c0  1 bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes cache_size 536870912 meta 0 kv 1 data 0
2018-10-03 10:02:25.505 7f4d54b611c0  1 bdev create path /var/lib/ceph/osd/ceph-46/block type kernel
2018-10-03 10:02:25.505 7f4d54b611c0  1 bdev(0x564072f96a80 /var/lib/ceph/osd/ceph-46/block) open path /var/lib/ceph/osd/ceph-46/block
2018-10-03 10:02:25.505 7f4d54b611c0  1 bdev(0x564072f96a80 /var/lib/ceph/osd/ceph-46/block) open size 1000198897664 (0xe8e0800000, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 10:02:25.505 7f4d54b611c0  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-46/block size 932 GiB
2018-10-03 10:02:25.505 7f4d54b611c0  1 bluefs mount
2018-10-03 10:02:25.620 7f4d54b611c0 -1 bluefs _replay file with link count 0: file(ino 450 size 0x169964c mtime 2018-10-02 12:24:22.602432 bdev 1 allocated 1700000 extents [1:0x6fd9500000+100000,1:0x6fd9600000+100000,1:0x6fd9700000+100000,1:0x6fd9800000+100000,1:0x6fd9900000+100000,1:0x6fd9a00000+100000,1:0x6fd9b00000+100000,1:0x6fd9c00000+100000,1:0x6fd9d00000+100000,1:0x6fd9e00000+100000,1:0x6fd9f00000+100000,1:0x6fda000000+100000,1:0x6fda100000+100000,1:0x6fda200000+100000,1:0x6fda300000+100000,1:0x6fda400000+100000,1:0x6fda500000+100000,1:0x6fda600000+100000,1:0x6fda700000+100000,1:0x6fda800000+100000,1:0x6fda900000+100000,1:0x6fdaa00000+100000,1:0x6fdab00000+100000])
2018-10-03 10:02:25.620 7f4d54b611c0 -1 bluefs mount failed to replay log: (5) Input/output error
2018-10-03 10:02:25.620 7f4d54b611c0  1 stupidalloc 0x0x564073102fc0 shutdown
2018-10-03 10:02:25.620 7f4d54b611c0 -1 bluestore(/var/lib/ceph/osd/ceph-46) _open_db failed bluefs mount: (5) Input/output error
2018-10-03 10:02:25.620 7f4d54b611c0  1 bdev(0x564072f96a80 /var/lib/ceph/osd/ceph-46/block) close
2018-10-03 10:02:25.763 7f4d54b611c0  1 bdev(0x564072f96000 /var/lib/ceph/osd/ceph-46/block) close
2018-10-03 10:02:26.010 7f4d54b611c0 -1 osd.46 0 OSD:init: unable to mount object store
2018-10-03 10:02:26.010 7f4d54b611c0 -1  ** ERROR: osd init failed: (5) Input/output error
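
Before trying anything invasive on these OSDs, I wanted to take a backup of the BlueFS contents (i.e. the RocksDB files); my understanding is that bluefs-export is read-only, but again, please correct me:

# ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-46 --out-dir /root/bluefs-46-backup

I am not sure whether the export will hit the same replay error as mount does, though.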

We have had failing disks in this cluster before, but those were easily recovered by marking them out and rebalancing.
To me it looks like one disk died (there was heavy I/O on the cluster when it happened) and somehow took two additional OSDs down with it.
It is very strange that this happened only about two hours after the upgrade + reboot.
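
To rule the hardware in or out, I will also check SMART on all three drives, something like (/dev/sdX being a placeholder for the respective data disks):

# smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'

dmesg showed nothing for the OSD 29 and 46 disks, so if any drive actually died, I mainly suspect the one behind OSD 42.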

Any recommendations?
I have 8 PGs down; the remaining ones are active and recovering / rebalancing.
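
For reference, this is how I am identifying the down PGs and the OSDs they are waiting for (if I read the pg query output correctly, the interesting field is "down_osds_we_would_probe"):

# ceph health detail | grep down
# ceph pg <pgid> query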

Kind regards
Kevin
