Thank you, Jason. Not sure how I missed that step.
On 2018-07-06 08:34 AM, Jason Dillaman wrote:
There have been several similar reports on the mailing list about this [1][2][3][4] that are always a result of skipping step 6 from the Luminous upgrade guide [5]. The new (starting with Luminous) 'profile rbd'-style caps are designed to simplify caps going forward [6].

TL;DR: your OpenStack CephX users need to have permission to blacklist dead clients that failed to properly release the exclusive lock.
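
For reference, a minimal sketch of the 'profile rbd'-style caps that step amounts to; the client names (client.glance, client.cinder) and pool names (images, volumes, vms) below are assumptions and need to match your deployment:

# NOTE: client names and pool names here are assumptions; adjust to your deployment
ceph auth caps client.glance mon 'profile rbd' osd 'profile rbd pool=images'
ceph auth caps client.cinder mon 'profile rbd' osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'

The 'profile rbd' mon cap includes the "osd blacklist" permission, which is what lets a client blacklist a dead peer and break its stale exclusive lock.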
Good morning all,
After losing all power to our DC last night due to a storm, nearly all of the volumes in our Pike cluster are unmountable. Of the 30 VMs in use at the time, only one has been able to successfully mount and boot from its rootfs. We are using Ceph as the backend storage for cinder and glance. Any help or pointers to bring this back online would be appreciated.
What most of the volumes are seeing is:
[ 2.622252] SGI XFS with ACLs, security attributes, no debug enabled
[ 2.629285] XFS (sda1): Mounting V5 Filesystem
[ 2.832223] sd 2:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 2.838412] sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
[ 2.842383] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
[ 2.846152] sd 2:0:0:0: [sda] CDB: Write(10) 2a 00 00 80 2c 19 00 04 00 00
[ 2.850146] blk_update_request: I/O error, dev sda, sector 8399897
or
[ 2.590178] EXT4-fs (vda1): INFO: recovery required on readonly filesystem
[ 2.594319] EXT4-fs (vda1): write access will be enabled during recovery
[ 2.957742] print_req_error: I/O error, dev vda, sector 227328
[ 2.962468] Buffer I/O error on dev vda1, logical block 0, lost async page write
[ 2.967933] Buffer I/O error on dev vda1, logical block 1, lost async page write
[ 2.973076] print_req_error: I/O error, dev vda, sector 229384
As a test with one of the less critical VMs, I deleted the VM and mounted its volume on the one VM I managed to start. The results were not promising:
# dmesg |tail
[ 5.136862] type=1305 audit(1530847244.811:4): audit_pid=496 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[ 7.726331] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[29374.967315] scsi 2:0:0:1: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[29374.988104] sd 2:0:0:1: [sdb] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
[29374.991126] sd 2:0:0:1: Attached scsi generic sg1 type 0
[29374.995302] sd 2:0:0:1: [sdb] Write Protect is off
[29374.997109] sd 2:0:0:1: [sdb] Mode Sense: 63 00 00 08
[29374.997186] sd 2:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[29375.005968] sdb: sdb1
[29375.007746] sd 2:0:0:1: [sdb] Attached SCSI disk
# parted /dev/sdb
GNU Parted 3.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: QEMU QEMU HARDDISK (scsi)
Disk /dev/sdb: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
1 1049kB 42.9GB 42.9GB primary xfs boot
# mount -t xfs /dev/sdb temp
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
# xfs_repair /dev/sdb
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!
attempting to find secondary superblock...
Which eventually fails. The Ceph cluster looks healthy, and I can export the volumes from rbd. I can find no other errors in Ceph or OpenStack indicating a fault in either system.
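
(For what it's worth, one way to attempt recovery offline from such an export; a sketch only, where the pool name "volumes" and the image name IMAGE are placeholders for the actual names:)

# placeholders: pool "volumes" and image IMAGE; adjust to the real names
rbd export volumes/IMAGE /var/tmp/IMAGE.raw   # copy the image out of the cluster
losetup -fP --show /var/tmp/IMAGE.raw         # prints e.g. /dev/loop0; partition 1 appears as /dev/loop0p1
xfs_repair /dev/loop0p1                       # repair the XFS filesystem inside partition 1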
- Is this recoverable?
- What happened to all of these volumes, and can this be prevented from occurring again? Note that any VM that was shut down at the time of the outage appears to be fine.
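
(To check whether stale exclusive locks from the dead clients are what is holding the images up, something along these lines can be inspected; again, "volumes" and IMAGE are placeholders:)

# placeholders: pool "volumes" and image IMAGE
rbd status volumes/IMAGE     # lists clients still watching the image
rbd lock ls volumes/IMAGE    # shows an exclusive lock left behind by a dead client
ceph osd blacklist ls        # lists client addresses currently blacklisted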
Relevant versions:
Base OS: all CentOS 7.5
Ceph: Luminous 12.2.5-0
OpenStack: Latest Pike releases in centos-release-openstack-pike-1-1
  nova 16.1.4-1
  cinder 11.1.1-1
--
Gary Molenkamp Computer Science/Science Technology Services
Systems Administrator University of Western Ontario
molenkam@xxxxxx http://www.csd.uwo.ca
(519) 661-2111 x86882 (519) 661-3566
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com