I have posted logs and strace output from our OSDs, with details, to a
ticket in the Ceph bug tracker - see
http://tracker.ceph.com/issues/21142. There you can see exactly where
the OSDs crash, which may help if someone decides to debug this.
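If anyone wants to capture a similar trace themselves, something along
these lines should work (a sketch only - the OSD id and output path are
placeholders, not the ones we used):

    # stop the managed OSD, then run it in the foreground under strace
    systemctl stop ceph-osd@12
    strace -f -o /var/log/ceph/osd.12.strace \
        /usr/bin/ceph-osd -f --cluster ceph --id 12 \
        --setuser ceph --setgroup ceph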
JZ
On 10/01/18 22:05, Josef Zelenka wrote:
Hi, today we had a disastrous crash - we are running a 3-node cluster
with 24 OSDs in total (8 per node), with SSDs for the block.db and HDDs
for the bluestore data. The cluster is used as a radosgw backend for
storing a large number of thumbnails for a file hosting site - around
110M files in total.

We were adding an interface to the nodes, which required a restart, but
after restarting one of the nodes a lot of the OSDs were kicked out of
the cluster and rgw stopped working. We have a lot of PGs down and
unfound at the moment. The OSDs can't be started (aside from some,
which is a mystery) - they fail with
FAILED assert(interval.last > last)
and then just periodically restart. So far the cluster is broken and we
can't seem to bring it back up. We tried fscking the OSDs via
ceph-objectstore-tool, but it did no good. The root of all this seems
to be the FAILED assert(interval.last > last) error, but I can't find
any info about it or how to fix it. Has anyone else encountered it?
We're running luminous on ubuntu 16.04.
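For reference, a fsck of a single OSD's store with ceph-objectstore-tool
is invoked roughly like this (the OSD id and data path here are
placeholders; the OSD must be stopped first):

    # stop the OSD, then fsck its object store offline
    systemctl stop ceph-osd@0
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fsck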
Thanks
Josef Zelenka
Cloudevelops
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com