Re: OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

On Wed, May 17, 2017 at 10:51 AM Steve Anthony <sma310@xxxxxxxxxx> wrote:
Hello,

The other day I started a backup (create snap, export and import into a
second cluster; one RBD image is still exporting/importing as of this
message) while recovery operations were ongoing on the primary cluster.
I then noticed an OSD (osd.126) start to crash, so I reweighted it to 0
to prepare to remove it. Shortly thereafter the problem seemed to move
to another OSD (osd.223). After looking at the logs, both OSDs appear to
have the same problem. I'm running Ceph version 9.2.1
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.

Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe

Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA


May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15
10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors
{default=true}
May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
ReplicatedPG::on_local_recover(const hobject_t&, const
object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15
10:39:55.322306
May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED
assert(recovery_info.oi.snaps.size())

May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function
'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15
16:45:30.799839
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED
assert(recovery_info.oi.snaps.size())
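
To check whether several OSDs are failing on the same assert, the journal
excerpts above can be scanned mechanically. The following is only a minimal
sketch: the function names (unwrap, failed_asserts) are hypothetical, and it
assumes the syslog-style prefix and the hard line wrapping seen in this
message.

```python
import re
from collections import defaultdict

# A record starts with a syslog-style timestamp such as "May 15 10:39:55 ".
RECORD_START = re.compile(r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} ")

# "<host> ceph-osd[<pid>]: <file>: <line>: FAILED assert(<expr>)"
FAILED_ASSERT = re.compile(
    r"(?P<host>\S+) ceph-osd\[(?P<pid>\d+)\]: "
    r"(?P<file>\S+): (?P<line>\d+): FAILED\s+assert\((?P<expr>.*?)\)\s*$"
)


def unwrap(text):
    """Join hard-wrapped continuation lines back onto their log record."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def failed_asserts(text):
    """Map (file, line, expression) to the set of hosts that hit it."""
    hits = defaultdict(set)
    for record in unwrap(text):
        m = FAILED_ASSERT.search(record)
        if m:
            key = (m["file"], int(m["line"]), m["expr"])
            hits[key].add(m["host"])
    return dict(hits)
```

Feeding it the two excerpts above should group ceph13 and ceph19 under the
same osd/ReplicatedPG.cc assert, which is the pattern described here.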


I did some searching and thought it might be related to
http://tracker.ceph.com/issues/13837 aka
https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims
to 0 for all OSDs. No luck. I had changed the systemd service file to
automatically restart osd.223 while recovery was happening, but recovery
appears to have stalled; I suppose osd.223 needs to be up for the
remaining objects.

Yeah, these aren't really related as far as I can see, though I haven't spent much time in this code that I can recall. The OSD is receiving a "push" as part of log recovery and finds that the object it's receiving is a snapshot object with no information about the snap IDs that exist, which is weird. I don't know of any way a client could break it either, but maybe David or Jason know something more.
-Greg
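
The condition described above can be illustrated with a toy model. This is
not the Ceph source: the class and function names are hypothetical, and the
sentinel value used for the head object is an assumption. It only mirrors the
idea that a pushed snapshot clone is expected to carry a non-empty list of
snap IDs.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed sentinel: snap ids below this value denote snapshot clones,
# while the head (live) object uses the sentinel itself.
NOSNAP = 2**64 - 2


@dataclass
class PushedObject:
    name: str
    snap: int                                       # snap id of this object
    snaps: List[int] = field(default_factory=list)  # snaps the clone belongs to


def on_local_recover(obj):
    """Return the snap ids to index for obj, or fail as the assert does."""
    if obj.snap < NOSNAP:
        # A clone with no snap ids cannot be indexed for later snap trimming.
        if not obj.snaps:
            raise AssertionError("clone %s@%d has no snap ids" % (obj.name, obj.snap))
        return sorted(set(obj.snaps))
    # Head objects are not indexed by snap.
    return []
```

In this model a head object recovers fine, a clone with snaps [4, 7] is
indexed under those IDs, and a clone with an empty snaps list fails, which
matches the crash seen on osd.126 and osd.223.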
 

I didn't see anything else online, so I thought I'd see if anyone has seen
this before or has any other ideas. Thanks for taking the time.

-Steve


--
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma310@xxxxxxxxxx


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com