I'm seeing this again on two OSDs after adding another 20 disks to my
cluster. Is there some way I can determine which snapshots the recovery
process is looking for? Or maybe find and remove the objects it's trying to
recover, since there's apparently a problem with them? Something along the
lines of the commands below is what I had in mind. Thanks!
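These are only a sketch; the pool name, RBD object prefix, and OSD paths are
placeholders from my setup, and I'm not sure ceph-objectstore-tool is even
the right tool for the removal part:

  # Look for the object named around the failed recovery push in the crash log
  grep -B20 'FAILED assert(recovery_info.oi.snaps.size())' /var/log/ceph/ceph-osd.126.log

  # Ask the cluster which snapshots it thinks exist for a suspect object
  rados -p rbd listsnaps rbd_data.1234567890ab.0000000000000000

  # With the OSD stopped, list matching objects directly in its store
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-126 \
      --journal-path /var/lib/ceph/osd/ceph-126/journal \
      --op list | grep rbd_data.1234567890ab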
-Steve
On 05/18/2017 01:06 PM, Steve Anthony wrote:
Hmmm, after crashing every 30 seconds for a few days, it's apparently
running normally again. Weird. Since it's looking for a snapshot object, I
was thinking that re-enabling snaptrimming and removing all the snapshots in
the pool might remove that object (and the problem). I never got to that
point this time, but I'm going to need to cycle more OSDs in and out of the
cluster, so if it happens again I might try that (roughly the commands
below) and post an update.
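If it does come to that, the plan would be something like the following; the
image name is a placeholder, and I'd want to sanity-check the injectargs
value first:

  # Re-enable snap trimming (I had set this to 0 earlier; 2 is the default)
  ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 2'

  # Then remove all snapshots for each image in the pool
  rbd snap ls rbd/myimage
  rbd snap purge rbd/myimage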
Thanks!
-Steve
On 05/17/2017 03:17 PM, Gregory Farnum wrote:
Hello,

After starting a backup the other day while recovery operations on the
primary cluster were ongoing, I noticed an OSD (osd.126) start to crash; I
reweighted it to 0 to prepare to remove it. The backup is a create snap,
then export and import into a second cluster (roughly the commands sketched
below); one RBD image is still exporting/importing as of this message.
Shortly thereafter the problem seemed to move to another OSD (osd.223).
Looking at the logs, they appear to have the same problem. I'm running Ceph
version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
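For reference, the backup for each image is essentially the following; the
image, snapshot, and host names here are placeholders, not the actual
script:

  # Take a snapshot on the primary cluster
  rbd snap create rbd/myimage@backup-20170515

  # Stream it into the second cluster
  rbd export rbd/myimage@backup-20170515 - | ssh backuphost rbd import - rbd/myimage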
Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe
Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA
May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors {default=true}
May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15 10:39:55.322306
May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED assert(recovery_info.oi.snaps.size())
May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15 16:45:30.799839
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED assert(recovery_info.oi.snaps.size())
I did some searching and thought it might be related to
http://tracker.ceph.com/issues/13837 (aka
https://bugzilla.redhat.com/show_bug.cgi?id=1351320), so I disabled
scrubbing and deep-scrubbing and set osd_pg_max_concurrent_snap_trims to 0
for all OSDs (commands below). No luck. I had changed the systemd service
file to automatically restart osd.223 while recovery was happening, but
recovery appears to have stalled; I suppose the OSD needs to be up for the
remaining objects.
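For completeness, this is more or less what I ran to turn those off; the
exact invocation may have differed slightly, but the effect was the same:

  # Stop scrubbing cluster-wide
  ceph osd set noscrub
  ceph osd set nodeep-scrub

  # Stop snap trimming on all OSDs
  ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 0'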
Yeah, these aren't really related that I can see, though I haven't spent
much time in this code that I can recall. The OSD is receiving a "push" as
part of log recovery and finds that the object it's receiving is a snapshot
object without having any information about the snap IDs that exist, which
is weird. I don't know of any way a client could break it either, but maybe
David or Jason know something more.
-Greg
I didn't see anything else online, so I thought I'd see if anyone has seen
this before or has any other ideas. Thanks for taking the time.
-Steve
--
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma310@xxxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com