[PATCH 0/2] osd: force restart peering when osd is marked down

Hi,

I've found that a PG can get stuck in 'unfound_recovery' indefinitely
after some OSDs are marked down.

For example, the following steps reproduce it.

1) Create an EC 2+1 pool. Assume a PG has the up/acting set [1,0,2].
2) Execute "ceph osd out osd.0 osd.2". Now the PG has the up/acting
   set [1,3,5].
3) Put some objects into the PG.
4) Execute "ceph osd in osd.0 osd.2". The PG starts recovering to [1,0,2].
5) Execute "ceph osd down osd.3 osd.5". (These downs are fake; osd.3
   and osd.5 are not actually down.)
   This makes the PG transition to 'unfound_recovery' and stay there
   forever.

Interestingly, this bad situation is resolved by marking down
another OSD.

6) Executing "ceph osd down osd.0" (any OSD in the acting set will do)
   clears 'unfound_recovery' and restarts recovery.
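
The steps above can be written down as a dry-run command list. This is
only a sketch: the pool name, profile name, pg_num of 1, object names and
source file are hypothetical, and the OSD ids are the ones assumed in the
steps, so they have to be adjusted to the real up/acting set of the PG.
Nothing is executed; the commands are only printed.

```python
import shlex


def repro_commands(pool="ecpool", profile="ec21", n_objects=10):
    """Return the reproduction steps as a list of argv lists (dry run)."""
    cmds = [
        # Step 1: EC 2+1 pool (pool and profile names are hypothetical).
        ["ceph", "osd", "erasure-code-profile", "set", profile, "k=2", "m=1"],
        ["ceph", "osd", "pool", "create", pool, "1", "1", "erasure", profile],
        # Step 2: move the PG away from osd.0 and osd.2.
        ["ceph", "osd", "out", "osd.0", "osd.2"],
    ]
    # Step 3: write some objects (object names and file are hypothetical).
    for i in range(n_objects):
        cmds.append(["rados", "-p", pool, "put", f"obj{i}", "/etc/hosts"])
    cmds += [
        # Step 4: bring them back in, so recovery to the old set starts.
        ["ceph", "osd", "in", "osd.0", "osd.2"],
        # Step 5: fake-down the OSDs that hold the newly written data.
        ["ceph", "osd", "down", "osd.3", "osd.5"],
        # Step 6: the workaround, marking down a member of the acting set.
        ["ceph", "osd", "down", "osd.0"],
    ]
    return cmds


if __name__ == "__main__":
    for argv in repro_commands():
        print(shlex.join(argv))
```

Steps 5 and 6 would normally be separated by a check of the PG state
(for example with "ceph health detail") to confirm it is stuck.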


From my investigation, if the OSD that is marked down is not a member of
the current up/acting set, its PG might stay in 'ReplicaActive' and discard
peering requests from the primary. Thus the primary OSD can't leave the
unfound state.
PGs on an OSD that has been marked down should transition to the 'Reset'
state and start peering.
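
As a rough illustration of the idea, here is a toy model of the
replica-side decision. It is not Ceph code and not the patch itself: the
function names and the simplified inputs (the acting set before and after
the map change, and the set of OSDs marked down in the new map) are
assumptions made only for this sketch.

```python
def restarts_peering_reported(this_osd, old_acting, new_acting, marked_down):
    """Behaviour as described in this report: the PG only resets when the
    acting set changes, so a fake-down of a non-member goes unnoticed."""
    return tuple(old_acting) != tuple(new_acting)


def restarts_peering_proposed(this_osd, old_acting, new_acting, marked_down):
    """Proposed behaviour: additionally reset when the OSD holding this
    PG copy was itself marked down in the new map."""
    if tuple(old_acting) != tuple(new_acting):
        return True
    return this_osd in marked_down


if __name__ == "__main__":
    # Step 5 of the reproduction: acting set stays [1,0,2] while osd.3
    # and osd.5 are marked down. Look at the PG copy held by osd.3.
    acting = (1, 0, 2)
    down = {3, 5}
    print(restarts_peering_reported(3, acting, acting, down))
    print(restarts_peering_proposed(3, acting, acting, down))
```

In this model the copy on osd.3 stays where it is under the reported
behaviour and goes back to peering under the proposed one, while an OSD
that was not marked down is unaffected by the extra check.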


I'll post two patches. The first one fixes this issue.
The second one is a trivial optimization (optional).

Thanks,
Kouya



