Re: ceph issue discuss: read pg info incorrect cause unfound objects in EC scenario

Sage Weil <sweil@xxxxxxxxxx> · Mon, 13 Jan 2020 13:22:57 +0000 (UTC)

Hi Song,

On Mon, 13 Jan 2020, song wrote:
> Hi Sage,
> 
> Happy new year!
> 
> I am a software engineer from China. Recently I found an issue with fastinfo in Ceph and would like to consult you about it.
> 
> In an EC deployment, suppose a PG goes through peering and one shard's last_update changes from lu1 (e1'3) to lu2 (e1'2). lu1 was written as fastinfo and lu2 was written as info. After that, we restart this OSD and load the PGs again. When we read the pg info from disk, the result is lu1 (fastinfo) applied on top of lu2 (info), which is incorrect; the true value should be lu2. That may cause the next peering to execute incorrectly and result in unfound objects.
> I am currently considering the two options below:
> 1. delete fastinfo whenever we need to change info;
> 2. add an extra sequence number to the fastinfo and info structures to keep them in the right order.
> 
> I look forward to hearing your suggestions on this issue and your preferred solution.
> If you need any more information, please let me know.
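
[Illustration] The scenario and the two options quoted above can be sketched as a small toy model. This is not Ceph code: the class and function names (PGStore, Record, replay) are hypothetical, last_update is reduced to an (epoch, version) tuple, and the load rule "apply fastinfo when its last_update is newer than info's" is an assumption made only to reproduce the behaviour described in the report.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Version = Tuple[int, int]  # (epoch, version); e1'3 is modelled as (1, 3)


@dataclass
class Record:
    last_update: Version
    seq: int  # write-order stamp, only consulted for option 2


class PGStore:
    """Hypothetical stand-in for the persisted info and fastinfo of one PG."""

    def __init__(self) -> None:
        self.info: Optional[Record] = None
        self.fastinfo: Optional[Record] = None
        self._seq = 0

    def _next_seq(self) -> int:
        self._seq += 1
        return self._seq

    def write_fastinfo(self, last_update: Version) -> None:
        self.fastinfo = Record(last_update, self._next_seq())

    def write_info(self, last_update: Version, drop_fastinfo: bool = False) -> None:
        self.info = Record(last_update, self._next_seq())
        if drop_fastinfo:  # option 1: a full info write deletes fastinfo
            self.fastinfo = None

    def load(self, use_seq: bool = False) -> Version:
        """Simulate reading pg info after a restart."""
        assert self.info is not None
        fast = self.fastinfo
        if fast is None:
            return self.info.last_update
        if use_seq:  # option 2: order the two records by sequence number
            newer = fast.seq > self.info.seq
        else:        # assumed current rule: compare last_update only
            newer = fast.last_update > self.info.last_update
        return fast.last_update if newer else self.info.last_update


def replay(drop_fastinfo: bool = False, use_seq: bool = False) -> Version:
    store = PGStore()
    store.write_info((1, 1))                              # initial full info
    store.write_fastinfo((1, 3))                          # lu1 written as fastinfo
    store.write_info((1, 2), drop_fastinfo=drop_fastinfo) # peering rewinds to lu2
    return store.load(use_seq=use_seq)


print(replay())                    # stale fastinfo wins
print(replay(drop_fastinfo=True))  # option 1
print(replay(use_seq=True))        # option 2
```

In this model the unmodified path returns lu1, while either option returns lu2. Option 1 needs no change to the record layout, whereas option 2 adds a field to both records.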

Ah, that does look like a bug.  I've opened a tracker ticket for this,

	https://tracker.ceph.com/issues/43580

Does that look right?  I think the fix is pretty simple:

	https://github.com/ceph/ceph/pull/32615

Thanks!
sage

> 
> 
> thanks,
> Song
> 
> 