Re: Reply: assert(objiter->second->version > last_divergent_update) when testing pull out disk and insert

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 16 Oct 2017 15:35:40 -0700

On Sat, Oct 14, 2017 at 7:24 AM, zhaomingyue <zhao.mingyue@xxxxxxx> wrote:
> 1. This assert happened by accident and is not easy to reproduce. In fact, I also suspect it is caused by lost device data;
> but if data was lost, how can it happen that (last_update + 1 == log.rbegin.version)? With lost data I would expect a messier result. At present I cannot explain this situation.
>
> 2. According to the read_log code, assume the following situation:
> When an OSD starts, if the PG log has lost some content because of a power-off or an xfs error, then log.head would be bigger than log.rbegin.version in memory;
> during peering, last_update is used as one of the deciding arguments to find_best, so the consistent one (the OSD that has a shorter PG log but a normal last_update) may become the auth log;
> other OSDs using this auth log would then make the PG inconsistent when it is scrubbed, wouldn't they?

I'm not sure I understand what you're saying here.

I don't think OSDs will write down data from their peers until they
get far enough along to actually commit to somebody being primary,
though. So if we have an inconsistent guy with lost log, he'll hit one
of these asserts and then one of the remaining (consistent) OSDs will
start over again and get selected as the winner.

Keep in mind that once we know the PG metadata is inconsistent, we
don't want to keep on using that disk for data as we know it's not
trusted!
-Greg

>
> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
> Sent: October 14, 2017 0:34
> To: zhaomingyue 09440 (RD)
> Cc: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxx
> Subject: Re:  assert(objiter->second->version > last_divergent_update) when testing pull out disk and insert
>
> On Fri, Oct 13, 2017 at 12:48 AM, zhaomingyue <zhao.mingyue@xxxxxxx> wrote:
>> Hi:
>>     I hit an assert like
>> bug 16279 (http://tracker.ceph.com/issues/16279) while testing pulling out a
>> disk and reinserting it, on ceph version 10.2.5: assert(objiter->second->version >
>> last_divergent_update)
>>
>> According to the OSD log, I think this may be due to (log.head !=
>> *log.log.rbegin.version.version) when some abnormal condition
>> happened, such as a power-off or pulling out and reinserting a disk.
>
> I don't think that is supposed to be possible. We apply all changes like this atomically; FileStore does all its journaling to prevent partial updates like this.
>
> A few other people have reported the same issue on disk pull, so maybe there's some *other* issue going on, but the correct fix is to prevent those two from differing (unless I misunderstand the context).
>
> Given one of the reporters on that ticket confirms they also had xfs issues, I find it vastly more likely that something in your kernel configuration and hardware stack is not writing out data the way it claims to. Be very, very sure all that is working correctly!
>
>
>> In the situation below, merge_log would push 234'1034 into the divergent
>> list, and the divergent list has only one node; this then leads to
>> assert(objiter->second->version > last_divergent_update).
>>
>> olog  ----------------   (0'0, 234'1034)  olog.head = 234'1034
>>
>> log   ----------------   (0'0, 234'1034)  log.head = 234'1033
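
[Editor's sketch] The scenario above can be modelled in a few lines of C++. This is a toy model and not the Ceph implementation: `EVersion`, `ToyLog`, `split_divergent`, and `divergent_check_holds` are hypothetical names, and the model assumes that every entry touches the same object and that local entries newer than `log.head` are the ones moved aside as divergent, as the message describes.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <tuple>

// Toy stand-in for an (epoch, version) pair, written epoch'version in the thread.
struct EVersion {
  uint32_t epoch;
  uint64_t version;
};

inline bool operator<(const EVersion& a, const EVersion& b) {
  return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}

// Toy log: entries are ordered oldest first; head is tracked separately,
// which is what allows it to disagree with the newest entry.
struct ToyLog {
  std::list<EVersion> entries;
  EVersion head{0, 0};
};

// Assumption: local entries newer than local.head are treated as divergent
// and moved aside, newest last.
std::list<EVersion> split_divergent(ToyLog& local) {
  std::list<EVersion> divergent;
  while (!local.entries.empty() && local.head < local.entries.back()) {
    divergent.push_front(local.entries.back());
    local.entries.pop_back();
  }
  return divergent;
}

// Returns whether the condition from the failing assert would hold in this
// model: the newest authoritative entry must be strictly newer than the
// last divergent update.
bool divergent_check_holds(const ToyLog& auth,
                           const std::list<EVersion>& divergent) {
  assert(!auth.entries.empty());
  if (divergent.empty()) return true;
  const EVersion& last_divergent_update = divergent.back();
  return last_divergent_update < auth.entries.back();
}
```

With olog.head = 234'1034 and a local log whose newest entry is 234'1034 but whose head is 234'1033, the single divergent entry has the same version as the newest authoritative entry, so the strict comparison fails in this model.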
>>
>>
>>
>> I looked at the OSD load_pgs code; in function PGLog::read_log() there is code like this:
>>  .....
>>  for (p->seek_to_first(); p->valid() ; p->next()) {
>>    .....
>>    log.log.push_back(e);
>>    log.head = e.version;  // every pg log node
>>  }
>>  .....
>>  log.head = info.last_update;
>>
>>
>>
>> Two questions:
>>
>> First: why set (log.head = info.last_update) after all PG log nodes have been
>> processed (every node has already updated log.head = e.version)?
>>
>> Second: can info.last_update ever be less than
>> *log.log.rbegin.version, and if so, in what scenario?
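
[Editor's sketch] To make the second question concrete, here is a minimal C++ model of the loop shape quoted above. It is only an illustration under stated assumptions, not the real PGLog::read_log(): `Version`, `ToyPGLog`, `toy_read_log`, `head_matches_newest_entry`, and `require_consistent` are hypothetical names, and the stored entries and last_update are passed in directly instead of being read from disk.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy (epoch, version) pair.
struct Version {
  uint32_t epoch;
  uint64_t version;
};

inline bool operator==(const Version& a, const Version& b) {
  return a.epoch == b.epoch && a.version == b.version;
}

struct ToyPGLog {
  std::vector<Version> entries;  // oldest first
  Version head{0, 0};
};

// Same shape as the quoted snippet: head follows every entry in the loop,
// then is overwritten by the separately stored last_update.
ToyPGLog toy_read_log(const std::vector<Version>& stored, Version last_update) {
  ToyPGLog log;
  for (const Version& e : stored) {
    log.entries.push_back(e);
    log.head = e;  // every pg log node
  }
  log.head = last_update;
  return log;
}

// Hypothetical sanity check: an empty log is trivially consistent,
// otherwise head must equal the newest entry's version.
bool head_matches_newest_entry(const ToyPGLog& log) {
  return log.entries.empty() || log.head == log.entries.back();
}

// Hypothetical helper that aborts on a mismatch instead of continuing.
void require_consistent(const ToyPGLog& log) {
  assert(head_matches_newest_entry(log));
}
```

In this model the two values only disagree when the stored last_update and the stored entries were not written together, for example entries up to 234'1034 with last_update = 234'1033.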
>
> I'm looking at the luminous code base right now and things have changed a bit so I don't have the specifics of your question on hand.
>
> But the general reason we change these versions around is because we need to reconcile the logs across all OSDs. If one OSD has an entry for an operation that was never returned to the client, we may need to declare it divergent and undo it. (In replicated pools, entries are only divergent if the OSD hosting it was either netsplit from the primary, or else managed to commit something during a failure event that its peers didn't and then was resubmitted under a different ID by the client on recovery. In erasure-coded pools things are more complicated because we can only roll operations forward if a quorum of the shards are present.)
> -Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com