Re: OSD write op out of order

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 27 Dec 2021 10:07:49 +0000

On Mon, Dec 27, 2021 at 9:12 AM gyfelectric <gyfelectric@xxxxxxxxx> wrote:

>
> Hi all,
>
> Recently, out-of-order OSD write ops have often appeared in my
> environment (14.2.5), and my FUSE client crashes with
> "FAILED assert(ob->last_commit_tid < tid)". My application can’t
> work normally now.
>
> The sequence of events that triggers this problem is as follows.
> Notes:
> a. my data pool is EC 4+2
> b. one OSD (osd.x) of pg_1 is down
>
> Event sequence:
> t1: op_1 (write) is sent to the primary OSD, which sends 5 shards to 5
> OSDs. Only 4 shards return (not counting the primary) because osd.x is
> down.
> t2: many other operations occur in this pg and are recorded in the pg_log
> t3: op_n (write) is sent to the primary OSD, which sends 5 shards to 5
> OSDs. Only 4 shards return (not counting the primary) because osd.x is
> down.
> t4: a peer OSD reports the osd.x timeout to the monitor, and osd.x is
> marked down
> t5: pg_1 starts canceling op_1, op_2 … op_n and requeueing them to the
> OSD op_wq
> t6: pg_1 starts peering, and op_1 is trimmed from the pg_log and dup map
> in the process
>

Unless I’m misunderstanding, either you have more ops that haven’t been
committed+acked than the length of the pg log dup tracking, or else there’s
a bug here and it’s trimming farther than it should.

Can you clarify which case? Because if you’re sending more ops than the pg
log length, this is an expected failure and not one that’s feasible to
resolve. You just need to spend the money to have enough memory for longer
logs and dup detection.
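
To make the first case concrete, here is a toy Python model of it. This is
not Ceph code: DupTracker, replay, first_out_of_order and simulate are
made-up names, and the "dups reply first, forgotten ops finish later"
ordering is just the t8-t10 sequence from your description, taken as an
assumption. It only shows how a bounded dup record plus more in-flight ops
than it can hold ends in a tid going backwards on the client.

```python
from collections import OrderedDict


class DupTracker:
    """Toy bounded record of applied ops (stand-in for pg log + dup entries)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()  # tid -> True, oldest first

    def record(self, tid):
        self.entries[tid] = True
        self.entries.move_to_end(tid)
        # Trim the oldest entries once the bound is exceeded.
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)

    def is_dup(self, tid):
        return tid in self.entries


def replay(tracker, requeued_tids):
    """Return the order in which replies reach the client after a requeue.

    Assumption: an op still in the tracker is answered immediately, while a
    forgotten op has to be redone and therefore completes later.
    """
    replies, redone = [], []
    for tid in requeued_tids:
        if tracker.is_dup(tid):
            replies.append(tid)
        else:
            redone.append(tid)
    return replies + redone


def first_out_of_order(replies):
    """Mimic the client check: every reply tid must exceed the previous one."""
    last = 0
    for tid in replies:
        if tid <= last:
            return tid
        last = tid
    return None


def simulate(max_entries, n_ops):
    """Apply ops 1..n_ops, then requeue all of them as unacked."""
    tracker = DupTracker(max_entries)
    tids = list(range(1, n_ops + 1))
    for tid in tids:
        tracker.record(tid)
    replies = replay(tracker, tids)
    return replies, first_out_of_order(replies)


if __name__ == "__main__":
    print(simulate(max_entries=3, n_ops=5))  # tracker too short
    print(simulate(max_entries=5, n_ops=5))  # tracker long enough
```

With 5 unacked ops and room for only 3 entries, the replies come back as
3, 4, 5, 1, 2 and the check trips on tid 1. With room for all 5, the order
is preserved and nothing trips.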

-Greg

> t7: pg_1 becomes active and starts reprocessing op_1, op_2 … op_n
> t8: op_1 is not found in the pg_log or dup map, so it is redone.
> t9: op_n is found in the pg_log or dup map and is considered completed,
> so the OSD replies to the client directly with tid_op_n
> t10: op_1 completes and returns to the client with tid_op_1. The client
> crashes on "assert(ob->last_commit_tid < tid)"
>
> I found a related issue at https://tracker.ceph.com/issues/23827
> which has some discussion of this problem,
> but I didn’t find an effective way to avoid it.
>
> I think the current mechanism for preventing non-idempotent ops from
> being repeated is flawed; maybe we should redesign it.
> What do you think? And if I am wrong, what should I do to
> avoid this problem?
>
> Any response would be much appreciated, thank you!
>
> gyfelectric
> gyfelectric@xxxxxxxxx
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx