Re: OSD write op out of order

YunfeiGuan <gyfelectric@xxxxxxxxx> · Tue, 4 Jan 2022 20:43:01 +0800

I pasted my OSD log into https://tracker.ceph.com/issues/23827.

On Mon, Dec 27, 2021 at 18:08, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Mon, Dec 27, 2021 at 9:12 AM gyfelectric <gyfelectric@xxxxxxxxx> wrote:

    Hi all, 

Recently, OSD write ops have often been processed out of order in my environment (14.2.5), and my FUSE client broke
due to "FAILED assert(ob->last_commit_tid < tid)". My application can no longer work normally.

The sequence of events that triggers this problem is as follows.
Notes:
a. My data pool is EC 4+2.
b. One OSD of pg_1 (osd.x) is down.

Event sequence:
t1: op_1 (write) is sent to the OSD, which sends 5 shards to 5 OSDs. Only 4 shards return, not counting the primary OSD, because osd.x is down.
t2: Many other operations occur in this PG and are recorded in the pg_log.
t3: op_n (write) is sent to the OSD, which sends 5 shards to 5 OSDs. Only 4 shards return, not counting the primary OSD, because osd.x is down.
t4: A peer OSD reports the osd.x timeout to the monitor, and osd.x is marked down.
t5: pg_1 starts canceling op_1, op_2 … op_n and requeueing them to the OSD op_wq.
t6: pg_1 starts peering, and op_1 is trimmed from the pg_log and the dup map in the process.

Unless I’m misunderstanding, either you have more ops that haven’t been committed+acked than the length of the pg log dup tracking, or else there’s a bug here and it’s trimming farther than it should.

Can you clarify which case? Because if you’re sending more ops than the pg log length, this is an expected failure and not one that’s feasible to resolve. You just need to spend the money to have enough memory for longer logs and dup detection.

-Greg

t7: pg_1 becomes active and starts reprocessing op_1, op_2 … op_n.
t8: op_1 is not found in the pg_log or the dup map, so it is redone.
t9: op_n is found in the pg_log or the dup map and is considered completed, so the OSD reply is returned to the client directly with tid_op_n.
t10: op_1 completes and its reply is returned to the client with tid_op_1. Because the client has already recorded tid_op_n as its last commit, it crashes on "assert(ob->last_commit_tid < tid)".
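
The sequence t1 to t10 can be reproduced in a toy model. The sketch below is not Ceph code: ToyDupTracker, replay, client_accepts and run are hypothetical names, dup tracking is reduced to a bounded ordered map, and it assumes that an op found in the tracker is answered before any op that has to be redone. With a window at least as large as the number of outstanding ops the replies stay in order; with a smaller window the oldest ops are redone and their replies arrive late, which trips the client-side check.

```python
from collections import OrderedDict


class ToyDupTracker:
    """Bounded record of completed op tids (hypothetical, not Ceph's pg_log)."""

    def __init__(self, dups_tracked):
        self.dups_tracked = dups_tracked
        self.entries = OrderedDict()  # tid -> None, oldest first

    def record(self, tid):
        self.entries[tid] = None
        # Trim the oldest entries once the window is exceeded.
        while len(self.entries) > self.dups_tracked:
            self.entries.popitem(last=False)

    def is_dup(self, tid):
        return tid in self.entries


def replay(tracker, requeued_tids):
    """Return the order in which replies reach the client after peering.

    Assumption: an op found in the tracker is answered immediately, while
    an op that must be redone is answered only after those replies.
    """
    immediate, redone = [], []
    for tid in requeued_tids:
        (immediate if tracker.is_dup(tid) else redone).append(tid)
    return immediate + redone


def client_accepts(reply_order):
    """Mimic the client-side check last_commit_tid < tid for every reply."""
    last_commit_tid = 0
    for tid in reply_order:
        if not last_commit_tid < tid:
            return False
        last_commit_tid = tid
    return True


def run(num_ops, dups_tracked):
    tracker = ToyDupTracker(dups_tracked)
    tids = list(range(1, num_ops + 1))
    for tid in tids:  # t1..t3: ops are logged but not yet acked to the client
        tracker.record(tid)
    order = replay(tracker, tids)  # t5..t10: requeue, peer, reprocess
    return order, client_accepts(order)


if __name__ == "__main__":
    print(run(num_ops=5, dups_tracked=10))  # window covers all outstanding ops
    print(run(num_ops=5, dups_tracked=3))   # op_1 and op_2 were trimmed
```

In the second call the reply for tid 1 arrives after the reply for tid 5, which corresponds to t9 and t10 above.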

I found a related issue at https://tracker.ceph.com/issues/23827, which has some discussion of this problem,
but I didn't find an effective way to avoid it.

I think the current mechanism for preventing non-idempotent ops from being repeated is flawed; maybe we should redesign it.
What do you think? And if I am wrong, what should I do to avoid this problem?

Any response would be greatly appreciated. Thank you!

                                gyfelectric

                                    gyfelectric@xxxxxxxxx


_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx