It's the recovery and backfill code. There's not one place; it's what most
of the OSD code is for.

On Thursday, September 11, 2014, yuelongguang <fastsync at 163.com> wrote:
> as for the second question, could you tell me where the code is.
> how ceph makes size/min_size copies?
>
> thanks
>
> At 2014-09-11 12:19:18, "Gregory Farnum" <greg at inktank.com> wrote:
> >On Wed, Sep 10, 2014 at 8:29 PM, yuelongguang <fastsync at 163.com> wrote:
> >>
> >> as for ack and ondisk, ceph has size and min_size to decide how many
> >> replicas there are.
> >> if the client receives ack or ondisk, does that mean at least min_size
> >> osds have done the ops?
> >>
> >> i am reading the source code, could you help me with the two questions.
> >>
> >> 1.
> >> on the osd, where is the code that replies to ops separately according
> >> to ack or ondisk?
> >> i checked the code, but i thought they are always replied together.
> >
> >It depends on what journaling mode you're in, but generally they're
> >triggered separately (unless it goes on disk first, in which case it
> >will skip the ack; this is the mode it uses for non-btrfs
> >filesystems). The places where it actually replies are pretty clear
> >about doing one or the other, though...
> >
> >>
> >> 2.
> >> now i just know how the client writes ops to the primary osd; inside
> >> the osd cluster, how does it promise min_size copies are reached?
> >> i mean when the primary osd receives ops, how does it spread ops to
> >> the others, and how does it process their replies?
> >
> >That's not how it works. The primary for a PG will not go "active"
> >with it until it has at least min_size copies that it knows about.
> >Once the OSD is doing any processing of the PG, it requires all
> >participating members to respond before it sends any messages back to
> >the client.
> >-Greg
> >Software Engineer #42 @ http://inktank.com | http://ceph.com
> >
> >>
> >> greg, thanks very much
> >>
> >> At 2014-09-11 01:36:39, "Gregory Farnum" <greg at inktank.com> wrote:
> >>
> >> The important bit there is actually near the end of the message output
> >> line, where the first says "ack" and the second says "ondisk".
> >>
> >> I assume you're using btrfs; the ack is returned after the write is
> >> applied in-memory and readable by clients. The ondisk (commit) message
> >> is returned after it's durable to the journal or the backing filesystem.
> >> -Greg
> >>
> >> On Wednesday, September 10, 2014, yuelongguang <fastsync at 163.com> wrote:
> >>>
> >>> hi, all
> >>> i recently debugged ceph rbd, and the log shows that one write to an
> >>> osd can get two replies.
> >>> the difference between them is the seq.
> >>> why?
> >>>
> >>> thanks
> >>> ---log---------
> >>> reader got message 6 0x7f58900010a0 osd_op_reply(15 rbd_data.19d92ae8944a.0000000000000001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6
> >>> 2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue 0x7f58900010a0 prio 127
> >>> 2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader reading tag...
> >>> 2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got MSG
> >>> 2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0
> >>> 2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader wants 247 from dispatch throttler 247/104857600
> >>> 2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got front 247
> >>> 2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).aborted = 0
> >>> 2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got 247 + 0 + 0 byte message
> >>> 2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq # = 7 front_crc_ = 3699418201 middle_crc = 0 data_crc = 0
> >>> 2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15 rbd_data.19d92ae8944a.0000000000000001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6
> >>>
> >> --
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>
> >

--
Software Engineer #42 @ http://inktank.com | http://ceph.com
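
For anyone following along from the client side, the ack/ondisk split
discussed in the thread maps onto the two callbacks of a librados async
write. Below is a minimal sketch using the python-rados binding; the
conffile path, pool name, and object name are invented for illustration.
The oncomplete callback corresponds to the "ack" reply (applied and
readable) and onsafe to the "ondisk" commit.

    #!/usr/bin/env python
    # Minimal sketch: observe the "ack" and "ondisk" replies from the client
    # side via librados async completions. Pool/object/conffile names are
    # assumptions for illustration only.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    def on_ack(completion):
        # Fires when the write is applied and readable -- the "ack" reply.
        print("ack: write applied in memory")

    def on_commit(completion):
        # Fires when the write is durable on the acting set -- the "ondisk" reply.
        print("ondisk: write committed to journal/backing fs")

    comp = ioctx.aio_write('test_obj', b'hello', offset=0,
                           oncomplete=on_ack, onsafe=on_commit)
    comp.wait_for_safe()   # block until the ondisk reply has arrived

    ioctx.close()
    cluster.shutdown()

On a btrfs-backed OSD the two callbacks can fire at noticeably different
times; with writeahead journaling (non-btrfs), where the ack is skipped as
Greg notes, they tend to arrive together.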
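
On the size/min_size question: the pool-level knobs are set with
"ceph osd pool set <pool> size N" and "ceph osd pool set <pool> min_size M";
the replication itself lives in the PG machinery Greg points at. As a rough
illustration only -- a toy model, not the actual C++ OSD code -- the ordering
he describes is: the PG refuses to go active below min_size, and once active
the primary answers the client only after every member of the acting set has
acknowledged, not just min_size of them. All names below are invented.

    # toy_replication.py -- a toy model of the ordering described in the
    # thread, NOT the actual OSD implementation.

    class ToyOSD(object):
        def __init__(self, name):
            self.name = name
            self.store = {}

        def apply_write(self, obj, data):
            # Stands in for the sub-op apply/commit and its reply to the primary.
            self.store[obj] = data
            return "ack from %s" % self.name


    class ToyPrimary(object):
        def __init__(self, acting_set, min_size):
            self.acting_set = acting_set
            self.min_size = min_size

        def handle_client_write(self, obj, data):
            # Models "will not go active until it has at least min_size copies".
            if len(self.acting_set) < self.min_size:
                raise RuntimeError("PG not active: %d members < min_size=%d"
                                   % (len(self.acting_set), self.min_size))
            # Send the op to all participating members and wait for every
            # reply -- not just min_size of them -- before answering the client.
            acks = [osd.apply_write(obj, data) for osd in self.acting_set]
            assert len(acks) == len(self.acting_set)
            return "reply to client: %r written on %d osds" % (obj, len(acks))


    if __name__ == "__main__":
        pg = ToyPrimary([ToyOSD("osd.0"), ToyOSD("osd.1"), ToyOSD("osd.2")],
                        min_size=2)
        print(pg.handle_client_write("rbd_data.19d92ae8944a.0000000000000001",
                                     b"..."))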