Evgeniy,

Do you mind repeating your test with this code applied?

Thanks!
sage

On Wed, 30 Mar 2016, Jason Dillaman wrote:
> Opened PR 8380 [1] to pass the WILLNEED flag for object map updates.
>
> [1] https://github.com/ceph/ceph/pull/8380
>
> --
>
> Jason Dillaman
>
>
> ----- Original Message -----
> > From: "Sage Weil" <sage@xxxxxxxxxxxx>
> > To: "Jason Dillaman" <dillaman@xxxxxxxxxx>
> > Cc: "Evgeniy Firsov" <Evgeniy.Firsov@xxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx
> > Sent: Wednesday, March 30, 2016 4:02:16 PM
> > Subject: Re: reads while 100% write
> >
> > On Wed, 30 Mar 2016, Jason Dillaman wrote:
> > > This IO is being performed within an OSD class method. I can add a new
> > > cls_cxx_read2 method to accept cache hints and update the associated
> > > object map methods. Would this apply to writes as well?
> >
> > Yeah, we'll want to hint them both.
> >
> > s
> >
> > > --
> > >
> > > Jason Dillaman
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Sage Weil" <sage@xxxxxxxxxxxx>
> > > > To: "Jason Dillaman" <dillaman@xxxxxxxxxx>
> > > > Cc: "Evgeniy Firsov" <Evgeniy.Firsov@xxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx
> > > > Sent: Wednesday, March 30, 2016 3:55:14 PM
> > > > Subject: Re: reads while 100% write
> > > >
> > > > On Wed, 30 Mar 2016, Jason Dillaman wrote:
> > > > > Are you using the RBD default of 4MB object sizes or are you using
> > > > > something much smaller like 64KB? An object map of that size should be
> > > > > tracking up to 24,576,000 objects. When you ran your test before, did
> > > > > you have the RBD object map disabled? This definitely seems to be a use
> > > > > case where the lack of a cache in front of BlueStore is hurting small IO.
> > > >
> > > > Using the rados cache hint WILLNEED is probably appropriate here..
> > > >
> > > > sage
> > > >
> > > > > --
> > > > >
> > > > > Jason Dillaman
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Evgeniy Firsov" <Evgeniy.Firsov@xxxxxxxxxxx>
> > > > > > To: "Jason Dillaman" <dillaman@xxxxxxxxxx>
> > > > > > Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx
> > > > > > Sent: Wednesday, March 30, 2016 3:00:47 PM
> > > > > > Subject: Re: reads while 100% write
> > > > > >
> > > > > > 1.5T in that run.
> > > > > > With 150G the behavior is the same, except it says "_do_read 0~18 size 615030"
> > > > > > instead of 6M.
> > > > > >
> > > > > > Also, when the random 4k write starts, there are more reads than writes:
> > > > > >
> > > > > > Device:  rrqm/s  wrqm/s   r/s    w/s     rkB/s     wkB/s    avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> > > > > > sdd      0.00    1887.00  0.00   344.00  0.00      8924.00  51.88     0.36      1.06   0.00     1.06     0.91   31.20
> > > > > > sde      30.00   0.00     30.00  957.00  18120.00  3828.00  44.47     0.25      0.26   3.87     0.14     0.17   16.40
> > > > > >
> > > > > > Logs: http://pastebin.com/gGzfR5ez
> > > > > >
> > > > > >
> > > > > > On 3/30/16, 11:37 AM, "Jason Dillaman" <dillaman@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > >How large is your RBD image? 100 terabytes?
> > > > > > >
> > > > > > >--
> > > > > > >
> > > > > > >Jason Dillaman
> > > > > > >
> > > > > > >
> > > > > > >----- Original Message -----
> > > > > > >> From: "Evgeniy Firsov" <Evgeniy.Firsov@xxxxxxxxxxx>
> > > > > > >> To: "Sage Weil" <sage@xxxxxxxxxxxx>
> > > > > > >> Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > > > > >> Sent: Wednesday, March 30, 2016 2:14:12 PM
> > > > > > >> Subject: Re: reads while 100% write
> > > > > > >>
> > > > > > >> These are the suspicious lines:
> > > > > > >>
> > > > > > >> 2016-03-30 10:54:23.142205 7f2e933ff700 10 bluestore(src/dev/osd0) read
> > > > > > >>   0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 6144018~6012 = 6012
> > > > > > >> 2016-03-30 10:54:23.142252 7f2e933ff700 15 bluestore(src/dev/osd0) read
> > > > > > >>   0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
> > > > > > >> 2016-03-30 10:54:23.142260 7f2e933ff700 20 bluestore(src/dev/osd0) _do_read 8210~4096 size 6150030
> > > > > > >> 2016-03-30 10:54:23.142267 7f2e933ff700  5 bdev(src/dev/osd0/block) read 8003854336~8192
> > > > > > >> 2016-03-30 10:54:23.142609 7f2e933ff700 10 bluestore(src/dev/osd0) read
> > > > > > >>   0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 = 4096
> > > > > > >> 2016-03-30 10:54:23.142882 7f2e933ff700 15 bluestore(src/dev/osd0) _write
> > > > > > >>   0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
> > > > > > >> 2016-03-30 10:54:23.142888 7f2e933ff700 20 bluestore(src/dev/osd0) _do_write
> > > > > > >>   #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 - have 6150030 bytes in 1 extents
> > > > > > >>
> > > > > > >> More logs here: http://pastebin.com/74WLzFYw
> > > > > > >>
> > > > > > >>
> > > > > > >> On 3/30/16, 4:19 AM, "Sage Weil" <sage@xxxxxxxxxxxx> wrote:
> > > > > > >>
> > > > > > >> >On Wed, 30 Mar 2016, Evgeniy Firsov wrote:
> > > > > > >> >> After pulling the master branch on Friday I started seeing odd fio
> > > > > > >> >> behavior: I see a lot of reads while writing, and very low performance
> > > > > > >> >> no matter whether it is a read or a write workload.
> > > > > > >> >>
> > > > > > >> >> Output from a sequential 1M write:
> > > > > > >> >>
> > > > > > >> >> Device:  rrqm/s  wrqm/s   r/s     w/s      rkB/s    wkB/s      avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> > > > > > >> >> sdd      0.00    409.00   0.00    364.00   0.00     3092.00    16.99     0.28      0.78   0.00     0.78     0.76   27.60
> > > > > > >> >> sde      0.00    242.00   365.00  363.00   2436.00  9680.00    33.29     0.18      0.24   0.42     0.07     0.23   16.80
> > > > > > >> >>
> > > > > > >> >> block.db -> /dev/sdd
> > > > > > >> >> block    -> /dev/sde
> > > > > > >> >>
> > > > > > >> >>      health HEALTH_OK
> > > > > > >> >>      monmap e1: 1 mons at {a=127.0.0.1:6789/0}
> > > > > > >> >>             election epoch 3, quorum 0 a
> > > > > > >> >>      osdmap e7: 1 osds: 1 up, 1 in
> > > > > > >> >>             flags sortbitwise
> > > > > > >> >>       pgmap v24: 64 pgs, 1 pools, 577 MB data, 9152 objects
> > > > > > >> >>             8210 MB used, 178 GB / 186 GB avail
> > > > > > >> >>                   64 active+clean
> > > > > > >> >>   client io 1550 kB/s rd, 9559 kB/s wr, 645 op/s rd, 387 op/s wr
> > > > > > >> >>
> > > > > > >> >> While on an earlier revision (c1e41af) everything looks as expected:
> > > > > > >> >>
> > > > > > >> >> Device:  rrqm/s  wrqm/s   r/s     w/s      rkB/s    wkB/s      avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> > > > > > >> >> sdd      0.00    4910.00  0.00    680.00   0.00     22416.00   65.93     1.05      1.55   0.00     1.55     1.18   80.00
> > > > > > >> >> sde      0.00    0.00     0.00    3418.00  0.00     217612.00  127.33    63.78     18.18  0.00     18.18    0.25   86.40
> > > > > > >> >>
> > > > > > >> >> Another observation, which may be related to the issue, is that the CPU
> > > > > > >> >> load is imbalanced: a single "tp_osd_tp" thread is 100% busy while the
> > > > > > >> >> rest are idle. It looks like all of the load goes to a single thread
> > > > > > >> >> pool shard; earlier the CPU was well balanced.
> > > > > > >> >
> > > > > > >> >Hmm.  Can you capture a log with debug bluestore = 20 and debug bdev = 20?
> > > > > > >> >
> > > > > > >> >Thanks!
> > > > > > >> >sage
> > > > > > >> >
> > > > > > >> >>
> > > > > > >> >> --
> > > > > > >> >> Evgeniy
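
For anyone trying to reproduce the capture Sage asks for above, the two debug
settings can simply be added to ceph.conf before restarting the OSD; placing
them in the usual [osd] section is an assumption here, but a minimal example
would be:

    [osd]
        debug bluestore = 20
        debug bdev = 20

Setting them at runtime through the admin socket (e.g. ceph daemon osd.0
config set debug_bluestore 20) should also work if the admin socket is
enabled.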
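
And for context on the fix Jason references at the top of the thread: the idea
is for the rbd object_map class methods to tag their object-map reads and
writes with the FADVISE_WILLNEED hint so the object store keeps that data
cached instead of re-reading it from disk on every small update. Below is a
rough sketch of what a hinted read inside an OSD class (cls) method could look
like; treat the cls_cxx_read2() signature, the flag name, and the helper name
as assumptions here -- the authoritative change is whatever PR 8380 merges.

    // Sketch only: read part of an object with a WILLNEED cache hint from
    // inside an OSD class method.  Assumes a cls_cxx_read2() variant that
    // takes op flags plus the CEPH_OSD_OP_FLAG_FADVISE_WILLNEED flag; see
    // PR 8380 for the real interface.
    #include "objclass/objclass.h"
    #include "include/rados.h"

    static int read_map_chunk_hinted(cls_method_context_t hctx,
                                     int off, int len, bufferlist *out)
    {
      // The WILLNEED hint tells the backing store (BlueStore here) that
      // this data will be needed again soon and is worth caching, which
      // should avoid a device read on every small object-map update.
      int r = cls_cxx_read2(hctx, off, len, out,
                            CEPH_OSD_OP_FLAG_FADVISE_WILLNEED);
      return r < 0 ? r : 0;
    }

The write path would pass the same hint through the corresponding write call,
per Sage's comment that both directions should be hinted.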