Re: SLOW SSD's after moving to Bluestore

On Tue, Dec 11, 2018 at 7:28 PM Tyler Bishop
<tyler.bishop@xxxxxxxxxxxxxxxxx> wrote:
>
> Now I'm just trying to figure out how to create filestore in Luminous.
> I've read every doc and tried every flag but I keep ending up with
> either a data LV taking 100% of the VG or a bunch of random errors for
> unsupported flags...

An LV taking 100% of the VG sounds like it tried to deploy bluestore;
ceph-deploy will default to that behavior unless the LVs are created by
hand.
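
If you want to go the by-hand route, a rough sketch (the VG/LV names
and the 10G journal size here are just made-up examples, adjust to
taste) would be to carve out a journal LV and a data LV and feed them
straight to `ceph-volume lvm create`:

    # hypothetical names/sizes; point these at the SSD you are preparing
    vgcreate ceph-sdb /dev/sdb
    lvcreate -L 10G -n journal-sdb ceph-sdb
    lvcreate -l 100%FREE -n data-sdb ceph-sdb
    ceph-volume lvm create --filestore --data ceph-sdb/data-sdb --journal ceph-sdb/journal-sdb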

A newer option would be to run the `ceph-volume lvm batch` command
directly on your server (not yet supported by ceph-deploy). It creates
all the VGs/LVs needed and detects which devices are HDDs and which are
SSDs, sending the journals to the SSD if one is present:

    ceph-volume lvm batch --filestore /dev/sda /dev/sdb /dev/sdc

That would create 3 OSDs, one for each spinning drive (assuming these
are spinning), with each journal colocated on the same device. To put
the journals on a separate device, a solid-state device would need to
be added, for example:

    ceph-volume lvm batch --filestore /dev/sda /dev/sdb /dev/sdc /dev/nvme0n1

That would again create 3 OSDs, but would place all 3 journals on nvme0n1.
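
If you want to sanity-check the layout before anything gets created,
`lvm batch` should also accept a `--report` flag (a dry run, as far as
I recall), e.g.:

    ceph-volume lvm batch --report --filestore /dev/sda /dev/sdb /dev/sdc /dev/nvme0n1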


>
> # ceph-disk prepare --filestore --fs-type xfs --data-dev /dev/sdb1
> --journal-dev /dev/sdb2 --osd-id 3
> usage: ceph-disk [-h] [-v] [--log-stdout] [--prepend-to-path PATH]
>                  [--statedir PATH] [--sysconfdir PATH] [--setuser USER]
>                  [--setgroup GROUP]
>
>
> {prepare,activate,activate-lockbox,activate-block,activate-journal,activate-all,list,suppress-activate,unsuppress-activate,deactivate,destroy,zap,trigger,fix}
>                  ...
> ceph-disk: error: unrecognized arguments: /dev/sdb1
> On Tue, Dec 11, 2018 at 7:22 PM Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >
> > Hello,
> >
> > On Tue, 11 Dec 2018 23:22:40 +0300 Igor Fedotov wrote:
> >
> > > Hi Tyler,
> > >
> > > I suspect you have BlueStore DB/WAL at these drives as well, don't you?
> > >
> > > Then perhaps you have performance issues with f[data]sync requests which
> > > DB/WAL invoke pretty frequently.
> > >
> > Since he explicitly mentioned using these SSDs with filestore AND the
> > journals on the same SSD I'd expect a similar impact aka piss-poor
> > performance in his existing setup (the 300 other OSDs).
> >
> > Unless of course bluestore is significantly more sync-happy than the
> > filestore journal and/or other bluestore particulars (reduced caching
> > space, not caching in some situations) are rearing their ugly heads.
> >
> > Christian
> >
> > > See the following links for details:
> > >
> > > https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
> > >
> > > https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > >
> > > The latter link shows pretty poor numbers for M500DC drives.
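> > >
> > > For reference, the kind of test that article runs (destructive, so
> > > only point it at a device with no data you care about; /dev/sdX is
> > > a placeholder) is roughly:
> > >
> > >     fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
> > >         --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test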
> > >
> > >
> > > Thanks,
> > >
> > > Igor
> > >
> > >
> > > On 12/11/2018 4:58 AM, Tyler Bishop wrote:
> > >
> > > > Older Crucial/Micron M500/M600
> > > > _____________________________________________
> > > >
> > > > *Tyler Bishop*
> > > > EST 2007
> > > >
> > > >
> > > > O:513-299-7108 x1000
> > > > M:513-646-5809
> > > > http://BeyondHosting.net
> > > >
> > > >
> > > > On Mon, Dec 10, 2018 at 8:57 PM Christian Balzer <chibi@xxxxxxx> wrote:
> > > >
> > > >     Hello,
> > > >
> > > >     On Mon, 10 Dec 2018 20:43:40 -0500 Tyler Bishop wrote:
> > > >
> > > >     > I don't think thats my issue here because I don't see any IO to
> > > >     justify the
> > > >     > latency.  Unless the IO is minimal and its ceph issuing a bunch
> > > >     of discards
> > > >     > to the ssd and its causing it to slow down while doing that.
> > > >     >
> > > >
> > > >     What does atop have to say?
> > > >
> > > >     Discards/Trims are usually visible in it, this is during a fstrim of a
> > > >     RAID1 / :
> > > >     ---
> > > >     DSK |          sdb  | busy     81% |  read       0 | write  8587
> > > >     | MBw/s 2323.4 |  avio 0.47 ms |
> > > >     DSK |          sda  | busy     70% |  read       2 | write  8587
> > > >     | MBw/s 2323.4 |  avio 0.41 ms |
> > > >     ---
> > > >
> > > >     The numbers tend to be a lot higher than what the actual interface is
> > > >     capable of, clearly the SSD is reporting its internal activity.
> > > >
> > > >     In any case, it should give a good insight of what is going on
> > > >     activity
> > > >     wise.
> > > >     Also for posterity and curiosity, what kind of SSDs?
> > > >
> > > >     Christian
> > > >
> > > >     > Log isn't showing anything useful and I have most debugging
> > > >     disabled.
> > > >     >
> > > >     >
> > > >     >
> > > >     > On Mon, Dec 10, 2018 at 7:43 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> > > >     >
> > > >     > > Hi Tyler,
> > > >     > >
> > > >     > > I think we had a user a while back that reported they had
> > > >     background
> > > >     > > deletion work going on after upgrading their OSDs from
> > > >     filestore to
> > > >     > > bluestore due to PGs having been moved around.  Is it possible
> > > >     that your
> > > >     > > cluster is doing a bunch of work (deletion or otherwise)
> > > >     beyond the
> > > >     > > regular client load?  I don't remember how to check for this
> > > >     off the top
> > > >     > > of my head, but it might be something to investigate.  If
> > > >     that's what it
> > > >     > > is, we just recently added the ability to throttle background
> > > >     deletes:
> > > >     > >
> > > >     > > https://github.com/ceph/ceph/pull/24749
> > > >     > >
> > > >     > >
> > > >     > > If the logs/admin socket don't tell you anything, you could
> > > >     also try
> > > >     > > using our wallclock profiler to see what the OSD is spending
> > > >     its time
> > > >     > > doing:
> > > >     > >
> > > >     > > https://github.com/markhpc/gdbpmp/
> > > >     > >
> > > >     > >
> > > >     > > ./gdbpmp -t 1000 -p`pidof ceph-osd` -o foo.gdbpmp
> > > >     > >
> > > >     > > ./gdbpmp -i foo.gdbpmp -t 1
> > > >     > >
> > > >     > >
> > > >     > > Mark
> > > >     > >
> > > >     > > On 12/10/18 6:09 PM, Tyler Bishop wrote:
> > > >     > > > Hi,
> > > >     > > >
> > > >     > > > I have an SSD only cluster that I recently converted from
> > > >     filestore to
> > > >     > > > bluestore and performance has totally tanked. It was fairly
> > > >     decent
> > > >     > > > before, only having a little additional latency than
> > > >     expected.  Now
> > > >     > > > since converting to bluestore the latency is extremely high,
> > > >     SECONDS.
> > > >     > > > I am trying to determine if it is an issue with the SSDs or
> > > >     Bluestore
> > > >     > > > treating them differently than filestore... potential garbage
> > > >     > > > collection? 24+ hrs ???
> > > >     > > >
> > > >     > > > I am now seeing constant 100% IO utilization on ALL of the
> > > >     devices and
> > > >     > > > performance is terrible!
> > > >     > > >
> > > >     > > > IOSTAT
> > > >     > > >
> > > >     > > > avg-cpu:  %user   %nice %system %iowait %steal   %idle
> > > >     > > >            1.37    0.00    0.34   18.59 0.00   79.70
> > > >     > > >
> > > >     > > > Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s
> > > >     > > > avgrq-sz avgqu-sz   await r_await w_await svctm  %util
> > > >     > > > sda               0.00     0.00    0.00 9.50  0.00    64.00
> > > >     > > > 13.47     0.01    1.16    0.00    1.16  1.11  1.05
> > > >     > > > sdb               0.00    96.50    4.50   46.50 34.00 11776.00
> > > >     > > >  463.14   132.68 1174.84  782.67 1212.80 19.61 100.00
> > > >     > > > dm-0              0.00     0.00    5.50  128.00 44.00  8162.00
> > > >     > > >  122.94   507.84 1704.93  674.09 1749.23  7.49 100.00
> > > >     > > >
> > > >     > > > avg-cpu:  %user   %nice %system %iowait %steal   %idle
> > > >     > > >            0.85    0.00    0.30   23.37 0.00   75.48
> > > >     > > >
> > > >     > > > Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s
> > > >     > > > avgrq-sz avgqu-sz   await r_await w_await svctm  %util
> > > >     > > > sda               0.00     0.00    0.00 3.00  0.00    17.00
> > > >     > > > 11.33     0.01    2.17    0.00    2.17  2.17  0.65
> > > >     > > > sdb               0.00    24.50    9.50   40.50 74.00 10000.00
> > > >     > > >  402.96    83.44 2048.67 1086.11 2274.46 20.00 100.00
> > > >     > > > dm-0              0.00     0.00   10.00   33.50 78.00  2120.00
> > > >     > > >  101.06   287.63 8590.47 1530.40 10697.96 22.99 100.00
> > > >     > > >
> > > >     > > > avg-cpu:  %user   %nice %system %iowait %steal   %idle
> > > >     > > >            0.81    0.00    0.30   11.40 0.00   87.48
> > > >     > > >
> > > >     > > > Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s
> > > >     > > > avgrq-sz avgqu-sz   await r_await w_await svctm  %util
> > > >     > > > sda               0.00     0.00    0.00 6.00  0.00    40.25
> > > >     > > > 13.42     0.01    1.33    0.00    1.33  1.25  0.75
> > > >     > > > sdb               0.00   314.50   15.50  72.00  122.00 17264.00
> > > >     > > >  397.39    61.21 1013.30  740.00 1072.13 11.41  99.85
> > > >     > > > dm-0              0.00     0.00   10.00  427.00 78.00 27728.00
> > > >     > > >  127.26   224.12  712.01 1147.00  701.82  2.28 99.85
> > > >     > > >
> > > >     > > > avg-cpu:  %user   %nice %system %iowait %steal   %idle
> > > >     > > >            1.22    0.00    0.29    4.01 0.00   94.47
> > > >     > > >
> > > >     > > > Device:         rrqm/s   wrqm/s     r/s     w/s rkB/s    wkB/s
> > > >     > > > avgrq-sz avgqu-sz   await r_await w_await svctm  %util
> > > >     > > > sda               0.00     0.00    0.00 3.50  0.00    17.00
> > > >     > > >  9.71     0.00    1.29    0.00    1.29  1.14  0.40
> > > >     > > > sdb               0.00     0.00    1.00  39.50  8.00 10112.00
> > > >     > > >  499.75    78.19 1711.83 1294.50 1722.39 24.69 100.00
> > > >     > > >
> > > >     > > >
> > > >     > >
> > > >
> > > >
> > > >     --
> > > >     Christian Balzer        Network/Systems Engineer
> > > >     chibi@xxxxxxx           Rakuten Communications
> > > >
> > > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


