Re: Rados performance inconsistencies, lower than expected performance

An update on the subject. Warning: lengthy post, but with reproducible results and a workaround to get performance back to the expected level.

One of the servers had a broken disk controller that caused performance issues on that host; fio showed roughly half the performance on some of its disks compared to the other hosts. The controller has been replaced, but rados performance did not improve.

- write: 313.86 MB/s / 0.203907 s average latency

I figured it would be wise to test all servers and disks to see if they deliver the expected performance.

Since I have data on the cluster that I wanted to keep online, I went one server at a time: delete that server's 3 OSDs, fio-test the disks, recreate the OSDs, add them back to the cluster, wait until the cluster is healthy again, and move on to the next server. All disks now match the expected performance, so with all servers and OSDs back in I reran the rados benchmark.
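
For reference, the per-disk fio test follows the Proxmox Ceph benchmark paper; the command below is roughly what I ran (the device path is a placeholder, and the exact parameters are a reconstruction rather than a copy of my command line; it writes directly to the raw device, so only run it on a disk whose OSD has already been deleted):

fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write \
    --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio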

Performance was almost twice as good as before recreating all OSDs and on par with what I had expected for bandwidth and latency.

- write: 586.679 MB/s / 0.109085 s
- read: 2092.27 MB/s / 0.0292913 s
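
For completeness, the rados benchmark used throughout this post is a plain rados bench write followed by a sequential read against a dedicated test pool, along the lines of the commands below (the pool name and runtime are illustrative, not necessarily the literal values I used):

rados bench -p bench 60 write --no-cleanup   # write test, keep the objects for the read test
rados bench -p bench 60 seq                  # sequential read test
rados -p bench cleanup                       # remove the benchmark objects afterwards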

As the controller was not to blame, I wanted to test whether having different-sized OSDs with 'correct' weights assigned was causing the issue. I removed one OSD on each storage node (one node at a time), re-partitioned it and added it back to the cluster with the correct weight; performance was still OK, though a little slower than before.
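
(For clarity: by 'correct weight' I mean a CRUSH weight matching the new partition size; by convention that is roughly the usable capacity in TiB, so re-adding a smaller OSD looks something like the line below, with a made-up id and weight.)

ceph osd crush reweight osd.7 0.437   # CRUSH weight ~= usable size in TiB; id and value are examples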

Figuring this wasn't the cause either, I took out the OSDs with partitions again, wiped the disks and recreated the OSDs. Performance was now even lower, almost as low as right after I swapped the controller.

Since I knew the performance could be better, I decided to recreate all OSDs, one server at a time, and performance was once again good.

Now that I was able to reproduce the issue, I started over once more and documented all the steps to see if there is any logic to it.

With the cluster performing well, I removed one OSD at a time: wait for the cluster to become healthy again, benchmark, add the OSD back, benchmark again, and move on to the next server.

These are the results of each step (bandwidth in MB/s / average latency in seconds).

One OSD out:

write: 528.439 / 0.121021
read: 2022.19 / 0.03031

OSD back in again:

write: 584.14 / 0.10956
read: 2108.06 / 0.0289867

Next server, one OSD out:

write: 482.923 / 0.132512
read: 2008.38 / 0.0305356

OSD back in again:

write: 578.034 / 0.110686
read: 2059.24 / 0.0297554

Next server, one OSD out:

write: 470.384 / 0.136055
read: 2043.68 / 0.0299759

OSD back in again:

write: 424.01 / 0.150886
read: 2086.94 / 0.0293182

Write performance is now significantly lower than when I started. When I first wrote to the mailing list, performance seemed to go up once Ceph entered the 'near-full' state, so I decided to test that again.

I reached the full state by accident, and the last two write tests showed somewhat better performance, but not near the level I started with.

write: 468.632 / 0.136559
write: 488.523 / 0.130999

I removed the benchmark pool, recreated it and tested a few more times; performance now seems even lower, once again approaching the results I started off with.

write: 449.034 / 0.142524
write: 399.532 / 0.160168
write: 366.831 / 0.174451
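
For reference, removing and recreating the benchmark pool was just the standard pool commands; the pool name, PG count and the temporary mon_allow_pool_delete toggle below are how I would typically do it, not necessarily the literal values used here:

ceph tell mon.* injectargs --mon_allow_pool_delete=true
ceph osd pool delete bench bench --yes-i-really-really-mean-it
ceph osd pool create bench 128 128 replicated
ceph osd pool application enable bench rbd
ceph tell mon.* injectargs --mon_allow_pool_delete=false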

I know how to get performance back to the expected level by recreating all OSDs and shuffling the data around the cluster, but I don't think this should be necessary in the first place.

Just to clarify: when removing an OSD I reweight it to 0 and wait until it is safe to delete the OSD; I assume this is the correct way of doing such things.

Am I doing something wrong? Did I run into some sort of bug?

I'm running Proxmox VE 5.2, which includes ceph version 12.2.7 (94ce186ac93bb28c3c444bccfefb8a31eb0748e4) luminous (stable).

Thanks,
Menno


My script to safely remove an OSD:

#!/bin/bash
# Usage: remove-osd.sh <osd-id>
# Drain the OSD so no PGs depend on it any more
ceph osd crush reweight osd.$1 0.0
# Wait until Ceph reports the OSD can be destroyed without risking data
while ! ceph osd safe-to-destroy $1; do echo "not safe to destroy, waiting.." ; sleep 10 ; done
sleep 5
# Take the OSD out of the cluster and stop its daemon
ceph osd out $1
systemctl disable ceph-osd@$1
systemctl stop ceph-osd@$1
# Remove it from the CRUSH map, the auth database and the OSD map
ceph osd crush remove osd.$1
ceph auth del osd.$1
ceph osd down $1
ceph osd rm $1
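
Recreating an OSD afterwards can be done through the Proxmox tooling or directly; a generic ceph-volume/BlueStore sketch is below (the device path is a placeholder, and this may differ from the exact steps I used):

# Wipe any previous Ceph/LVM metadata from the disk
ceph-volume lvm zap /dev/sdX
# Create a fresh BlueStore OSD on the whole device
ceph-volume lvm create --bluestore --data /dev/sdX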

-----Original message-----
> From:Menno Zonneveld <menno@xxxxxxxx>
> Sent: Monday 10th September 2018 11:45
> To: Alwin Antreich <a.antreich@xxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Cc: Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx>
> Subject: RE:  Rados performance inconsistencies, lower than expected performance
> 
> 
> -----Original message-----
> > From:Alwin Antreich <a.antreich@xxxxxxxxxxx>
> > Sent: Thursday 6th September 2018 18:36
> > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Cc: Menno Zonneveld <menno@xxxxxxxx>; Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx>
> > Subject: Re: Rados performance inconsistencies, lower than expected performance
> > 
> > On Thu, Sep 06, 2018 at 05:15:26PM +0200, Marc Roos wrote:
> > > 
> > > It is idle, testing still, running a backup's at night on it.
> > > How do you fill up the cluster so you can test between empty and full?
> > > Do you have a "ceph df" from empty and full?
> > > 
> > > I have done another test disabling new scrubs on the rbd.ssd pool (but
> > > still 3 on hdd) with:
> > > ceph tell osd.* injectargs --osd_max_backfills=0
> > > Again getting slower towards the end.
> > > Bandwidth (MB/sec):     395.749
> > > Average Latency(s):     0.161713
> > In the results you both had, the latency is twice as high as in our
> > tests [1]. That can already make quite some difference. Depending on the
> > actual hardware used, there may or may not be the possibility for good
> > optimisation.
> > 
> > As a start, you could test the disks with fio, as shown in our benchmark
> > paper, to get some results for comparison. The forum thread [1] has
> > some benchmarks from other users for comparison.
> > 
> > [1] https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
> 
> Thanks for the suggestion, I redid the fio test and one server seems to be
> causing trouble.
> 
> When I initially tested our SSDs according to the benchmark paper, our Intel
> SSDs performed more or less on par with the Samsung SSDs used.
> 
> from fio.log
> 
> fio: (groupid=0, jobs=1): err= 0: pid=3606315: Mon Sep 10 11:12:36 2018
>   write: io=4005.9MB, bw=68366KB/s, iops=17091, runt= 60001msec
>     slat (usec): min=5, max=252, avg= 5.76, stdev= 0.66
>     clat (usec): min=6, max=949, avg=51.72, stdev= 9.54
>      lat (usec): min=54, max=955, avg=57.48, stdev= 9.56
> 
> However, one of the other machines (with identical SSDs) now performs poorly
> compared to the others, with these results:
> 
> fio: (groupid=0, jobs=1): err= 0: pid=3893600: Mon Sep 10 11:15:17 2018
>   write: io=1258.8MB, bw=51801KB/s, iops=12950, runt= 24883msec
>     slat (usec): min=5, max=259, avg= 6.17, stdev= 0.78
>     clat (usec): min=53, max=857, avg=69.77, stdev=13.11
>      lat (usec): min=70, max=863, avg=75.93, stdev=13.17
> 
> I'll first resolve the slower machine before doing more testing as this surely
> won't help overall performance.
> 
> 
> > --
> > Cheers,
> > Alwin
> 
> Thanks!
> Menno
