On Thu, Sep 13, 2018 at 02:17:20PM +0200, Menno Zonneveld wrote:
> Update on the subject. Warning: lengthy post, but with reproducible results and a workaround to get performance back to the expected level.
>
> One of the servers had a broken disk controller that caused performance issues on that host; FIO showed about half the performance on some disks compared to the other hosts. The controller has been replaced, but rados performance did not improve.
>
> - write: 313.86 MB/s / 0.203907ms
>
> I figured it would be wise to test all servers and disks to see if they deliver the expected performance.
>
> Since I have data on the cluster that I wanted to keep online, I did one server at a time: delete the 3 OSDs, FIO test the disks, recreate the OSDs and add them back to the cluster, wait until the cluster is healthy, and move on to the next server. All disks match the expected performance now, so with all servers and OSDs up again I redid the rados benchmark.
>
> Performance was almost twice as good as before recreating all OSDs and on par with what I had expected for bandwidth and latency.
>
> - write: 586.679 MB/s / 0.109085ms
> - read: 2092.27 MB/s / 0.0292913ms
>
> As the controller was not to blame, I wanted to test whether having different size OSDs with ‘correct’ weights assigned was causing the issue. I removed one OSD on each storage node (one node at a time), re-partitioned it and added it back to the cluster with the correct weight. Performance was still OK, though a little slower than before.
>
> Figuring this wasn’t the cause either, I took out the OSDs with partitions again, wiped the disks and recreated the OSDs. Performance was now even lower, almost as low as when I had just swapped the controller.
>
> Since I knew the performance could be better, I decided to recreate all OSDs one server at a time, and performance once again was good.
>
> Since I was now able to reproduce the issue, I started once more and documented all the steps to see if there is any logic to the issue.
>
> With the cluster performing well, I started removing one OSD at a time: wait for the cluster to become healthy again, benchmark, add the OSD back, and move on to the next server.
>
> These are the results of each step.
>
> One OSD out:
>
> write: 528.439 / 0.121021
> read: 2022.19 / 0.03031
>
> OSD back in again:
>
> write: 584.14 / 0.10956
> read: 2108.06 / 0.0289867
>
> Next server, one OSD out:
>
> write: 482.923 / 0.132512
> read: 2008.38 / 0.0305356
>
> OSD back in again:
>
> write: 578.034 / 0.110686
> read: 2059.24 / 0.0297554
>
> Next server, one OSD out:
>
> write: 470.384 / 0.136055
> read: 2043.68 / 0.0299759
>
> OSD back in again:
>
> write: 424.01 / 0.150886
> read: 2086.94 / 0.0293182
>
> Write performance is now significantly lower than when I started. When I first wrote to the mailing list, performance seemed to go up once Ceph entered the 'near-full' state, so I decided to test that again.
>
> I reached full by accident, and the last two write tests showed somewhat better performance, but not near the level I started with.
>
> write: 468.632 / 0.136559
> write: 488.523 / 0.130999
>
> I removed the benchmark pool and recreated it, testing a few more times. Performance now seems even lower and is again near the results I started off with.
>
> write: 449.034 / 0.142524
> write: 399.532 / 0.160168
> write: 366.831 / 0.174451
>
> I know how to get the performance back to the expected level by recreating all OSDs and shuffling data around the cluster, but I don’t think this should happen in the first place.
>
> Just to clarify: when removing an OSD I reweight it to 0 and wait until it is safe to destroy the OSD. I assume this is the correct way of doing such things.
>
> Am I doing something wrong? Did I run into some sort of bug?

AFAIR, object deletion is done in the background. Depending on how quickly you run the subsequent tests and how much recovery work the cluster is still doing, the results may well vary.
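To rule that out, each benchmark could be gated on the cluster reporting healthy again. A minimal sketch, assuming `HEALTH_OK` from `ceph health` is a good enough signal that recovery has finished (background deletion may still be running after that). The function name `wait_for_health_ok`, the poll interval and the number of attempts are made up for illustration:

```shell
# Poll `ceph health` until it reports HEALTH_OK.
# $1: seconds between polls (hypothetical default: 10)
# $2: maximum number of polls (hypothetical default: 360)
# Returns 0 once the cluster is healthy, 1 if it never got there.
wait_for_health_ok() {
    local interval="${1:-10}"
    local max_tries="${2:-360}"
    local i=0
    local status
    while [ "$i" -lt "$max_tries" ]; do
        status="$(ceph health 2>/dev/null)"
        case "$status" in
            HEALTH_OK*) return 0 ;;
        esac
        echo "cluster not healthy yet (${status:-no answer}), waiting.." >&2
        sleep "$interval"
        i=$((i + 1))
    done
    return 1
}
```

It could then be used as `wait_for_health_ok && sleep 60 && rados bench -p <pool> 60 write`, where the pool name and the extra 60 seconds of settle time are placeholders, not recommendations.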
>
> I’m running Proxmox VE 5.2, which includes ceph version 12.2.7 (94ce186ac93bb28c3c444bccfefb8a31eb0748e4) luminous (stable)

12.2.8 is in the repositories. ;)

>
> Thanks,
> Menno
>
>
> My script to safely remove an OSD:
>
> ceph osd crush reweight osd.$1 0.0
> while ! ceph osd safe-to-destroy $1; do echo "not safe to destroy, waiting.." ; sleep 10 ; done
> sleep 5
> ceph osd out $1
> systemctl disable ceph-osd@$1
> systemctl stop ceph-osd@$1
> ceph osd crush remove osd.$1
> ceph auth del osd.$1
> ceph osd down $1
> ceph osd rm $1
>

--
Cheers,
Alwin