Hi,

currently in use:

oldest:
SSDs: Intel S3510 80 GB
HDDs: HGST 6 TB H3IKNAS600012872SE NAS

latest:
SSDs: Kingston 120 GB SV300
HDDs: HGST 3 TB H3IKNAS30003272SE NAS

in future will be in use:
SSDs: Samsung SM863 240 GB
HDDs: HGST 3 TB H3IKNAS30003272SE NAS and/or Seagate ST2000NM0023 2 TB

-----

It is hard to say if, and how often, the newer nodes fail with OSDs going down/out compared to the old ones. We did a lot to avoid that. Without having real numbers, my feeling is/was that the newer ones fail much less often. But what exactly is responsible for that is unknown.

In the very end, the old nodes, with a 2x 2.3 GHz Intel Celeron (2 cores, no HT) and 3x 6 TB HDDs, have much less CPU power per HDD than the 4x 3.3 GHz Intel E3-1225v5 (4 cores) with 10x 3 TB HDDs. It is just too different: CPU, HDD, RAM, even the HDD controller.

I will have to make sure that the new cluster has enough hardware headroom, so that I don't need to consider possible problems there.

------

atop: sda/sdb == SSD journal

------

That was my first experience too. At the very beginning, deep scrubs and even normal scrubs were driving the %WA and the busy time of the HDDs to a flat 100%.

------

I rechecked it with munin. The journal SSDs go from ~40% up to 80-90% during deep scrub. The HDDs go from ~20% up to a more or less flat 90-100% during deep scrub.

At the same time, the load average goes to 16-20 (4 cores) while the CPU sees up to 318% I/O wait (out of a max of 400%).

------

The OSDs receive a peer timeout. Which is understandable if the system sees ~300% I/O wait for just long enough.

------

And yes, as it seems, clusters which are very busy, especially with limited hardware resources, need much more than the standard config can/will deliver. As soon as the LTS is out I will have to start busting my head with the available config parameters.
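For reference, the parameters I expect to look at first are the scrub throttling and heartbeat grace options. A rough sketch of what such an [osd] section could look like; the values below are untested assumptions / starting points, not recommendations:

  [osd]
  # pause between scrub chunks so client I/O can get through
  osd scrub sleep = 0.1
  # don't start new scrubs while the node is already loaded
  osd scrub load threshold = 2.0
  # one scrub per OSD at a time (the default, listed for clarity)
  osd max scrubs = 1
  # restrict scrubbing to the night hours (option exists in newer releases)
  osd scrub begin hour = 1
  osd scrub end hour = 6
  # give busy OSDs more time before peers report them down
  # (may need to go into [global] so the mons use the same grace)
  osd heartbeat grace = 60

Presumably the same values can first be tried at runtime with something like "ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'" before writing them into ceph.conf.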
-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.04.2016 um 05:06 schrieb Christian Balzer:
>
> Hello,
>
> On Sat, 9 Apr 2016 02:14:45 +0200 Oliver Dzombic wrote:
>
>> Hi Christian,
>>
>> Yeah, I saw the problems with the cache tier in the current Hammer.
>>
>> But as far as I can see, I would not run into those scenarios. I don't plan to change settings like that and let it go to rubbish.
>>
> Shouldn't, but I'd avoid it anyway.
>
>> But I have already decided to wait for Jewel, create a whole new cluster and copy all the data.
>>
> Sounds like a safer alternative.
>
>> -
>>
>> I am running KVM instances, and will also run OpenVZ instances, maybe LXC too, let's see. They run all kinds of different, independent applications.
>>
>> -
>>
>> Well, I have to admit the beginnings of the cluster were quite experimental. I was using 4x (2x 2.3 GHz Intel Celeron CPUs for 3x 6 TB HDD + 80 GB SSD, with 16 GB RAM) and extended it by 2 additional nodes of that kind, and currently also an E3-1225v5 with 32 GB RAM, 10x 3 TB HDD and 2x 120 GB SSD.
>>
> Would you mind sharing what exact models of HDDs and SSDs you're using?
> Also, is the newer node showing the same ratio of unresponsive OSDs as the older ones?
>
> In the atop output you posted, which ones are the SSDs (if they're in there at all)?
>
>> But all my munin tells me it is HDD related; if you want I can show it to you. I guess the hardcore random access on the drives is just killing them.
>>
> Yup, I've seen that with the "bad" cluster here; the first thing to indicate things were getting to the edge of IOPS capacity was that deep-scrubs killed performance, and then even regular scrubs.
>
>> I also deactivated (deep) scrub because of this problem and just let it run in the night, like now, and I am seeing 90% utilization on the journals and 97% utilization on the HDDs.
>>
> This confuses me as well: during deep-scrubs all data gets read, so your journals shouldn't get busier than they were before, and last time you mentioned them being around 60% or so?
>
>> And yes, it is simply fixed by restarting the OSDs.
>>
>> They receive a heartbeat timeout and just go out/down.
>>
> Which timeout is it, the peer one or the monitor one?
> Have you tried upping the various parameters to prevent this?
>
>> I tried to set the flag so that there will be no out/down.
>> That worked. The OSD did not get marked out/down, but it happened anyway and the cluster became unstable (misplaced objects / recovery).
>>
> That's a band-aid indeed, but I wouldn't expect misplaced objects from it.
>
>> The way I see the situation: if a VM has a file open and is using it "right now", and that file is located in a PG on the OSD that is going down/out "right now", then the filesystem of the VM will get into trouble.
>>
>> It will see a bus error. Depending on the amount of data on that OSD, and depending on how many VMs are accessing their data "right now", you will have a lot of VMs receiving a bus error.
>>
>> But it gets even worse. It can, and on Linux operating systems in most cases will, happen that such a situation causes an automatic read-only remount of the root partition. And this way, of course, the server will basically stop doing its job.
>>
>> And, as if that were not bad enough, as long as you don't have your OSD back up and in, the VMs will not reboot.
>>
>> Maybe because everything is simply too slow, or maybe you had the luck that the primary OSD went down; until that is rebalanced, the VM will have I/O errors and will not be able to access its HDD.
>>
>> I already wrote about this here:
>>
>> http://article.gmane.org/gmane.comp.file-systems.ceph.user/27899/match=data+inaccessable+after+single+osd+down+default+size+3+min+1
>>
>> but didn't get any reaction to it.
>>
>> As I see it, Ceph's resilience is primarily focused on not losing data in cases of (hardware) failure, and on being able to use at least some part of your infrastructure until the other part has recovered.
>>
>> Based on my experience, Ceph is not very resilient when it comes to the data that is being accessed right at the time the hardware failure occurs.
>>
> People (including me) have seen things like that before and there are definitely places in Ceph where behavior like this can/needs to be improved.
> However this is also very dependent on your timeouts (in case of unexpected OSD failures) and the loading of your cluster (how long does it take for the PGs to re-peer, etc).
>
>> But, to be fair, two points are very important:
>>
>> 1. my Ceph config is for sure very basic and certainly has some room for improvement
>>
>> 2. Windows filesystems are able to handle that situation much better.
>>
>> Linux filesystems will in many cases have
>>
>> errors=remount-ro in their /etc/fstab by default
>>
>> (mostly seen on Debian-based distributions, but also others), which causes the automatic read-only remount on those bus errors.
>>
>> Windows does not have that. As far as I can see, CentOS/RedHat doesn't have it either, so those will just continue to work.
>>
>> So in the very end, that's not Ceph's fault (only).
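(For reference: the fstab entry in question usually looks like the sketch below. Device and filesystem are placeholders; errors=continue would avoid the automatic read-only remount, at the price of the filesystem carrying on after detected errors.)

  # /etc/fstab inside the VM (placeholder device and filesystem)
  /dev/vda1  /  ext4  defaults,errors=remount-ro  0  1
  # alternative: log the error and keep going instead of remounting read-only
  # /dev/vda1  /  ext4  defaults,errors=continue   0  1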
>>
>> But as it seems, reads/writes that were in flight at the moment of the hardware failure are not in all cases buffered and sent again.
>>
>> ----
>>
>> So far my experience, and my theoretical thinking about what I see.
>>
>> On the other side, if I regularly just turn off a node or pull the network cable, there are no bus errors. So Ceph can handle those scenarios better.
>>
>> But especially when HDDs are going out/down because of a heartbeat timeout, it seems Ceph is not that resilient.
>>
> As I said, load plays a factor; whether it clearly can't communicate with an OSD, or that OSD is just very slow to respond, are different things.
>
>> ----
>>
>> From my experience, the more I/O wait the server has, the higher the risk of OSDs going down. That is all I could see in terms of a discernible pattern.
>>
>> Right now, scrubbing is running. I have no special settings for this. All standard.
>>
> Change that, especially the sleep time.
>
>> atop looks like this:
>>
>> http://pastebin.com/mubaZbk2
>>
> That looks a lot more reasonable than the 100ms+ times you quoted from Munin, pretty typical for very busy HDDs.
> Again, which ones are the SSDs?
>
>> ----
>>
>> The distance between the two datacenters is a few km (within the same city).
>>
>> Latency is:
>>
>> rtt min/avg/max/mdev = 0.528/0.876/1.587/0.333 ms
>>
>> So that should not be a big issue. But it will be changed.
>>
>> ----
>>
>> The plan is that the cache pool will be ~1 TB against 30-60 TB of raw HDD capacity, on each of 2-3 OSD nodes.
>>
>> As I see the situation, that should be enough to end this random access stuff, turning it into a more linear stream going to and from the cold HDDs.
>>
> That's still a tad random once the cache gets full and starts flushing, and at least 4 MB (one object) per write.
> But yes, at least for me the backing storage has no issues now with both promotions and flushing happening.
>
>> In any case, with a cache the situation can only improve. If I see that the hot cache fills up instantly, I will have to change the strategy again. But as I see the situation, each VM might have maybe up to 1-5% "hot" data on average.
>>
> It will definitely improve things, but to optimize your cluster performance you're probably best off with very aggressive read-recency settings or readforward cache mode.
>
> Christian
>
>> So I think / hope things can only get better with some faster drives in between.
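PS regarding the read-recency / readforward suggestion above: if I go that route, I assume the relevant commands look roughly like the lines below ("cache-pool" is a placeholder name, the numbers are guesses, and depending on the release setting readforward may require an extra confirmation flag):

  # redirect reads to the base pool, keep writeback behaviour for writes
  ceph osd tier cache-mode cache-pool readforward
  # only promote objects seen in at least 2 of the recent HitSets
  ceph osd pool set cache-pool min_read_recency_for_promote 2
  # keep enough HitSet history for that recency check to mean something
  ceph osd pool set cache-pool hit_set_count 4
  ceph osd pool set cache-pool hit_set_period 1200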