On Tuesday, 8 May 2012, Daniel Pocock wrote:
> On 08/05/12 14:55, Martin Steigerwald wrote:
> > On Tuesday, 8 May 2012, Daniel Pocock wrote:
> >> On 08/05/12 00:24, Martin Steigerwald wrote:
> >>> On Monday, 7 May 2012, Daniel Pocock wrote:
> >>>> On 07/05/12 20:59, Martin Steigerwald wrote:
> >>>>> On Monday, 7 May 2012, Daniel Pocock wrote:
> >>>>>>> Possibly the older disk is lying about doing cache flushes. The
> >>>>>>> wonderful disk manufacturers do that with commodity drives to make
> >>>>>>> their benchmark numbers look better. If you run some random IOPS
> >>>>>>> test against this disk, and it has performance much over 100 IOPS
> >>>>>>> then it is definitely not doing real cache flushes.
> >>>>>
> >>>>> […]
> >>>>>
> >>>>> I think an IOPS benchmark would be better, i.e. something like:
> >>>>>
> >>>>> /usr/share/doc/fio/examples/ssd-test
> >>>>>
> >>>>> (from the flexible I/O tester Debian package, also included in the
> >>>>> upstream tarball of course)
> >>>>>
> >>>>> adapted to your needs.
> >>>>>
> >>>>> Maybe with different iodepth or numjobs (to simulate several threads
> >>>>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
> >>>>> Hitachi 5400 RPM harddisk connected via eSATA.
> >>>>>
> >>>>> direct=1 is important, to bypass the pagecache.
> >>>>
> >>>> Thanks for suggesting this tool. I've run it against the USB disk and
> >>>> an LV on my AHCI/SATA/md array.
> >>>>
> >>>> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
> >>>> to CC49) and one of the disks went offline shortly after I brought the
> >>>> system back up. To avoid the risk that a bad drive might interfere
> >>>> with the SATA performance, I completely removed it before running any
> >>>> tests. Tomorrow I'm out to buy some enterprise grade drives; I'm
> >>>> thinking about Seagate Constellation SATA or even SAS.
> >>>>
> >>>> Anyway, on to the test results:
> >>>>
> >>>> USB disk (Seagate 9SD2A3-500 320GB):
> >>>>
> >>>> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
> >>>> write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
[…]
> >>> Please repeat the test with iodepth=1.
> >
> >> For the USB device:
> >>
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=11855
> >> write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
[…]
> >> and for the SATA disk:
> >>
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=12256
> >> write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
[…]
> > […]
> >> issued r/w: total=0/7005, short=0/0
> >> lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
> >> lat (msec): 250=0.09%
>
> >>> 194 IOPS appears to be highly unrealistic unless NCQ or something like
> >>> that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't
> >>> check vendor information).
> >>
> >> The SATA disk does have NCQ.
> >>
> >> The USB disk is supposed to be 5400 RPM, USB2, but is reporting
> >> iops=205.
> >>
> >> The SATA disk is 7200 RPM, 3 Gigabit SATA, but is reporting iops=116.
> >>
> >> Does this suggest that the USB disk is caching data but telling Linux
> >> the data is on disk?
> >
> > Looks like it.
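One way to check that, assuming the USB disk shows up as /dev/sdb (a
placeholder name, adapt it) and the USB-SATA bridge passes ATA commands
through at all: query the drive's volatile write cache, switch it off
and repeat the iodepth=1 random write run.

  # show the current write cache setting (may fail on USB bridges
  # without SAT passthrough)
  hdparm -W /dev/sdb
  # switch the volatile write cache off, then rerun the fio job
  hdparm -W 0 /dev/sdb
  # alternative via the SCSI layer
  sdparm --get=WCE /dev/sdb
  sdparm --clear=WCE /dev/sdb

If the IOPS then drop into the range a 5400 RPM spindle can physically
deliver - somewhere near the 54 IOPS I saw on the Hitachi - the cache
was absorbing the flushes. Just a sketch, I have not tried this with
that particular enclosure.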
> >
> > Some older values for a 1.5 TB WD Green disk:
> >
> > mango:~# fio -readonly -name iops -rw=randread -bs=512 -runtime=100
> > -iodepth 1 -filename /dev/sda -ioengine libaio -direct=1
> > [...]
> > iops: (groupid=0, jobs=1): err= 0: pid=9939
> > read : io=1,859KB, bw=19,031B/s, iops=37, runt=100024msec
> > [...]
> >
> > mango:~# fio -readonly -name iops -rw=randread -bs=512 -runtime=100
> > -iodepth 32 -filename /dev/sda -ioengine libaio -direct=1
> > iops: (groupid=0, jobs=1): err= 0: pid=10304
> > read : io=2,726KB, bw=27,842B/s, iops=54, runt=100257msec
> >
> > mango:~# hdparm -I /dev/sda | grep -i queue
> >     Queue depth: 32
> >        * Native Command Queueing (NCQ)
> >
> > - 1.5 TB Western Digital, WDC WD15EADS-00P8B0
> > - Pentium 4 with 2.80 GHz
> > - 4 GB RAM, 32-bit Linux
> > - Linux kernel 2.6.36
> > - fio 1.38-1
[…]
> >> It is a gigabit network and I think that the performance of the dd
> >> command proves it is not something silly like a cable fault (I have
> >> come across such faults elsewhere, though).
> >
> > What is the latency?
>
> $ ping -s 1000 192.168.1.2
> PING 192.168.1.2 (192.168.1.2) 1000(1028) bytes of data.
> 1008 bytes from 192.168.1.2: icmp_req=1 ttl=64 time=0.307 ms
> 1008 bytes from 192.168.1.2: icmp_req=2 ttl=64 time=0.341 ms
> 1008 bytes from 192.168.1.2: icmp_req=3 ttl=64 time=0.336 ms

Seems to be fine.

> >>> Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM
> >>> SATA drives, but SATA drives are cheaper and thus you could -
> >>> depending on RAID level - increase IOPS by just using more drives.
> >>
> >> I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
> >> in the Seagate `Constellation' enterprise drive range. I need more
> >> space anyway, and I need to replace the drive that failed, so I have to
> >> spend some money anyway - I just want to throw it in the right
> >> direction (e.g. buying a drive, or if the cheap on-board SATA
> >> controller is a bottleneck or just extremely unsophisticated, I don't
> >> mind getting a dedicated controller).
> >>
> >> For example, if I knew that the controller is simply not suitable with
> >> barriers, NFS, etc. and that a $200 RAID card or even a $500 RAID card
> >> will guarantee better performance with my current kernel, I would buy
> >> that. (However, I do want to use md RAID rather than a proprietary
> >> format, so any RAID card would be in JBOD mode.)
> >
> > The point is: how much of that performance will arrive at NFS? I can't
> > say yet.
>
> My impression is that the faster performance of the USB disk was a red
> herring, and the problem really is just the nature of the NFS protocol
> and the way it is stricter about server-side caching (when sync is
> enabled) and consequently it needs more IOPS.

Yes, that seems to be the case here. It seems to be a small blocksize
random I/O workload with heavy fsync() usage.

You could adapt /usr/share/doc/fio/examples/iometer-file-access-server
to benchmark such a scenario. fsmark also simulates such a heavy
fsync() based workload quite well. I have packaged it for Debian, but
it's still in the NEW queue. You can grab it from
http://people.teamix.net/~ms/debian/sid/ (32-bit build, but easily
buildable for amd64 as well).

> I've turned two more machines (an HP Z800 with SATA disk and a Lenovo
> X220 with SSD disk) into NFSv3 servers, repeated the same tests, and
> found similar performance on the Z800, but 20x faster on the SSD (which
> can support more IOPS).

Okay, then you want more IOPS.

> > And wait I/O is quite high.
> >
> > Thus it seems this workload can be faster with faster / more disks or a
> > RAID controller with battery (and disabling barriers / cache flushes).
>
> You mean barrier=0,data=writeback?
> Or just barrier=0,data=ordered?

I meant data=ordered. As Andreas mentioned, data=journal could yield an
improvement. I'd suggest putting the journal onto a different disk then,
in order to avoid head seeks during writeout of journal data to its
final location.
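A rough sketch of how such an external journal could be set up,
assuming the exported filesystem lives on /dev/sdb1 (unmounted while
doing this) and a partition /dev/sdc1 on the other disk is spare for
the journal - device names and the /srv/nfs mountpoint are just
placeholders:

  # create the external journal device; the block size has to match
  # the filesystem that will use it
  mke2fs -O journal_dev -b 4096 /dev/sdc1
  # drop the internal journal, then attach the external one
  tune2fs -O ^has_journal /dev/sdb1
  tune2fs -j -J device=/dev/sdc1 /dev/sdb1
  # mount with full data journalling
  mount -o data=journal /dev/sdb1 /srv/nfs

Untested here, so please double check against the mke2fs and tune2fs
manpages before pointing it at real data.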
> In theory that sounds good, but in practice I understand it creates
> some different problems, e.g.:
>
> - monitoring the battery, replacing it periodically
>
> - batteries only hold the charge for a few hours, so if there is a power
>   outage on a Sunday, someone tries to turn on the server on Monday
>   morning and the battery has died, the cache is empty and the disk is
>   corrupt

Hmmm, from what I know there are NVRAM-based controllers that can hold
the cached data for several days.

> - some RAID controllers (e.g. HP SmartArray) insist on writing their
>   metadata to all volumes - so you become locked in to the RAID vendor.
>   I prefer to just use RAID1 or RAID10 with Linux md on the raw disks.
>   On some Adaptec controllers, `JBOD' mode allows md to access the disks
>   directly, although I haven't verified that yet.

I see no reason why SoftRAID cannot be used with an NVRAM-based
controller.

> I'm tempted to just put a UPS on the server and enable NFS `async'
> mode, and avoid running anything on the server that may cause a crash.

A UPS on the server won't make "async" safe. If the server crashes you
can still lose data.

Ciao,
-- 
Martin Steigerwald - teamix GmbH - http://www.teamix.de
gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html