On Tue, Jan 31, 2012 at 09:52:05AM -0500, Christoph Hellwig wrote:
> Try playing with the following tweaks to get larger I/O to the disk:
>
>  a) make sure you use the noop or deadline elevators
>  b) increase /sys/block/sdX/queue/max_sectors_kb from its low default
>  c) dramatically increase /sys/devices/virtual/bdi/<major>:<minor>/read_ahead_kb

The default settings on this system are:

# cat /sys/block/sdb/queue/scheduler
noop [deadline] cfq
(I think the one in brackets is the active one)

# cat /sys/block/sdb/queue/max_sectors_kb
512

# cat /sys/devices/virtual/bdi/8:16/read_ahead_kb
128

I did a series of tests where I increased either or both of those to
1024, but it made no difference to read throughput, which sat stubbornly
at 25MB/sec. However, I could see the difference in btrace and in the
tps figures, which showed a smaller number of larger transfers taking
place. It was clearly doing the right thing: seek and read a large block
of data, seek and read the next large block of data, and so on.

Writing the files should (to my mind) require the same amount of seeking
and of disk passing under the head, yet it runs at 75MB/sec.

So I realised: when you are writing lots of files with write-behind
caching, the drive has a lot of opportunity to reorder those writes to
minimise seek and rotational latency. But when reading in a single
thread, everything is strictly sequential: seek one - read one - seek
two - read two - ...

It turns out the -c (concurrency) option to bonnie++ is ignored for the
file creation and reading tests. So my next steps were:

* Run 4 instances of bonnie++ concurrently on the same filesystem. This
  did make some improvement. For the sequential file reading part it got
  me to 30MB/sec with bonnie++ reading 32k chunks, and 40MB/sec with it
  reading 1024k chunks. It fell back to 33MB/sec in the random file
  reading part.

* Write a script to do just a random read test, with varying levels of
  concurrency. The script is given below: it forks a varying number of
  processes, each of which runs "dd" sequentially on a subset of the
  files.

First, running with the default params (max_sectors_kb=512, read_ahead_kb=128):

 #p files/sec dd_args
  1      39.87 bs=1024k iflag=direct    => 25.9 MB/sec
  1      42.51 bs=1024k
  2      43.88 bs=1024k iflag=direct
  2      29.53 bs=1024k
  5      57.40 bs=1024k iflag=direct
  5      43.48 bs=1024k
 10      68.68 bs=1024k iflag=direct
 10      48.02 bs=1024k
 20      75.51 bs=1024k iflag=direct
 20      53.08 bs=1024k
 50      79.37 bs=1024k iflag=direct    => 51.6 MB/sec
 50      51.30 bs=1024k

The files have an average size of 0.65MB, so I've converted some of the
files/sec figures into MB/sec (e.g. 39.87 files/sec x 0.65 MB/file =
25.9 MB/sec).

What I found surprising was that performance is lower with iflag=direct
for a single reader, but much higher with iflag=direct for concurrent
readers.

So I tried again with max_sectors_kb=1024, read_ahead_kb=1024:

 #p files/sec dd_args
  1      39.95 bs=1024k iflag=direct
  1      42.21 bs=1024k
  2      43.14 bs=1024k iflag=direct
  2      47.93 bs=1024k
  5      56.68 bs=1024k iflag=direct
  5      61.95 bs=1024k
 10      68.35 bs=1024k iflag=direct
 10      75.50 bs=1024k                 => 49.1 MB/sec
 20      75.74 bs=1024k iflag=direct
 20      83.36 bs=1024k                 => 54.2 MB/sec
 50      79.45 bs=1024k iflag=direct
 50      86.58 bs=1024k                 => 56.3 MB/sec

Now it works better without iflag=direct. With 20+ readers the
throughput is approaching decent, albeit still some way from the
75MB/sec I achieve when writing.
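(Aside: for anyone wanting to repeat this, here is a minimal, untested
sketch of how the parameter combinations above could be switched from a
script instead of echoing into sysfs by hand. The device name and the
values are examples only; the bdi name is derived from the device's
major:minor numbers rather than hard-coded.)

#!/usr/bin/ruby -w
# Untested sketch: apply one tuning combination before a test run.
# Example values only - adjust DEV and the settings to taste.
DEV = "sdc"

st  = File.stat("/dev/#{DEV}")
bdi = "#{st.rdev_major}:#{st.rdev_minor}"   # e.g. "8:32" for sdc

def set(path, value)
  File.open(path, "w") { |f| f.puts value }
end

set("/sys/block/#{DEV}/queue/scheduler", "deadline")
set("/sys/block/#{DEV}/queue/max_sectors_kb", 1024)
set("/sys/devices/virtual/bdi/#{bdi}/read_ahead_kb", 1024)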
Next with max_sectors_kb=128, read_ahead_kb=1024 (just to see if smaller
SATA transfers work better than large ones):

 #p files/sec dd_args
  1      39.74 bs=1024k iflag=direct
  1      42.49 bs=1024k
  2      43.92 bs=1024k iflag=direct
  2      48.22 bs=1024k
  5      56.39 bs=1024k iflag=direct
  5      62.61 bs=1024k
 10      61.50 bs=1024k iflag=direct
 10      68.67 bs=1024k
 20      68.21 bs=1024k iflag=direct
 20      75.28 bs=1024k                 => 48.9 MB/sec
 50      68.36 bs=1024k iflag=direct
 50      75.32 bs=1024k                 => 49.0 MB/sec

Maybe a tiny improvement at low concurrency, but worse at high
concurrency (presumably the larger number of queued I/Os is hitting a
queue depth limit).

Finally with max_sectors_kb=1024, read_ahead_kb=1024, and the noop
scheduler:

root@storage1:~# echo noop >/sys/block/sdc/queue/scheduler
root@storage1:~# cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq

 #p files/sec dd_args
  1      40.19 bs=1024k iflag=direct
  1      41.98 bs=1024k
  2      43.63 bs=1024k iflag=direct
  2      48.24 bs=1024k
  5      56.97 bs=1024k iflag=direct
  5      62.86 bs=1024k
 10      68.68 bs=1024k iflag=direct
 10      76.81 bs=1024k
 20      76.03 bs=1024k iflag=direct
 20      85.17 bs=1024k                 => 55.4 MB/sec
 50      76.58 bs=1024k iflag=direct
 50      83.66 bs=1024k

This may be slightly better than the deadline scheduler, until we get
to 50 concurrent readers.

FWIW, this is the controller:

[ 12.855639] mpt2sas1: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (8183856 kB)
[ 12.855801] mpt2sas1: PCI-MSI-X enabled: IRQ 68
[ 12.855804] mpt2sas1: iomem(0x00000000fb8c0000), mapped(0xffffc90011788000), size(16384)
[ 12.855806] mpt2sas1: ioport(0x0000000000008000), size(256)
[ 13.142189] mpt2sas1: sending message unit reset !!
[ 13.150164] mpt2sas1: message unit reset: SUCCESS
[ 13.323195] mpt2sas1: Allocated physical memory: size(16611 kB)
[ 13.323200] mpt2sas1: Current Controller Queue Depth(7386), Max Controller Queue Depth(7647)
[ 13.323203] mpt2sas1: Scatter Gather Elements per IO(128)
[ 13.553727] mpt2sas1: LSISAS2116: FWVersion(05.00.13.00), ChipRevision(0x02), BiosVersion(07.11.00.00)
[ 13.553737] mpt2sas1: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[ 13.553814] mpt2sas1: sending port enable !!
[ 13.555001] mpt2sas1: port enable: SUCCESS
[ 13.559519] mpt2sas1: host_add: handle(0x0001), sas_addr(0x500062b2000b4c00), phys(16)

and the drives are Hitachi Deskstar 5K3000 HDS5C3030ALA630:

[ 13.567932] scsi 5:0:1:0: Direct-Access     ATA      Hitachi HDS5C303 A580 PQ: 0 ANSI: 5
[ 13.567946] scsi 5:0:1:0: SATA: handle(0x0012), sas_addr(0x4433221101000000), phy(1), device_name(0xcca2500032cd28c0)
[ 13.567953] scsi 5:0:1:0: SATA: enclosure_logical_id(0x500062b2000b4c00), slot(1)
[ 13.568041] scsi 5:0:1:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[ 13.568049] scsi 5:0:1:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(6), cmd_que(1)
[ 13.568185] sd 5:0:1:0: Attached scsi generic sg2 type 0
[ 13.569753] sd 5:0:1:0: [sdc] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
...
[ 13.797275] sd 5:0:1:0: [sdc] Write Protect is off
[ 13.797284] sd 5:0:1:0: [sdc] Mode Sense: 73 00 00 08
[ 13.800400] sd 5:0:1:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

They are "CoolSpin" drives of unspecified RPM, i.e. definitely not high
performance in absolute terms. But if I can get the filesystem read
throughput to approach the write throughput, I'll be extremely happy.

Regards,

Brian.
------ 8< ---------------------------------------------------------------
#!/usr/bin/ruby -w
# Fork N processes, each running dd sequentially over its own subset of
# the corpus, and report aggregate files/sec for each concurrency level.

CORPUS = "/data/sdc/Bonnie.26384/*/*"

module Perftest
  def self.run(n_files, n_procs=1, dd_args="")
    # Pick n_files files in random order and split them into one chunk
    # per process.
    files = Dir[CORPUS].sort_by { rand }[0, n_files]
    chunks = files.each_slice(n_files/n_procs).to_a[0, n_procs]
    n_files = chunks.map { |chunk| chunk.size }.inject(:+)
    t1 = Time.now
    @pids = chunks.map { |chunk| fork { run_single(chunk, dd_args); exit! } }
    # Reap the children as they finish, removing them from the kill list.
    @pids.delete_if { |pid| Process.waitpid(pid) }
    t2 = Time.now
    printf "%3d %10.2f %s\n", n_procs, n_files/(t2-t1), dd_args
  end

  def self.run_single(files, dd_args)
    files.each do |f|
      system("dd if='#{f}' of=/dev/null #{dd_args} 2>/dev/null")
    end
  end

  def self.kill_all(sig="TERM")
    (@pids || []).each { |pid| Process.kill(sig, pid) rescue nil }
  end
end

at_exit { Perftest.kill_all }

puts " #p files/sec dd_args"
[1,2,5,10,20,50].each do |nprocs|
  Perftest.run(4000, nprocs, "bs=1024k iflag=direct")
  Perftest.run(4000, nprocs, "bs=1024k")
end
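As an aside, each file read costs a fork+exec of dd, which at these
rates (under 100 files/sec) should be negligible; but to rule it out
entirely, a hypothetical, untested in-process replacement for run_single
(buffered reads only - there's no easy equivalent of iflag=direct from
plain Ruby) might look like this:

module Perftest
  # Read each file to EOF in 1 MiB chunks and discard the data,
  # instead of shelling out to dd. dd_args is ignored here.
  def self.run_single(files, dd_args="")
    files.each do |f|
      File.open(f, "rb") do |io|
        nil while io.read(1024 * 1024)
      end
    end
  end
end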