I am using an SSD underneath. However, my problem is not exactly related to the disk cache. I think I should give some more background.
These are my key points:
- Reads through my passthrough driver on top of lvm are slower than reads on just the lvm (with or without any kind of direct I/O).
- Reads through my passthrough driver (on top of lvm) are slower than writes through the same driver.
- If I disable lvm readahead (we can do that for any block device driver), then lvm's read performance drops to almost the same level as my passthrough driver's. This suggests that readahead is what was helping lvm's performance. But if readahead helps lvm, it should also help my passthrough driver, which sits on top of it. That led me to suspect that I am doing something in my device driver that either disables the lvm readahead, or that readahead simply does not kick in when lvm is not being driven by the kernel directly.
Or is there some problem with how I pass the request down to lvm (should I be calling something else, or passing some kind of flag)?
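For reference, readahead can be checked and changed per block device from userspace with blockdev; pdev0 is my passthrough device, and the lvm path below is just an example for my setup:

blockdev --getra /dev/pdev0              # readahead on the passthrough device, in 512-byte sectors
blockdev --getra /dev/mapper/vg0-lv0     # readahead on the lvm volume underneath (example path)
blockdev --setra 0 /dev/mapper/vg0-lv0   # how readahead can be disabled for the comparison (example path)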
Regards,
Neha
On Thu, Apr 11, 2013 at 5:02 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
On Thu, Apr 11, 2013 at 4:48 PM, neha naik <nehanaik27@xxxxxxxxx> wrote:
> Hi Greg,
> Thanks a lot. Everything you said made complete sense to me, but when I
> tried running with the following options my read is so slow (basically with
> direct I/O, at 1 MB/s it will take 32 minutes just to read 32 MB of data),
> yet my write is doing fine. Should I use some other options of dd? (Though I
> understand that with direct we bypass all caches, direct doesn't
> guarantee that everything is written when the call returns to the user,
> which is why I am using fdatasync.)
>
> time dd if=/dev/shm/image of=/dev/sbd0 bs=4096 count=262144 oflag=direct
> conv=fdatasync
> time dd if=/dev/pdev0 of=/dev/null bs=4096 count=262144
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 17.7809 s, 60.4 MB/s
>
> real 0m17.785s
> user 0m0.152s
> sys 0m1.564s
>
>
> I interrupted the dd for the read because it was taking too much time at 1 MB/s:
> time dd if=/dev/pdev0 of=/dev/null bs=4096 count=262144 iflag=direct
> conv=fdatasync
> ^C150046+0 records in
> 150045+0 records out
> 614584320 bytes (615 MB) copied, 600.197 s, 1.0 MB/s
>
>
> real 10m0.201s
> user 0m2.576s
> sys 0m0.000s
Before reading the below, please note that rotating disks are made of
zones with a constant number of sectors/track. In the below I treat
1 track as holding 1 MB of data. I believe that is roughly accurate
for an outer track with nearly 3" of diameter. An inner track with
roughly 2" of diameter would only hold about 2/3 of 1 MB. I am
ignoring that for simplicity's sake. You can worry about it yourself
separately.
====
When you use iflag=direct, you are telling the kernel: "I know what I'm
doing, just do it."
So let's do some math and see if we can figure it out. I assume you
are working with rotating media as your backing store for the LVM
volumes.
A rotating disk with 6000 RPMs takes 10 milliseconds per revolution.
(I'm using this because the math is easy. Check the specs for your
drives.)
With iflag=direct, you have taken control of interacting with a
rotating disk that can only read data once every rotation. That is, the
relevant sectors are only below the read head once every 10 msecs.
So, you are saying, give me 4KB every time the data rotates below the
read head. That happens about 100 times per second, so per my logic
you should be seeing a 400 KB/sec read rate.
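Spelling that arithmetic out (just back-of-the-envelope shell arithmetic):

# 6000 RPM drive servicing one 4 KiB read per revolution
echo "6000 / 60" | bc    # 100 revolutions per second
echo "100 * 4096" | bc   # 409600 bytes/sec, i.e. roughly 400 KB/sec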
You are actually getting roughly twice that. Thus my question is what
is happening in your setup such that you are getting 10 KB per rotation
instead of the 4 KB you asked for. (The answer could be that you have
15K RPM drives instead of the 6K RPM drives I calculated for.)
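For what it's worth, the same arithmetic for a 15K RPM drive lands almost exactly on the 1.0 MB/s you measured, so that is worth checking against your drive specs:

# 15000 RPM -> 250 revolutions per second, still 4 KiB per revolution
echo "(15000 / 60) * 4096" | bc   # 1024000 bytes/sec, i.e. about 1.0 MB/sec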
My laptop is giving 20MB/sec with bs=4KB which implies I'm getting 50x
the speed I expect from the above theory. I have to assume some form
of read-ahead is going on and reading 256KB at a time. That logic may
be in my laptop's disk and not the kernel. (I don't know for sure).
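If you want to see whether the drive itself has read look-ahead turned on, hdparm can report that (assuming an ATA drive; /dev/sda below is just a placeholder for whatever your device is):

hdparm -A /dev/sda   # the drive's own read-lookahead feature (on/off); /dev/sda is a placeholder
hdparm -a /dev/sda   # the kernel-side readahead setting for the same device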
Arlie recommended 1 MB reads. That should be a lot faster because a
disk track is roughly 1 MB, so you are telling the disk: As you spin,
when you get to the sector I care about, do a continuous read for a
full rotation (1MB). By the time you ask for the next 1MB, I would
expect it will be too late to get the very next sector, so the drive
would do a full rotation looking for your sector, then do a continuous
1MB read.
So, if my logic is right the drive itself is doing:
rotation 1: searching for first sector of read
rotation 2: read 1MB continuously
rotation 3: searching for first sector of next read
rotation 4: read 1MB continuously
I just checked my laptop's drive, and with bs=1MB it actually achieves
more or less max transfer rate, so for it at least with 1MB reads the
cpu / drive controller is able to keep up with the rotating disk and
not have the 50% wasted rotations that I would actually expect.
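For reference, the kind of run I mean is along these lines; the device name is a placeholder for whatever disk you are testing:

dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct   # 1 GB of 1 MB direct reads off the raw device (placeholder name)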
Again it appears something is doing some read ahead. Let's assume my
laptop's disk does a 256KB readahead every time it gets a read
request. So when it gets that 1MB request, it actually reads
1MB+256KB, but it returns the first 1MB to the cpu as soon as it has
it. Thus when the 1MB is returned to the cpu, the drive is still
working on the next 256KB and putting it in on-disk cache. If 256KB
is 1/4 of a track's data, then it takes the disk about 2.5 msecs to
read that data from the rotating platter to the drive's internal controller
cache. If during that 2.5 msecs the cpu issues the next 1MB read
request, the disk will just continue reading and not have any dead
time.
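Rough numbers again, assuming the 256KB guess and the ~1MB track from above:

# 256 KB readahead as a fraction of a ~1 MB track, at 10 ms per revolution
echo "scale=2; (256 / 1024) * 10" | bc   # 2.50 msecs to pull that readahead off the platter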
If you want to understand exactly what is happening you would need to
monitor exactly what is going back and forth across the SATA bus. Is
the kernel doing a read-ahead even with direct io? Is the drive doing
some kind of read ahead? etc.
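One way to watch it from the host side is blktrace (assuming it is installed); it cannot show what the drive does internally, but it does show exactly which requests the kernel sends down:

blktrace -d /dev/pdev0 -o - | blkparse -i -   # trace requests hitting the passthrough device while the dd runs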
If you are going to work with direct io, hopefully the above gives you
a new way to think about things.
Greg
_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@xxxxxxxxxxxxxxxxx
http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies