On May 28, 2010, at 3:36 PM, Gordan Bobic wrote:
Vincent Diepeveen wrote:
The big speedup that SSDs deliver for average usage comes ESPECIALLY from the faster random access to the hardware.
Sure - on reads. Writes are a different beast. Look at some
reviews of SSDs of various types and generations. Until
relatively recently, random write performance (and to a large
extent, any write performance) on them has been very poor. Cheap
flash media (e.g. USB sticks) still suffers from this.
You wouldn't want to optimize a file system for hardware of the past, would you?
By the time a file system reaches any maturity, the hardware that is the standard today will be very common.
There are a few problems with that line of reasoning.
1) Legacy support is important. If it wasn't, file systems would be
strictly in the realm of fixed disk manufacturers, and we would all
be using object based storage. This hasn't happened, nor is it
likely to in the next decade.
2) We cannot optimize for hardware of the future, because this
hardware may never arrive.
3) "Hardware of the past" is still very much in full production,
and isn't going away any time soon.
The only sane option is to optimize for what is prevalent right now.
If you have some petabytes of storage, I guess the bigger bandwidth that SSDs deliver is not relevant, as the limitation is the network bandwidth anyway, so some RAID5 with an extra spare will deliver more than sufficient bandwidth.
RAID3/4/5/6 is inherently unsuitable for fast random writes because of the read-modify-write cycle required to update the parity.
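As a rough illustration of that penalty, here is a minimal sketch (not any real RAID implementation; the disk_read/disk_write helpers and the block size are made up) of what a single small write to a RAID5 stripe costs:

#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* Hypothetical helpers: read/write one block on a member disk. */
void disk_read(int disk, uint64_t lba, uint8_t *buf);
void disk_write(int disk, uint64_t lba, const uint8_t *buf);

/* Updating one data block forces reading the old data and the old
 * parity, an XOR, and two writes: four I/Os for one logical write. */
void raid5_small_write(int data_disk, int parity_disk, uint64_t lba,
                       const uint8_t *new_data)
{
    uint8_t old_data[BLOCK_SIZE], parity[BLOCK_SIZE];

    disk_read(data_disk, lba, old_data);    /* I/O 1: old data   */
    disk_read(parity_disk, lba, parity);    /* I/O 2: old parity */

    /* new_parity = old_parity ^ old_data ^ new_data */
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        parity[i] ^= old_data[i] ^ new_data[i];

    disk_write(data_disk, lba, new_data);   /* I/O 3: new data   */
    disk_write(parity_disk, lba, parity);   /* I/O 4: new parity */
}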
Nearly all major supercomputers use RAID5 with an extra spare, as do most database servers.
Can you quantify that bold statement?
I would expect vastly higher levels of RAID than RAID5 on
supercomputers, because RAID5 doesn't scale sufficiently. RAID6 is
a bit better, but still doesn't really scale. It comes down to data
error rates on disks. RAID5 with current error rates tops out at
about 6-8TB, which is pitifully small on the supercomputer scale.
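A back-of-the-envelope sketch of where that figure comes from, assuming the commonly quoted unrecoverable-read-error rate of roughly 1 in 10^14 bits for consumer drives (actual rates vary by product):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double ure_per_bit = 1e-14;          /* assumed URE rate     */
    const double bits_per_tb = 8e12;           /* bits in a decimal TB */

    /* Probability that a RAID5 rebuild, which must read the whole
     * remaining array, hits at least one unrecoverable read error. */
    for (double tb = 2.0; tb <= 12.0; tb += 2.0) {
        double p = 1.0 - pow(1.0 - ure_per_bit, tb * bits_per_tb);
        printf("read %4.0f TB during rebuild -> ~%4.1f%% chance of a URE\n",
               tb, 100.0 * p);
    }
    return 0;                                  /* compile with -lm     */
}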
I'm speaking of each micro-unit, of course. Call the bigger system whatever you want.
Each micro-unit basically gets built from a RAID5 with one extra spare.
To be very honest, over the past however many years I haven't seen anything else anywhere.
Just about all active supercomputers use this principle; note that most governments have no clue about networks and order a cheap network; very few order a good one.
I'd say, if you already overpay by some factor for expensive Intel or IBM processors, why not also order a good network?
Yet no matter what network you show up with, the total write speed that your storage delivers is always going to be a lot more than the network can deliver to it.
These machines get built to a price. Using RAID5 with an extra spare is simply cheapest and makes sense. You can't beat it price-wise.
How each micro-unit then connects to the others is yet another story, and different in each architecture.
Anybody deploying RAID5 on high-performance database servers that
are expected to have more than about 1% write:read ratio has no
business being a database administrator, IMO.
That's a very dumb statement. A single RAID5 nowadays delivers 3 Gbit of speed, and you have thousands of them.
It is only on tiny PCs such as my quad-socket Opteron box here, which run an entire database, where a higher RAID level such as RAID10 makes more sense. Yet that's a factor of 2 overhead in I/O. Isn't that a bit much?
As soon as we speak of clustered or supercomputer systems, the bandwidth to the I/O is always the bottleneck, of course.
The expensive thing is the network or the CPUs anyway, not the hard drives, as long as you don't go for SSDs :)
Besides, the majority of number-crunching software is doing stuff like matrix calculations (more than 50% of all system time in HPC goes to that), and the number of reads there is a lot higher than the number of writes.
Then again the fact that I have managed to optimize the performance
of most systems I've been called to provide consultancy on by
factors of between 10 and 1000 without requiring any new hardware
shows me that the industry is full of people who haven't got a clue
what they are doing.
The industry knows very well what it is doing; the price of RAID5 is unbeatable. Then you add an extra spare, or even two spares, so that you can allow for more fault tolerance: two disks can fail. The only remaining choice is how big you want to make that RAID5 array, whether, given your network choice, you can get away with 10-12 disks or with just 5 plus one spare.
Six disks is a common choice. You can then use that RAID unit within the grand circus at roughly 60% efficiency.
Stock exchanges are more into RAID10-type clustering, but those few hard drives that a stock exchange uses, are they relevant?
You're pulling examples out of the air, and it is difficult to
discuss them without in-depth system design information. And I
doubt you have access to that level of the system design
information of stock exchange systems unless you work for one. Do you?
Why not take a look at my Facebook to see what I do at home; that saves a lot of bandwidth on this mailing list.
So a file system should take advantage of the special properties of an SSD to be suited to this modern hardware.
The only actual benefit is decreased latency.
Which is mighty important; so the ONLY interesting type of filesystem for an SSD is one that is optimized for read and write latency rather than bandwidth, IMHO.
Indeed, I agree (up to a point). Random IOPS has long been the
defining measure of disk performance for a reason.
I'm always very careful about declaring a benchmark holy.
Most aren't, but every once in a while a meaningful one comes up.
The random IOPS one is one such (relatively rare) example.
It is especially read latency that I consider most important.
Depends on your application. Remember that reads can be sped up
by caching.
Even relatively simple caching is very difficult to improve upon when the reads are random.
The random read speed is of overwhelming influence.
20 years of experience in high-performance applications, databases
and clusters showed me otherwise. Random read speed is only an
issue until your caches are primed, or if your data set is
sufficiently big to overwhelm any practical amount of RAM you could
apply.
That's a lot of outdated machines.
I look after a number of systems running applications that are
write-bound because the vast majority of reads can be satisfied
from page cache, but writes are unavoidable because transactions
have to be committed to persistent storage.
You're assuming the working set fits in the cache, which is a very interesting assumption.
Not necessarily the whole working set, but a decent chunk of it,
yes. If it doesn't, you probably need to re-assess what you're
trying to do.
For example, on databases, as a rule of thumb you need to size your
RAM so that all indexes aggregated fit into 50-75% of your RAM. The
rest of the RAM is used for page caches for the actual data.
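Purely as a toy illustration of that rule of thumb (the 30GB aggregate index size below is just an example input):

#include <stdio.h>

int main(void)
{
    double index_gb = 30.0;               /* aggregate size of all indexes    */
    double ram_min  = index_gb / 0.75;    /* indexes take at most 75% of RAM  */
    double ram_max  = index_gb / 0.50;    /* indexes take at least 50% of RAM */

    printf("size RAM to roughly %.0f-%.0f GB\n", ram_min, ram_max);
    return 0;
}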
To put it into a different perspective: a typical RHEL server install is 5-6GB. That fits into the RAM of the machine on my desk, and almost fits into the RAM of the laptop I'm typing this email on.
If your working set is measured in petabytes, then you are probably
using some big iron from Cray or IBM with suitable amounts of
memory for your application.
Not at all. Until a few years ago they delivered 1GHz Alphas to run an entire array.
You cannot limit your performance assessment to the use-case of
an average desktop user running Firefox, Thunderbird and
OpenOffice 99% of the time. Those are not the users that file system advances of the past 30 years are aimed at.
Actually, manufacturers design CPUs based upon a careful analysis of the SPEC and Linpack benchmarks. That's how it works in reality.
Again, I'd love to hear some basis of this.
It might be helpful if I remind you that I'm a co-author of a program that's in SPECint2006. Initially it was meant for SPECint2004. Note that I won't be in the next SPECint.
I don't think there is any, outside of the realm of specialized hardware that is specifically designed for Linpack. For starters, such a design would ignore the fact that even simple things like different optimizing compilers can yield performance differences of 4-8x. CPU designers are smarter than to base their CPU design on Linpack throughput.
You seem to really have no clue how professional $100 billion companies operate.
If you sell overpriced products, as Intel does, marketing is everything. For that marketing, having something new that outperforms the old generation is everything.
All the testers seem to have in common that they always benchmark the same applications. The easiest one to design for is SPEC. SPEC takes years and years to release a benchmark, which gives manufacturers something like 4-7 years to tape out CPUs designed upon an accurate analysis of SPEC.
So for the applications that get tested in benchmarks, you put entire teams on analyzing them and speeding them up for your hardware. The same goes for others such as AMD, Sun, etc.
Now if you realize that applications for SPECint2006 were submitted years before 2004 (as it was initially meant to become SPECint2004), and you then look at which CPUs taped out some years after 2004, you'll notice that some features different manufacturers have in their new CPUs, definitely 'by accident', work very well for the programs inside SPEC.
Nehalem with Intel C++ 11.x is the ultimate design for SPECint2006 in that sense.
Beating its IPC (per core) is going to be *very* difficult.
If I were to generate them in the 'stupid' manner, which is how just about all software works, then it would be hard-drive latency bound.
Of course there is no budget for SSDs for generating it; I explained my financial situation to you already.
So, in contrast to Ken Thompson, I have to be clever.
I'm going to assume that you have already read up on file system
optimizations, WRT stride, stripe-width and block group size.
Otherwise you could find your RAID array limited to the performance
of 1 disk on random IOPS.
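The arithmetic behind those hints is simple; the sketch below just computes the values for an assumed 6-disk RAID5 with a 64KB chunk and 4KB file system blocks (for ext2/3/4 they are passed via mke2fs -E; see mke2fs(8) for the exact option spelling on your version):

#include <stdio.h>

int main(void)
{
    int chunk_kb    = 64;   /* RAID chunk size per disk (example)  */
    int fs_block_kb = 4;    /* filesystem block size               */
    int data_disks  = 5;    /* 6-disk RAID5: 5 data-bearing disks  */

    int stride       = chunk_kb / fs_block_kb;  /* fs blocks per chunk       */
    int stripe_width = stride * data_disks;     /* fs blocks per full stripe */

    printf("stride=%d stripe-width=%d\n", stride, stripe_width);
    return 0;
}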
The read latency a single SSD gets is so much better than that of old-fashioned drives.
So already ten or so years ago, together with some others, I figured out a manner of generating it that is a lot faster and that is not I/O bound but CPU bound; the CPU instructions needed have also been reduced by roughly a factor of 60.
Yet you know what? The number of reads is bigger than the number of writes. So it's a few dozen petabytes of writes in total and a bit more reads than that.
Probably I'll figure out for this run how to turn off caching, as I already do my own caching in the entire RAM.
Are you talking about reads that actually hit the disks or reads
that the application performs? If the data was recently read/
written, then chances are that the reads will have come from
caches. Pay attention to your iostat figures.
When I speak of reads I always mean reads that hit the disk. When I speak of writes I always mean writes that hit the disk. In fact, writes get done 100% sequentially.
Of course I use a relatively small amount of RAM whenever possible, because in all calculations the limit is always the CPU latency and the bandwidth to the RAM. When using a small amount of RAM, where that is possible, say a couple of hundred MB, the latency within it is always lower than when using the full gigabytes of RAM that the box has.
I'm not sure what you're talking about here. CPU cache hit rates,
maybe?
Oh la la, the big optimizer.
If you use a cache of 10GB of RAM, then the latency of a random read within that RAM is higher than when you do a random read within a smaller part of RAM, say 400MB. And no, the L1, L2 and L3 caches are not the reason for that.
RAM has become really slow on cheap systems such as the quad-socket Opteron here. Getting 8 bytes randomly out of RAM takes between 300 and 320 nanoseconds; 307 ns on the system here.
I tested that with my own benchmarking application. If you want it, I can email it to you; it's open source. I wrote it to test SSIs of supercomputers.
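I won't paste the whole tool here, but a minimal sketch of the same kind of measurement, a dependent random pointer chase over a configurable buffer, looks like this (buffer sizes and iteration counts below are arbitrary; adjust them to your RAM, and note that building the chain for the big buffer itself takes a while):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average nanoseconds per dependent random load over a buffer of
 * n_elems pointers. Each load depends on the previous one, so
 * prefetching cannot hide the latency. */
static double chase_ns(size_t n_elems, size_t steps)
{
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) return -1.0;

    /* Sattolo's algorithm: one single random cycle over all elements. */
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    volatile size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(next);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}

int main(void)
{
    srand(42);
    /* Compare a ~400MB working set with a ~10GB one. */
    printf("~400MB buffer: %.0f ns/load\n",
           chase_ns(50UL * 1024 * 1024, 20 * 1000 * 1000));
    printf("~10GB  buffer: %.0f ns/load\n",
           chase_ns(1280UL * 1024 * 1024, 20 * 1000 * 1000));
    return 0;
}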
Even simple old file systems can already reach the full bandwidth of any hardware, both read and write, as this process is not random but has been bandwidth-optimized for both I/O and CPU.
That's just wrong. It's not about the file system being able to use
the full bandwidth of the hardware, it's about the file system
reducing the amount of I/O required so the hardware can perform
more work with the same amount of physical resources. Unless you
were mis-explaining what you mean.
You're assuming stupid software that doesn't know what it can cache here. My software has its own caches, which are of course faster than the page cache of the OS.
So every time I use the words READ or WRITE to the file system, I really mean to physical disk :)
When the final set has been generated, what will happen to it is some sort of super-compression. Then it will fit on SSD hardware easily.
After that it will only be used for reads during searches, so all that matters then is the random read latency.
That's a very, very specialized case that doesn't apply to the vast
majority of applications.
Name me one petabyte-scale database that needs more writes than reads, or even where it is "on par". Nearly all big storage is for applications that do overwhelmingly more reads than writes.
This is also largely true for most databases that do not fit in RAM.
Not at all. Not by a long way. While I agree that database reads
usually outnumber the writes by a factor of 100:1, most of those
reads never hit the disk. For most decently tuned databases, 90%+
of reads are served from caches, and most of the work is performed
before even looking at data tables (usually in page caches), as the
record sets are resolved from the index data (generally in RAM,
unless performance really isn't a concern).
Ignore the caches, please. Just look at the number of READS to disk and WRITES to disk. The number of reads to disk totally overwhelms the number of writes. In most applications this is mathematically provable, by the way.
The number of reads is so overwhelmingly bigger that with SSDs you basically care most about random read speed, of course.
SSDs yield impressively fast boot up times and operation while
caches are cold. And page cache latency is still some 2000x faster
than SSD latency (50ns vs 100us).
You're making the wrong assumption that you can improve my caching system; you want to tell the guy who has spent the past 15 years doing everything to design better caching systems that he should cache better.
I'm amazed at how you focus on one detail here. That detail has already been solved.
The bottleneck REALLY is the random read latency to disk and nothing else :)
Now, you have a point that random write speed is important in many applications; however, it can be a few factors worse than random read speed, as long as it isn't phenomenally weaker.
Unless your system is tuned to the point where most reads come from
page caches.
You have no idea with whom you're dealing, sir.
I am more interested in metrics for how much writing is
required relative to the amount of data being transferred. For
example, if I am restoring a full running system (call it 5GB)
from a tar ball onto nilfs2, ext2, ext3, btrfs, etc., I am
interested in how many blocks worth of writes actually hit the
disk, and to a lesser extent how many of those end up being
merged together (since merged operations, in theory, can cause
less wear on an SSD because bigger blocks can be handled more
efficiently if erasing is required).
The most efficient block size for SSDs is 8 channels of 4KB blocks.
I'm not going to bite and get involved in debating the correctness of this (somewhat limited) view. I'll just point out that it bears very little relevance to the paragraph that it appears to be responding to.
Don't act arrogant.
To say it in a manner that guys with 100 IQ points less than me understand: if you're doing random writes using the 8 independent 4KB channels, you'll basically hit the full bandwidth of the SSD.
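If you want to verify that yourself, one way is to fire aligned 4KB random writes from several threads with O_DIRECT, so the page cache doesn't mask the device. A rough sketch; the path, thread count, test area and write count are placeholders, and it will scribble over the target file:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define IO_SIZE   4096
#define N_THREADS 8
#define N_WRITES  10000
#define SPAN      (1024UL * 1024 * 1024)          /* 1GB test area */

/* Placeholder: point this at a scratch file on the SSD under test
 * (O_DIRECT will not work on tmpfs). */
static const char *path = "/mnt/ssd/scratch.bin";

static void *writer(void *arg)
{
    unsigned seed = (unsigned)(size_t)arg;
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    void *buf = NULL;
    if (posix_memalign(&buf, IO_SIZE, IO_SIZE)) { /* O_DIRECT needs alignment */
        close(fd);
        return NULL;
    }

    for (int i = 0; i < N_WRITES; i++) {
        off_t off = ((off_t)rand_r(&seed) % (SPAN / IO_SIZE)) * IO_SIZE;
        if (pwrite(fd, buf, IO_SIZE, off) != IO_SIZE)
            perror("pwrite");
    }
    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t[N_THREADS];                        /* build with -pthread */
    for (size_t i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (size_t i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    puts("done; watch iostat -x while this runs to see the write rate");
    return 0;
}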
Except you don't get 8 channels on your interface to the SSD. All
you are talking about here is the fact that the SSD might be using
8 flash chips in RAID0, which is less relevant. The number of
channels also varies wildly across products (the current line of
Intel X25-M drives has a 10-channel design). But this still doesn't
take away from the fact that random writes are difficult for SSDs.
Switch off the write caching on your SSD (hdparm -W0) and see what
kind of a performance hit you get. Since you are claiming that SSDs
don't have issues with random writes, how do you explain that?
I'm claiming that random write speed, though relevant, is far less relevant than random read speed.
You focus only on random write speed here, whereas most software has already optimized its writing at the software level, wherever possible, to stream it sequentially to disk; so there is no need to do that at the filesystem level.
What really matters, as we both agree, is that there shouldn't be too big a gap (say a factor of 100) between random write speed and random read speed. But a write speed that is a few times slower is quite OK.
The only reason they are better at managing this random write
deficiency on the current generation of drives is because they are
doing some serious write re-ordering and physical/logical re-
mapping to linearize the writes.
Have a look here for more info on this, conceptually if not product-
wise:
http://www.managedflash.com/index.htm
If you were right and it wasn't an issue, ingenious hacks like this
wouldn't help. While I'm slightly skeptical about the net benefit
of this for the latest generation of SSDs (I haven't tried it yet),
it is clear that older drives extract considerable benefit from it.
I'd prefer the price of SSDs to go down rather than the write speed to get faster :)
But the original point I was making in the original paragraph this
has been spawned from is about how many writes a file system
requires to make the data stick, after all the journaling, metadata
and superblock writes are accounted for. Essentially, for writing
1000 files, which file system requires fewest writes to the disk.
While this may not be an issue for expensive SSDs with good wear
leveling, it is certainly an issue for applications that use cheap
disk-like media (CF, SD, etc.) that may not have as advanced a wear
leveling algorithm in its firmware, thus making avoidance of
unnecessary writes all the more important.
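A crude way to compare file systems on exactly that metric is to snapshot the per-device write counters in /proc/diskstats around the workload (say, untarring the same tree onto each candidate file system); the device name below is a placeholder:

#include <stdio.h>
#include <string.h>

/* Return sectors written for the given block device, or -1 on error.
 * Fields after the device name in /proc/diskstats: reads completed,
 * reads merged, sectors read, ms reading, writes completed, writes
 * merged, sectors written, ... */
static long long sectors_written(const char *dev)
{
    FILE *f = fopen("/proc/diskstats", "r");
    if (!f) return -1;

    char name[64];
    int major, minor;
    long long rd_ios, rd_merges, rd_sec, rd_ticks, wr_ios, wr_merges, wr_sec;
    long long result = -1;

    while (fscanf(f, "%d %d %63s %lld %lld %lld %lld %lld %lld %lld %*[^\n]",
                  &major, &minor, name, &rd_ios, &rd_merges, &rd_sec,
                  &rd_ticks, &wr_ios, &wr_merges, &wr_sec) == 10) {
        if (strcmp(name, dev) == 0) { result = wr_sec; break; }
    }
    fclose(f);
    return result;
}

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "sda";   /* placeholder device */
    long long before = sectors_written(dev);
    printf("run the workload now, then press Enter...\n");
    getchar();
    long long after = sectors_written(dev);
    printf("%lld sectors (~%lld KiB) written to %s\n",
           after - before, (after - before) / 2, dev);
    return 0;
}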
What will be most important is that all the different threads that write to the I/O are fast.
Where you tend to believe a memory access takes 50 ns, that's completely wrong. Even on a 2-socket Nehalem system, the fastest access to RAM (say a 2GB buffer) with 8 cores going at the same time is roughly 70 ns, and then you only have 8 bytes; in reality you want quite a bit more than 8 bytes. On quad-socket hardware it is in fact far over 300 nanoseconds just to get 8 bytes.
So it's definitely a lot slower than you guess.
The real problem for the file system, when all cores are busy doing something, will be that all the cores must message each other to invalidate cache lines and so on. Cache snooping, etc. That's really ugly slow.
So it is very important not to set up a data structure where the CPU is nonstop busy with this. If it has to do that a couple of hundred times, then you also have a significant penalty (say 30-100 us) just for updating the file system. Where this might be peanuts on a system where little I/O gets done, it's a useless loss of time.
Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html