On 11/19/09 1:04 PM, "Greg Smith" <greg@xxxxxxxxxxxxxxx> wrote:

> That won't help. Once the checkpoint is done, the problem isn't just
> that the WAL segments are recycled. The server isn't going to use them
> even if they were there. The reason why you can erase/recycle them is
> that you're doing so *after* writing out a checkpoint record that says
> you don't have to ever look at them again. What you'd actually have to
> do is hack the server code to insert that delay after every fsync--there
> are none that you can cheat on and not introduce a corruption
> possibility. The whole WAL/recovery mechanism in PostgreSQL doesn't
> make a lot of assumptions about what the underlying disk has to actually
> do beyond the fsync requirement; the flip side to that robustness is
> that it's the one you can't ever violate safely.

Yeah, I guess it's not so easy. Having the system "hold" one extra
checkpoint's worth of segments and then, during recovery, always replay
that previous one plus the current might work, but I don't know if that
could cause corruption. I assume replaying a log twice won't, so
replaying the N-1 checkpoint, then the current one, might work. If so,
that would be a cool feature -- as long as the N-2 checkpoint is no
longer in the OS or I/O hardware caches when checkpoint N completes,
you're safe! It's probably more complicated though, especially with
respect to things like MVCC on DDL changes.

> Right. It's not used like the write-cache on a regular hard drive,
> where they're buffering 8MB-32MB worth of writes just to keep seek
> overhead down. It's there primarily to allow combining writes into
> large chunks, to better match the block size of the underlying SSD
> flash cells (128K). Having enough space for two full cells allows
> spooling out the flash write to a whole block while continuing to
> buffer the next one.
> This is why turning the cache off can tank performance so badly--you're
> going to be writing a whole 128K block no matter what if it's forced to
> disk without caching, even if it's just to write an 8K page to it.

As others mentioned, flash must erase a whole block at once, but it can
write sequentially to a block in much smaller chunks. I believe that MLC
and SLC differ a bit here; SLC can write smaller subsections of the
erase block. A little old but still very useful:
http://research.microsoft.com/apps/pubs/?id=63596

> That's only going to reach 1/16 of the usual write speed on single page
> writes. And that's why you should also be concerned at whether
> disabling the write cache impacts the drive longevity, lots of small
> writes going out in small chunks is going to wear flash out much faster
> than if the drive is allowed to wait until it's got a full sized block
> to write every time.

This is still a concern: even if the SLC cells are technically capable
of writing sequentially in smaller chunks, with the write cache off they
may not do so.

> The fact that the cache is so small is also why it's harder to catch the
> drive doing the wrong thing here. The plug test is pretty sensitive to
> a problem when you've got megabytes worth of cached writes that are
> spooling to disk at spinning hard drive speeds. The window for loss on
> a SSD with no seek overhead and only a moderate number of KB worth of
> cached data is much, much smaller. Doesn't mean it's gone though. It's
> a shame that the design wasn't improved just a little bit; a cheap
> capacitor and blocking new writes once the incoming power dropped is all
> it would take to make these much more reliable for database use. But
> that would raise the price, and not really help anybody but the small
> subset of the market that cares about durable writes.

Yup. There are manufacturers who claim no data loss on power failure;
hopefully these become more common.
http://www.wdc.com/en/products/ssd/technology.asp?id=1

I still contend it's a lot safer than a hard drive. I have not seen one
fail yet (out of about 150 heavy-use drive-years on X25-Ms). Any system
that does not have a battery-backed write cache will be faster and safer
with an SSD, write cache on, than with hard drives, write cache on. BBU
caching is not fail-safe either: batteries wear out, cards die or
malfunction.

If you need maximum data integrity, you will probably go with a
battery-backed cache RAID setup, with or without SSDs. If you don't go
that route, SSDs seem like the best option. The "middle ground" of
software RAID on hard drives with their write caches off doesn't seem
useful to me at all. I can't think of one use case that isn't better
served by a slightly cheaper array of disks with a hardware BBU card (if
the data is important or the data size is large) OR a set of SSDs (if
performance is more important than data safety).

>> 4: Yet another solution: The drives DO adhere to write barriers
>> properly. A filesystem that used these in the process of fsync() would
>> be fine too. So XFS without LVM or MD (or the newer versions of those
>> that don't ignore barriers) would work too.
>
> If I really trusted anything beyond the very basics of the filesystem to
> really work well on Linux, this whole issue would be moot for most of
> the production deployments I do. Ideally, fsync would just push out the
> minimum of what's needed, it would call the appropriate write cache
> flush mechanism the way the barrier implementation does when that all
> works, and life would be good. Alternately, you might even switch to
> using O_SYNC writes instead, which on a good filesystem implementation
> are both accelerated and safe compared to write/fsync (I've seen that
> work as expected on Veritas VxFS for example).
>
> We could all move to OpenSolaris where that stuff does work right...
;) I think a lot of what makes ZFS slower for some tasks is that it
correctly implements and uses write barriers...

> Meanwhile, in the actual world we live in, patches that make writes more
> durable by default are dropped by the Linux community because they tank
> performance for too many types of loads, I'm frightened to turn on
> O_SYNC at all on ext3 because of reports of corruption on the lists
> here, fsync does way more work than it needs to, and the way the
> filesystem and block drivers have been separated makes it difficult to
> do any sort of device write cache control from userland. This is why I
> try to use the simplest, best tested approach out there whenever
> possible.

Oh, I hear you :) At least ext4 looks like an improvement for the
RHEL6/CentOS6 timeframe. Checksums are handy.

Many of my systems don't need the highest data reliability, though. A
RAID 0 of X25-Ms will be much, much safer than the same thing on regular
hard drives, and faster. I'm putting a few of those in one system soon
(yes, the M version, and I won't put WAL on it). Two such drives kick
the crap out of anything else for the price when performance is most
important and the data is just a copy of something stored in a much
safer place than any single server. Previously on such systems, a
caching RAID card would be needed for performance, but without a BBU the
data-loss risk is very high (much higher than an SSD with caching on --
a 256K versus a 512MB cache!). And an SSD costs less than the RAID card.
As long as the total data size isn't too big, they work well. And even
then, some tablespaces can be put on a large HD, leaving the more
critical ones on the SSD.

I estimate the likelihood of complete data loss from a 2-SSD RAID 0 as
about the same as from a 4-disk RAID 5 of hard drives. There is a big
difference between a couple of corrupted files and a lost drive... I
have recovered postgres systems with corruption by reindexing and
restoring single tables from backups.
When one drive in a stripe is lost, or a pair in a RAID 10 goes down,
all is lost. I wonder -- has anyone seen an Intel SSD randomly die like
a hard drive?

I'm still trying to get an "M" to wear out by writing about 120GB a day
to it for a year. But rough calculations show that I'm likely years from
trouble... By then I'll have upgraded to the gen 3 or 4 drives.

> --
> Greg Smith 2ndQuadrant Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@xxxxxxxxxxxxxxx www.2ndQuadrant.com

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance