On 15/11/2009 2:05 PM, Laszlo Nagy wrote:

>> A change has been written to the WAL and fsync()'d, so Pg knows it's hit
>> disk. It can now safely apply the change to the tables themselves, and
>> does so, calling fsync() to tell the drive containing the tables to
>> commit those changes to disk.
>>
>> The drive lies, returning success for the fsync when it's just cached
>> the data in volatile memory. Pg carries on, shortly deleting the WAL
>> archive the changes were recorded in or recycling it and overwriting it
>> with new change data. The SSD is still merrily buffering data to write
>> cache, and hasn't got around to writing your particular change yet.
>>
> All right. I believe you. In the current Pg implementation, I need to
> turn off the disk cache.

That's certainly my understanding. I've been wrong many times before :S

> #1. The user wants to change something, resulting in a write_to_disk(data) call.
> #2. The data is written into the WAL and fsync()-ed.
> #3. At this point the write_to_disk(data) call CAN RETURN; the user can
> continue his work (the WAL is already written, so changes cannot be lost).
> #4. Pg can continue writing the data onto the disk, and fsync() it.
> #5. Then the WAL archive data can be deleted.
>
> Now maybe I'm wrong, but between #3 and #5, the data to be written is
> kept in memory. This is basically a write cache, implemented in OS
> memory. We could really handle it like a write cache. E.g. everything
> would remain the same, except that we add some latency. We can wait some
> time after the last modification of a given block, and then write it out.

I don't know enough about the whole affair to give you a good explanation
(I tried, and it just showed me how much I didn't know), but here are a
few issues:

- Pg doesn't know the erase block sizes or positions. It can't group
writes up by erase block except by hoping that, within a given file,
writing in page order will get the blocks to the disk in roughly
erase-block order.
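As an aside, the #1-#5 ordering Laszlo describes can be sketched in a few
lines of toy Python. This is only an illustration of the ordering, not
Pg's actual code; the write_to_disk() function and file layout here are
invented for the sketch:

```python
import os
import tempfile

def write_to_disk(wal_path, table_path, data):
    """Toy sketch of the WAL-first ordering (steps #1-#5 above).
    Invented for illustration; nothing like PostgreSQL's real code."""
    # #2: append the change to the WAL and fsync() it first.
    with open(wal_path, "ab") as wal:
        wal.write(data)
        wal.flush()
        os.fsync(wal.fileno())
    # #3: the call could safely return to the user here -- the change is
    # durable in the WAL even if the table write below never happens.

    # #4: apply the change to the table file and fsync() it too.
    with open(table_path, "ab") as table:
        table.write(data)
        table.flush()
        os.fsync(table.fileno())

    # #5: only now is it safe to recycle the WAL segment -- and only if
    # the fsync() above really pushed the data to stable storage, which
    # is exactly the promise a lying write cache breaks.
    os.truncate(wal_path, 0)

d = tempfile.mkdtemp()
wal, tbl = os.path.join(d, "wal"), os.path.join(d, "table")
open(wal, "wb").close(); open(tbl, "wb").close()
write_to_disk(wal, tbl, b"change-1")
print(os.path.getsize(wal), os.path.getsize(tbl))  # prints "0 8"
```

The whole scheme leans on the fsync() in step #4 being truthful: if the
drive acks it from volatile cache, step #5 destroys the only durable copy.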
So your write caching isn't going to do anywhere near as good a job as
the SSD's own cache can.

- The only way to make this help the SSD out much would be to use a LOT
of RAM for the write cache and maintain a LOT of WAL archives. That's RAM
not being used for caching read data. The large number of WAL archives
also means incredibly long WAL replay times after a crash.

- You still need a reliable way to tell the SSD "really flush your cache
now" after you've flushed the changes from your huge chunks of WAL files
and are getting ready to recycle them.

I was thinking that write ordering would be an issue too, as some changes
in the WAL would hit the main disk before others that were earlier in the
WAL. However, I don't think that matters if full_page_writes is on. If
you replay from the start, you'll re-apply some changes with older
versions, but they'll be corrected again by a later WAL record. So
ordering during WAL replay shouldn't be a problem. On the other hand, the
INCREDIBLY long WAL replay times during recovery would be a nightmare.

> I don't think that any SSD drive has more than some
> megabytes of write cache.

The big, lots-of-$$ ones have HUGE battery-backed caches for exactly this
reason.

> The same amount of write cache could easily be
> implemented in OS memory, and then Pg would always know what hit the disk.

Really? How does Pg know in what order the SSD writes things out from its
cache?

--
Craig Ringer

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
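P.S. The full_page_writes argument can be made concrete with a toy model
(invented for illustration -- this is nothing like Pg's real WAL record
format): when every record carries a complete page image, replay just
overwrites pages, so the final state doesn't depend on which cached
writes had already reached the platters before the crash.

```python
# Toy WAL: each record is (page_no, complete page image), as with
# full_page_writes = on. Hypothetical format, not PostgreSQL's.
wal_records = [
    (0, b"A1"),  # older image of page 0
    (1, b"B1"),
    (0, b"A2"),  # a later record overwrites page 0 again
]

def replay(pages, records):
    # Replaying from the start may re-apply stale page images, but each
    # later record simply overwrites them, so the result is the same no
    # matter what subset of writes was already on disk.
    for page_no, image in records:
        pages[page_no] = image
    return pages

# Crash case 1: none of the table writes made it to disk.
print(replay({}, wal_records))
# Crash case 2: a *later* write hit disk while an earlier one was still
# stuck in the drive's cache.
print(replay({0: b"A2"}, wal_records))
# Both print {0: b'A2', 1: b'B1'} -- the same state either way.
```

The same convergence holds for any starting subset of on-disk pages,
which is why ordering during replay shouldn't be the problem; the replay
*time* is.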