Hi Xiaoxi,

On Mon, 25 Mar 2013, Chen, Xiaoxi wrote:
> From ceph -w, Ceph reports a very high op rate (10000+ /s), but
> technically 80 spindles can provide at most 150*80/2 = 6000 IOPS for
> 4K random writes.
>
> When digging into the code, I found that the OSD writes data to the
> page cache and then returns. It does call ::sync_file_range, but this
> syscall doesn't actually sync the data to disk before it returns; it's
> an async call. So the situation is that random writes are extremely
> fast, since they only go to the journal and the page cache, but once a
> sync starts it takes a very long time. There is a speed gap between the
> journal and the OSDs, so the amount of data that needs to be synced
> keeps increasing, and it will certainly exceed 600s.

The sync_file_range is only there to push things to disk sooner, so that
the eventual syncfs(2) takes less time. When the async flushing is
enabled, there is a limit to the number of flushes that are in the queue,
but if it hits the max it just does

    dout(10) << "queue_flusher ep " << sync_epoch << " fd " << fd << " "
             << off << "~" << len
             << " qlen " << flusher_queue_len
             << " hit flusher_max_fds " << m_filestore_flusher_max_fds
             << ", skipping async flush" << dendl;

Can you confirm that the filestore is taking this path? (Set
debug filestore = 10 and then reproduce.)

You may want to try

    filestore flusher = false
    filestore sync flush = true

and see if that changes things--it will make the sync_file_range() happen
inline after the write.

Anyway, it sounds like you may be queueing up so many random writes that
the sync takes forever. I've never actually seen that happen, so if we
can confirm that's what is going on that will be very interesting.

Thanks-
sage

> For more information, I have tried to reproduce this with rados bench,
> but failed.
>
> Could you please let me know if you need any more information or have
> a solution? Thanks
>
> Xiaoxi
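
As a rough sanity check on the numbers in this thread, here is a minimal sketch of the arithmetic. It is not Ceph code. The function names are made up for illustration, the divisor 2 in 150*80/2 is read here as a replication factor (an assumption), and the backlog model simply assumes a constant ingest rate and a constant drain rate.

```cpp
#include <cassert>

// Rough upper bound on sustained random-write IOPS for a set of spindles.
// per_disk_iops (150) and replicas (2) mirror the 150*80/2 estimate above;
// treating the "/2" as a replication factor is an assumption.
double spindle_iops_bound(int spindles, double per_disk_iops, int replicas) {
    assert(spindles > 0 && per_disk_iops > 0 && replicas > 0);
    return spindles * per_disk_iops / replicas;
}

// Ops accepted into the journal and page cache but not yet on disk after
// `seconds`, assuming constant ingest and drain rates (hypothetical model).
double backlog_ops(double ingest_iops, double drain_iops, double seconds) {
    double gap = ingest_iops - drain_iops;
    return gap > 0 ? gap * seconds : 0.0;
}

// Time needed to flush a backlog at the drain rate if no new writes arrive.
double drain_seconds(double backlog, double drain_iops) {
    assert(drain_iops > 0);
    return backlog / drain_iops;
}
```

With the figures quoted above (10000 ops/s reported against a 6000 IOPS bound), one minute of writes would leave 240000 ops unflushed under this model, which would take 40 seconds to drain even if writes stopped completely. The real behavior depends on page cache writeback and on how much the flusher has already pushed out, so treat this only as an order-of-magnitude estimate.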