Ceph write path optimization

Hi,
I finally have a working prototype and have been able to gather some performance comparison data for the changes I was talking about in the last performance meeting. Mark's suggestion of a write-up has been pending for a while, so here is an attempt to summarize what I am trying to do.

Objectives:
-----------

1. Saturate SSD write bandwidth with Ceph + FileStore.
     Most all-flash Ceph deployments so far (as far as I know) keep both data and journal on the same SSD. The SSDs are far from saturated, and Ceph's write performance is dismal compared to what the hardware can deliver. Can we improve that?

2. Ceph write performance is not stable in most cases; can we get stable performance most of the time?


Findings/Optimization so far..
------------------------------------

1. I saw that in a flash environment you need to reduce filestore_max_sync_interval a lot (from the default 5 min), and with that the benefit of coalescing writes and flushing them with syncfs goes away (see the config sketch after this list).

2. We have some logic to determine the max sequence number the filestore can commit, and that logic adds some latency (>1 ms or so).

3. That delay fills up the journal quickly once I remove all the throttles from the filestore/journal.

4. The existing throttle scheme is very difficult to tune.

5. In the case of write-ahead journaling, the commit (op_seq) file is probably redundant, since we can get the last committed seq number from the journal header at the next OSD start. Given that we have to reduce the sync interval, this extra write only adds to write amplification (plus one extra fsync). (See the second sketch after this list.)
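
To make finding 1 concrete, the tuning I am talking about is along these lines in ceph.conf (the numbers here are purely illustrative, not recommendations):

    [osd]
        filestore_max_sync_interval = 1     # pulled way down from the default for flash
        filestore_min_sync_interval = 0.1   # illustrative; keeps the sync cycles short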
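
And to illustrate finding 5: the replay start point can come straight from the journal header read at mount time, so the separate commit seq file (and its fsync) is not needed. A minimal C++ sketch of the idea, with made-up names (journal_header_t, committed_up_to) standing in for whatever the journal actually records:

    #include <cstdint>
    #include <cstdio>

    // Stand-in for the on-disk journal header; the real one carries more fields.
    struct journal_header_t {
      uint64_t committed_up_to = 0;  // last op seq known to be applied to the data disk
    };

    // With the commit/op_seq file gone, this is all the replay logic needs to start from.
    uint64_t replay_start_seq(const journal_header_t& h) {
      return h.committed_up_to + 1;  // everything <= committed_up_to is already in the filestore
    }

    int main() {
      journal_header_t h;
      h.committed_up_to = 12345;     // pretend this was just read from the journal device
      std::printf("replay journal entries from seq %llu onwards\n",
                  (unsigned long long)replay_start_seq(h));
      return 0;
    }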

The existing scheme is well suited to an HDD environment, but probably not to flash. So I made the following changes:

1. First, I removed the extra commit seq file write and changed the journal replay logic accordingly.

2. Each filestore op thread now does an O_DSYNC write followed by posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) (see the first sketch after this list).

3. I derived an algorithm that each worker thread executes to determine the max seq up to which it can trim the journal (see the second sketch after this list).

4. I introduced a new throttle scheme that throttles journal writes based on the percentage of journal space left (see the third sketch after this list).

5. I saw that this scheme definitely empties the journal faster and is able to saturate the SSD more.

6. But even when we are not saturating any resource, if both data and journal are on the same drive, both kinds of writes suffer added latency.

7. With the journal separated out to a different disk, the same code (and also the stock code) runs faster. I am not sure of the exact reason, but it is something to do with the underlying layer; still investigating.

8. Now, if we want to separate out the journal, an SSD is *not an option*. The reason is that after some point we will be limited by the SSD's bandwidth, and funneling the writes of N OSDs onto that SSD will wear it out very fast. It would also be a very expensive solution, considering the cost of a high-end journal SSD.

9. So I started experimenting with a small PCIe NVRAM partition (128 MB) as the journal. With ~4 GB of NVRAM we can then host ~32 OSD journals (and NVRAM durability is much higher). With the stock code as is (without throttle), the performance becomes very spiky, for the obvious reason that such a small journal fills up very quickly.

10. But with the above-mentioned changes, I am able to get consistently high performance most of the time.

11. I am also trying the existing syncfs codebase (without the op_seq file) plus the throttle scheme I mentioned, in this setup, to see whether we can get a stable performance improvement out of it or not. This is still under investigation.

12. The initial benchmark with a single OSD (no replication) looks promising; you can find the draft here:

       https://docs.google.com/document/d/1QYAWkBNVfSXhWbLW6nTLx81a9LpWACsGjWG8bkNwKY8/edit?usp=sharing

13. I still need to try this out with an increasing number of OSDs.

14. I also need to see how much this scheme helps when data and journal are on the same SSD.

15. The main challenge I am facing with both schemes is that the XFS metadata flush process (xfsaild) chokes all the processes accessing the disk when it wakes up. I can delay it to at most 30 seconds, and if there is a lot of dirty metadata there is a brief downward performance spike when it runs. Even when we acknowledge writes from, say, the NVRAM journal, the op threads are still doing getattrs against XFS, and those threads get blocked. I tried ext4 and the problem is not there, since it writes metadata synchronously by default, but the overall performance of ext4 is much lower. I am not a filesystem expert, so any help on this is much appreciated.
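
To make the changes above a bit more concrete, here are rough C++ sketches (illustrative only, not the actual patch). First, change 2, the op-thread write path: the object file is opened with O_DSYNC so the data is stable when the write returns, and the now-clean pages are dropped from the page cache so syncfs has nothing left to flush for them:

    #include <fcntl.h>
    #include <unistd.h>

    // One object write from a filestore op thread (error handling trimmed for brevity).
    ssize_t durable_write(const char* path, const void* buf, size_t len, off_t off) {
      int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
      if (fd < 0)
        return -1;
      ssize_t r = pwrite(fd, buf, len, off);           // durable on return because of O_DSYNC
      posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);    // drop the (now clean) cached pages
      close(fd);
      return r;
    }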
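
Next, change 3, the per-worker trim decision. The invariant is simply that the journal must not be trimmed past the oldest sequence number any op thread still has in flight. The bookkeeping below is my own simplification (one atomic slot per worker); the real code also has to account for ops that are queued but not yet picked up:

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <limits>
    #include <vector>

    // One slot per filestore op thread; 0 means "nothing in flight for this worker".
    // A worker publishes the seq it is applying before it starts and writes 0 when done.
    std::vector<std::atomic<uint64_t>> in_flight_seq(16);

    // Called by a worker that has just durably applied everything up to my_last_done:
    // the journal can be trimmed up to the oldest in-flight seq minus one.
    uint64_t max_trimmable_seq(uint64_t my_last_done) {
      uint64_t oldest = std::numeric_limits<uint64_t>::max();
      for (auto& slot : in_flight_seq) {
        uint64_t v = slot.load(std::memory_order_acquire);
        if (v != 0)
          oldest = std::min(oldest, v);
      }
      // Nothing in flight anywhere: everything this worker has applied is trimmable.
      return (oldest == std::numeric_limits<uint64_t>::max()) ? my_last_done : oldest - 1;
    }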
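
And change 4, the space-based throttle: journal submissions back off progressively as the percentage of free journal space shrinks, instead of relying on the existing ops/bytes throttles. The thresholds and delays below are illustrative only:

    #include <chrono>
    #include <cstdint>
    #include <thread>

    // Delay a journal submission based on how much journal space is still free.
    void throttle_on_journal_space(uint64_t journal_size, uint64_t journal_used) {
      double free_pct = 100.0 * double(journal_size - journal_used) / double(journal_size);
      if (free_pct > 50.0)
        return;                                                       // plenty of room, no throttling
      if (free_pct > 20.0) {
        std::this_thread::sleep_for(std::chrono::microseconds(100));  // gentle back-off
        return;
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));      // nearly full, stall harder
    }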

Mark,
If we have time, we can discuss these results in tomorrow's performance meeting.

Thanks & Regards
Somnath

