Re: Some thoughts regarding the new store

On 05/27/2015 04:46 AM, Haomai Wang wrote:
On Wed, May 27, 2015 at 4:41 PM, Li Wang <liwang@xxxxxxxxxxxxxxx> wrote:
I have just noticed the new store development and had a look at the
idea behind it (http://www.spinics.net/lists/ceph-devel/msg22712.html).
My understanding is that we want to avoid the double-write penalty of
the write-ahead-logging journal mechanism; the straightforward approach
is to optimize CREATE, APPEND and FULL-OBJECT-OVERWRITE by writing into
new files directly, then updating the metadata in a transaction. Other
changes include: moving the object metadata from filesystem extended
attributes into a key/value database, and mapping an object into
possibly multiple files.
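
[Editor's note: a minimal sketch of the write-new-file-then-commit-metadata
flow described above. All names are invented and a simple in-memory map
stands in for the key/value database; this is not NewStore's actual code.]

#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the metadata key/value database (e.g. RocksDB).
static std::map<std::string, std::string> kv_db;

// Apply a batch of key/value updates "atomically" (a real store would
// use something like a RocksDB WriteBatch here).
void kv_commit(const std::vector<std::pair<std::string, std::string>>& batch) {
  for (const auto& kv : batch)
    kv_db[kv.first] = kv.second;
}

// CREATE / APPEND / full-object overwrite: write the data into a
// brand-new file, then point the object's metadata at it in a single
// commit, so the data itself is never written twice.
bool write_object(const std::string& oid, const std::string& data) {
  std::string fname = "obj-" + oid + ".new";          // fresh backing file
  std::ofstream f(fname, std::ios::binary | std::ios::trunc);
  if (!f.write(data.data(), data.size()))
    return false;
  f.close();                        // a real store would fsync before the commit

  // Metadata update: object -> file mapping plus size, committed together.
  kv_commit({{"meta/" + oid + "/file", fname},
             {"meta/" + oid + "/size", std::to_string(data.size())}});
  return true;
}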

If my understanding is correct, then there seem to be several issues:

1 Garbage collection is needed to reclaim orphan files left behind by
a crash;

Yes, but we haven't dived into this problem yet, because currently
newstore only allows one file per object.

Anyway, I guess GC isn't a big problem. Journal keys should help; is
there something I missed here?
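
[Editor's note: one possible reading of "journal keys should help" is an
intent-key scheme along the lines of the sketch below. This is only a
guess at the idea, with invented names and an in-memory map standing in
for the key/value store; it is not NewStore's actual mechanism.]

#include <cstdio>      // std::remove
#include <map>
#include <string>

static std::map<std::string, std::string> kv_db;     // stand-in KV store

// Before creating a new backing file, record the intent.
void log_intent(const std::string& fname) {
  kv_db["intent/" + fname] = "pending";
}

// When the object metadata commit succeeds, the intent is dropped in
// the same transaction, so a surviving intent key marks an orphan.
void clear_intent(const std::string& fname) {
  kv_db.erase("intent/" + fname);
}

// Crash recovery: unlink every file whose intent key was never cleared.
void gc_orphans() {
  for (auto it = kv_db.begin(); it != kv_db.end(); ) {
    if (it->first.rfind("intent/", 0) == 0) {         // key starts with "intent/"
      std::remove(it->first.substr(7).c_str());        // best-effort unlink
      it = kv_db.erase(it);
    } else {
      ++it;
    }
  }
}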


2 On spinning disks, it loses the advantage of the journal, which turns
random writes into sequential writes, commits them in groups, and can
leverage another disk to hide the commit latency.


We need to clarify something here: for small random write workloads,
newstore still needs a journal for durability and shorter latency.

Although FileJournal uses write-ahead logging to improve performance,
the journal lives far away from the data location on disk (a separate
partition or preallocated file). We always need to write the data to
disk eventually, and I think the seek distance is long. For newstore,
in my ideal scenario the journal and the data could live in the same
allocation group, in local-filesystem terms (it may be difficult
though), just like an ideal fragment implementation. In other words, a
fragment should be something that aggregates small writes, but we
haven't made that work as intended yet.
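
[Editor's note: a loose sketch of the point above: small random
overwrites take a write-ahead path through the key/value store, while
appends and large writes go straight to the backing file. The threshold
and names are illustrative only, not newstore's actual logic.]

#include <cstddef>
#include <fstream>
#include <map>
#include <string>

static std::map<std::string, std::string> kv_db;        // stand-in KV store
static const std::size_t WAL_THRESHOLD = 64 * 1024;     // made-up cutoff

// Apply a write directly to the (already existing) backing file.
void write_to_file(const std::string& oid, std::size_t off,
                   const std::string& data) {
  std::fstream f("obj-" + oid, std::ios::in | std::ios::out | std::ios::binary);
  f.seekp(off);
  f.write(data.data(), data.size());
}

// Small overwrites are logged in the KV store first (durable, low
// latency) and applied to the file later; appends and large writes go
// straight to the file.
void queue_write(const std::string& oid, std::size_t off,
                 const std::string& data) {
  if (data.size() < WAL_THRESHOLD)
    kv_db["wal/" + oid + "/" + std::to_string(off)] = data;
  else
    write_to_file(oid, off, data);
}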

Although newstore's random write performance is currently worse than
filestore's, I think that is not a design limitation. We still have
lots of improvements we could apply.

FWIW, newstore was looking as good as or better than filestore for RBD random writes in the last set of tests I did:

http://nhm.ceph.com/newstore/8c8c5903_rbd_rados_tests.pdf


3 OVERWRITE theoretically does not benefit from this design, and the
introduction of fragments increases the object metadata overhead. The
possible mapping to multiple files may also slow down object read/write
performance. OVERWRITE is the major scenario for RBD and, consequently,
for cloud environments.

Yes, we need to handle this. Actually, for mapping one object to
multiple files we don't have a design yet (@sage yes? or did I miss
it?). We could think of a solution that makes a tradeoff :-)

This is the biggest issue holding us back right now imho. If you look at the linked graphs above, the only place we are really significantly behind filestore is on semi-large partial object overwrites. I suspect we'll have to create fragments down to some size (maybe 512k?). There was some discussion about all of this a couple of weeks ago at the weekly perf meeting.
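
[Editor's note: a back-of-the-envelope sketch of the fragment idea: with
a fixed fragment size (512 KB is only the guess mentioned above), a
partial overwrite needs to rewrite just the fragments it touches rather
than the whole object. Purely illustrative.]

#include <cstddef>
#include <cstdio>

static const std::size_t FRAG_SIZE = 512 * 1024;   // hypothetical fragment size

// Print which fragments an overwrite of [off, off+len) would touch.
void fragments_touched(std::size_t off, std::size_t len) {
  std::size_t first = off / FRAG_SIZE;
  std::size_t last  = (off + len - 1) / FRAG_SIZE;
  std::printf("overwrite of %zu bytes at offset %zu touches fragments %zu..%zu\n",
              len, off, first, last);
}

int main() {
  fragments_touched(1 * 1024 * 1024, 64 * 1024);    // 64 KB write inside a 4 MB object
  fragments_touched(0, 4 * 1024 * 1024);            // full-object overwrite
  return 0;
}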




4 By mapping an object into multiple files, we can potentially optimize
OVERWRITE by turning it into APPEND as well, using small fragments,
which actually mimics Btrfs. However, for many small writes this may
leave many small files in the backend local file system, which may slow
down object read/write performance, especially on spinning disks. More
importantly, I think it goes, to some extent, against the philosophy of
object storage, which uses a big object to store data to reduce the
metadata cost and leaves the block management to the local file system.
For a local file system, big-file performance is generally better than
small-file performance. If we introduce fragments, it looks like the
object store itself now cares about object data allocation.
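
[Editor's note: a sketch of the read-amplification concern above: once
an object is split into many small fragment files, a single logical
read may have to open and seek in several of them. The fragment naming
and layout here are invented for illustration.]

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>

static const std::size_t FRAG_SIZE = 512 * 1024;   // hypothetical fragment size

// Read [off, off+len) of an object whose data is spread across small
// fragment files named "<oid>.frag-<index>".
std::string read_object(const std::string& oid, std::size_t off, std::size_t len) {
  std::string out;
  while (len > 0) {
    std::size_t idx   = off / FRAG_SIZE;             // which fragment file
    std::size_t inner = off % FRAG_SIZE;             // offset inside it
    std::size_t n     = std::min(len, FRAG_SIZE - inner);

    std::ifstream f(oid + ".frag-" + std::to_string(idx), std::ios::binary);
    f.seekg(inner);
    std::string buf(n, '\0');
    f.read(&buf[0], n);                              // one open + seek per fragment
    out += buf;

    off += n;
    len -= n;
  }
  return out;
}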

What is the community's opinion?

Large partial overwrites in newstore are pretty expensive. I suspect we'll both need to improve how rocksdb handles its WAL and introduce at least semi-decently sized fragments. A simpler alternative might be to reduce the default RBD block size and try to optimize for that case. In the report I linked above there are rados bench tests at different object sizes to try to get an idea of how rbd performance at different block sizes might be bounded.


Anyway, I think the core goal is to make newstore better than
filestore for most workloads.

I agree. I think it's already showing a significant enough improvement in enough cases that it's worth continuing to invest in.



Cheers,
Li Wang