Re: 2 related bluestore questions

On 10.05.2016 15:53, Sage Weil wrote:
On Tue, 10 May 2016, Igor Fedotov wrote:
Hi Sage,
Please find my comments below.

WRT 1. there is an alternative approach that doesn't need a persistent refmap.
It works for non-shared bnodes only, though. In fact one can build such a refmap
from the onode's lextent map pretty easily. It looks like any procedure that
requires such a refmap has a logical offset as an input. This provides an
appropriate lextent referring to the blob we need the refmap for. What we need to
do to build the blob's refmap is to enumerate the lextents within a +-max_blob_size
range of the original loffset. I suppose we are going to avoid small lextent
entries most of the time by merging them, so such an enumeration should be short
enough. Most probably such a refmap build is needed for the background wal procedure
(or its replacement - see below), thus it wouldn't affect primary write path
performance. And this procedure will require some neighboring lextent
enumeration to detect lextents to merge anyway.
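
A rough sketch of that enumeration, using simplified illustrative types rather than the actual bluestore_lextent_t/bluestore_blob_t structures:

// Illustrative sketch only: simplified stand-ins for the onode lextent map,
// not the actual bluestore_lextent_t/bluestore_blob_t structures.
#include <cstdint>
#include <map>

struct lextent_t {
  int64_t  blob_id;      // which blob this logical extent points at
  uint32_t blob_offset;  // offset of the referenced range inside the blob
  uint32_t length;       // length of the referenced range
};

// onode lextent map: logical offset -> lextent
using lextent_map_t = std::map<uint64_t, lextent_t>;

// Build a refmap (blob offset -> number of referencing lextents) for the blob
// referenced at 'loffset' by enumerating lextents within +-max_blob_size.
std::map<uint32_t, uint32_t>
build_refmap(const lextent_map_t& lmap, uint64_t loffset, uint64_t max_blob_size)
{
  std::map<uint32_t, uint32_t> refmap;

  // find the lextent at or just before loffset to learn which blob we target
  auto hit = lmap.upper_bound(loffset);
  if (hit == lmap.begin())
    return refmap;                       // nothing maps at or before loffset
  --hit;
  int64_t target_blob = hit->second.blob_id;

  // scan the neighbors within +-max_blob_size of loffset
  uint64_t lo = loffset > max_blob_size ? loffset - max_blob_size : 0;
  for (auto p = lmap.lower_bound(lo);
       p != lmap.end() && p->first <= loffset + max_blob_size;
       ++p) {
    if (p->second.blob_id == target_blob)
      ++refmap[p->second.blob_offset];   // the real refmap would track ranges
  }
  return refmap;
}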

Actually I don't have a strong opinion on which approach is better. Just a minor
point that tracking a persistent refmap is a bit more complex and space
consuming.
Yeah, that's my only real concern--and mostly on the memory allocation
side, less so on the size of the encoded metadata.  Since the alternative
only works in the non-shared bnode case, I think it'll be simpler to only
implement one approach for now, and consider optimizing later, since we'd
have to implement the share-capable approach either way.  (For example,
most blobs will have one reference for their full range; we could probably
represent this as an empty map with a bit of care.)
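
A tiny illustration of that convention - hypothetical field names, not the real bluestore_blob_t encoding:

// Hypothetical sketch of the convention, not the real bluestore_blob_t:
// an empty ref_map means "one reference covering the blob's full range",
// so the common case needs no per-range entries in memory or on disk.
#include <cstdint>
#include <map>

struct blob_ref_t {
  uint32_t length = 0;                   // blob length
  std::map<uint32_t, uint32_t> ref_map;  // offset -> refcount, shared cases only

  bool single_full_ref() const { return ref_map.empty(); }
};
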
So the initial approach is to have a refmap, right?
WRT 2. IMO single-byte granularity is OK. Initial write request handling
can create lextents of any size depending on the input data blocks, but we
will try to eliminate small lextents during wal processing to get larger
extents and better space usage.
Ok cool.
WRT WAL changes. My idea is to replace the WAL with a somewhat different extent merge
(defragmenter, garbage collector, space optimizer - whatever name you
prefer) process. The main difference: the current WAL implementation tracks some
user data and is thus part of the consistency model (i.e. one has to check
whether a data block is in the WAL). In my approach data is always consistent without
such a service. At the first write handling step we always write data to the
store by allocating a new blob and modifying the lextent map, and apply the
corresponding checksum by the regular means if needed. Thus we always have
consistent data in the lextent/blob structures.

The defragmenter process is then just a cleanup/optimization thread that merges
sparse lextents to improve space utilization. To avoid full lextent map enumeration
during defragmentation, the ExtentManager (or whatever entity handles writes)
may return some 'hints' about where space optimization should be applied. This is
to be done at initial write processing. Such a hint is most probably just a logical
offset or some interval within the object's logical space. The write handler provides
such a hint if it detects (by lextent map inspection) that optimization is required,
e.g. in case of a partial lextent overwrite, a big hole punch, sparse small lextents,
etc. Pending optimization tasks (the list of hints) are maintained by BlueStore and
passed to the EM (or another corresponding entity) for processing in the context
of a specific thread. Based on such hints the defragmenter locates lextents to
merge and does the job: read/modify/write multiple lextents and/or blobs.
Optionally this can be done with some delay to absorb write bursts within a
specific object region.

Another point is that the hint list can potentially be tracked without the KV store
(some in-memory data structure is enough), as there is no mandatory need to replay
it in case of an OSD failure - data is always consistent in the store, and a failure
can only lead to some local space inefficiency. That's a rare case though.
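
A rough structural sketch of that in-memory hint list plus background optimizer thread - all names here are illustrative, not existing BlueStore/ExtentManager interfaces:

// Illustrative sketch of an in-memory hint queue consumed by a background
// defragmenter thread; names are made up, this is not BlueStore code.
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>

struct opt_hint_t {
  uint64_t object_id;  // stand-in for the object identity
  uint64_t loffset;    // logical offset of the region worth optimizing
  uint64_t length;     // length of that region
};

class Optimizer {
  std::mutex lock;
  std::condition_variable cond;
  std::deque<opt_hint_t> hints;  // purely in-memory: losing it on failure only
                                 // costs some local space efficiency
  bool stopping = false;
  std::thread worker;

public:
  Optimizer() : worker([this] { run(); }) {}
  ~Optimizer() {
    {
      std::lock_guard<std::mutex> l(lock);
      stopping = true;
    }
    cond.notify_all();
    worker.join();
  }

  // Called from the write path when lextent map inspection shows a partial
  // overwrite, big hole punch, sparse small lextents, etc.
  void queue_hint(const opt_hint_t& h) {
    {
      std::lock_guard<std::mutex> l(lock);
      hints.push_back(h);
    }
    cond.notify_one();
  }

private:
  void run() {
    std::unique_lock<std::mutex> l(lock);
    while (!stopping) {
      if (hints.empty()) {
        cond.wait(l);      // could use wait_for(...) to delay and absorb bursts
        continue;
      }
      opt_hint_t h = hints.front();
      hints.pop_front();
      l.unlock();
      process(h);          // read/modify/write the affected lextents/blobs
      l.lock();
    }
  }

  void process(const opt_hint_t& h) {
    // Enumerate lextents around [h.loffset, h.loffset + h.length), merge
    // sparse ones into a new blob and commit the metadata update.
    // Details omitted: this is only a structural sketch.
    (void)h;
  }
};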

What do you think about this approach?
My concern is that it makes a simple overwrite less IO efficient because
you have to (1) write a new (temporary-ish) blob, (2) commit the kv
transaction, and then (3) write an updated/merged blob, then (4) commit
the kv txn for the new blob.
Yes, that's true. But there are some concerns about the WAL case as well:
1) Are you sure that writing a larger KV record (metadata + user data) is better than a direct data write to the store plus a smaller KV (metadata-only) update?

2) Either WAL records will grow or we need to have both the WAL and the optimizer simultaneously - especially for the compressed case. As far as I understand, a WAL record currently carries up to block_size bytes of user data. With the blob introduction this rises to max_blob_size (N * min_alloc_size), or we'll need to maintain both the WAL and the optimizer.
E.g. there is a lextent 0~256K and an overwrite 1K~254K, block size = 4K:
- for the no-checksum, no-compression case WAL records are 2 * 3K;
- for the checksum case WAL records are 2 * (max(csum_block_size, block_size) - 1K);
- for the compression case WAL records are 2 * (max(max_blob_size, block_size) - 1K), or we do that temporary blob allocation.
(A worked version of this example is sketched after point 3 below.)

3) A WAL apply locks subsequent reads until its completion, i.e. a subsequent read has to wait until the WAL apply is done (the o->flush() call in _do_read()). With the optimizer approach the lock can be postponed, since the optimizer doesn't need to perform the task immediately.
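
The worked version of the point-2 example, under the assumption that 1K of old data survives at each end of the overwritten region and with csum_block_size/max_blob_size values picked purely for illustration:

// Worked numbers for the example above (illustrative only): lextent 0~256K
// overwritten at 1K~254K, so 1K of old data survives at each end and every
// boundary WAL record carries (rmw_unit - 1K) bytes of new data.
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t K = 1024;
  const uint64_t block_size      = 4 * K;
  const uint64_t csum_block_size = 8 * K;   // assumed value, for illustration
  const uint64_t max_blob_size   = 64 * K;  // assumed value, for illustration
  const uint64_t residue = 1 * K;           // old data kept at each boundary

  uint64_t plain = 2 * (block_size - residue);
  uint64_t csum  = 2 * (std::max(csum_block_size, block_size) - residue);
  uint64_t comp  = 2 * (std::max(max_blob_size, block_size) - residue);

  std::printf("no csum/compression: 2 * 3K = %llu bytes\n",
              (unsigned long long)plain);   // 6144
  std::printf("checksum (8K csum blocks): 2 * 7K = %llu bytes\n",
              (unsigned long long)csum);    // 14336
  std::printf("compression (64K blobs): 2 * 63K = %llu bytes\n",
              (unsigned long long)comp);    // 129024
}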

And if I understand the proposal correctly any
overwrite is still off-limits because you can't do the overwrite IO
atomically with the kv commit.  Is that right?
Could you please elaborate - not sure I understand the question.
Making the wal part of the consistency model is more complex, but it means
we can (1) log our intent to overwrite atomically with the kv txn commit,
and then (2) do the async overwrite.  It will get a bit more complex
because we'll be doing metadata updates as part of the wal completion, but
it's not a big step from where we are now, and I think the performance
benefit will be worth it.
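
A generic sketch of the two-phase intent-log pattern being described - the types and helpers here are hypothetical, not the actual BlueStore/KeyValueDB API:

// Generic sketch of the two-phase intent-log pattern; the types and helpers
// are hypothetical stand-ins, not the BlueStore/KeyValueDB interfaces.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct KVTxn {                          // stand-in for a batched kv transaction
  std::map<std::string, std::string> sets;
  std::vector<std::string> removes;
};

static std::map<std::string, std::string> kv_store;  // stand-in for the kv db

void commit(const KVTxn& txn) {         // pretend-atomic commit of the batch
  for (auto& kv : txn.sets) kv_store[kv.first] = kv.second;
  for (auto& k : txn.removes) kv_store.erase(k);
}

struct wal_op_t {
  uint64_t seq;          // sequence number used as the intent key
  uint64_t blob_offset;  // where the deferred overwrite lands
  std::string data;      // bytes to write
};

// Phase 1: log the intent atomically with the metadata updates already known.
void queue_overwrite(KVTxn& txn, const wal_op_t& op) {
  txn.sets["wal." + std::to_string(op.seq)] = op.data;  // intent record
  // ... onode/lextent updates that are already known would be added here ...
  commit(txn);
}

// Phase 2 (async): do the overwrite IO, then drop the intent and apply the
// remaining metadata updates in a second transaction.
void apply_wal(const wal_op_t& op) {
  // write_to_device(op.blob_offset, op.data);  // actual block IO, omitted
  KVTxn cleanup;
  cleanup.removes.push_back("wal." + std::to_string(op.seq));
  // ... metadata updates performed as part of wal completion go here ...
  commit(cleanup);
}
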
May I have some example of how it's supposed to work, please?

I think we'll still want a gc/cleanup/optimizer async process like you
describe, but it can be driven by wal hints or whatever other mechanism we
like.

sage



