Hi Sage,
Please find my comments below.
WRT 1: there is an alternative approach that doesn't need a persistent
refmap. It works for non-shared bnodes only, though. In fact, one can
build such a refmap from the onode's lextent map pretty easily. It looks
like any procedure that requires such a refmap takes a logical offset as
an input. This gives us an appropriate lextent referring to the blob we
need the refmap for. To build the blob's refmap, we then need to
enumerate lextents within a +-max_blob_size range of the original
loffset. I suppose we are going to avoid small lextent entries most of
the time by merging them, so such an enumeration should be short enough.
Most probably such a refmap build is needed only for the background WAL
procedure (or its replacement, see below), so it wouldn't affect primary
write path performance. And that procedure will require some neighboring
lextent enumeration to detect lextents to merge anyway.
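A rough sketch of that on-the-fly refmap build, using hypothetical
simplified types (not the actual BlueStore structures; the real lextent
map and blob layout differ):

```cpp
#include <cstdint>
#include <map>

// Hypothetical simplified types for illustration only.
struct LExtent {
  uint64_t blob_id;   // which blob this logical extent points to
  uint64_t blob_off;  // offset within the blob
  uint64_t length;    // extent length in bytes
};

// onode's lextent map: logical offset -> extent
using LExtentMap = std::map<uint64_t, LExtent>;
// refmap: blob offset -> reference count
using RefMap = std::map<uint64_t, unsigned>;

// Build a refmap for the blob referenced at 'loffset' by scanning only
// lextents within +-max_blob_size of it, as suggested above.
RefMap build_blob_refmap(const LExtentMap& lmap, uint64_t loffset,
                         uint64_t max_blob_size) {
  RefMap refs;
  auto it = lmap.upper_bound(loffset);
  if (it == lmap.begin())
    return refs;              // nothing maps at or before loffset
  --it;                       // lextent covering (or preceding) loffset
  uint64_t blob = it->second.blob_id;

  // Scan the +-max_blob_size neighborhood for lextents on the same blob.
  uint64_t lo = loffset > max_blob_size ? loffset - max_blob_size : 0;
  uint64_t hi = loffset + max_blob_size;
  auto p = lmap.lower_bound(lo);
  if (p != lmap.begin())
    --p;                      // catch an extent straddling 'lo'
  for (; p != lmap.end() && p->first <= hi; ++p)
    if (p->second.blob_id == blob)
      refs[p->second.blob_off] += 1;
  return refs;
}
```

The point of the sketch is that the scan cost is bounded by the number of
lextents in a +-max_blob_size window, not by the size of the whole map.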
Actually I don't have a strong opinion on which approach is better. Just
a minor point that tracking a persistent refmap is a bit more complex
and space consuming.
WRT 2: IMO single-byte granularity is OK. Initial write request handling
can create lextents of any size depending on the input data blocks, but
we will try to eliminate small ones during WAL processing to get larger
extents and better space usage.
WRT the WAL changes: my idea is to replace the WAL with a somewhat
different extent merge process (defragmenter, garbage collector, space
optimizer, whatever name you prefer). The main difference is that the
current WAL implementation tracks some user data and is thus part of the
consistency model (i.e. one has to check whether a data block is in the
WAL). In my approach data is always consistent without such a service:
at the first write handling step we always write data to the store by
allocating a new blob and modifying the lextent map, and apply the
corresponding checksum using regular means if needed. Thus we always
have consistent data in the lextent/blob structures, and the
defragmenter process is just a cleanup/optimization thread that merges
sparse lextents to improve space utilization.

To avoid full lextent map enumeration during defragmentation, the
ExtentManager (or whatever entity handles writes) may return some
'hints' indicating where space optimization should be applied. This is
done at initial write processing. Such a hint is most probably just a
logical offset or some interval within the object's logical space. The
write handler provides a hint if it detects (by lextent map inspection)
that optimization is required, e.g. in case of a partial lextent
overwrite, a big hole punch, sparse small lextents, etc. Pending
optimization tasks (a list of hints) are maintained by BlueStore and
passed to the EM (or another corresponding entity) for processing in the
context of a specific thread. Based on such hints the defragmenter
locates lextents to merge and does the job: read/modify/write of
multiple lextents and/or blobs. Optionally this can be done with some
delay to accommodate write bursts within a specific object region.

Another point is that the hint list can potentially be tracked without
the KV store (some in-memory data structure is enough), as there is no
mandatory need to replay it after an OSD failure: data is always
consistent in the store, and a failure can lead only to some local space
inefficiency. That's a rare case though.
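A minimal sketch of such an in-memory hint list, assuming a hint is just
an interval [offset, offset+length) in the object's logical space (class
and method names are illustrative, not an actual BlueStore/EM API):

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>

// Illustrative in-memory hint list; nothing here persists to KV.
class HintList {
  std::mutex lock;
  std::deque<std::pair<uint64_t, uint64_t>> hints;  // (offset, length)
public:
  // Called by the write handler when it spots e.g. a partial overwrite.
  void push(uint64_t off, uint64_t len) {
    std::lock_guard<std::mutex> g(lock);
    // Merge with the last hint if the intervals overlap, to keep the
    // list short under write bursts to one region.
    if (!hints.empty()) {
      auto& last = hints.back();
      if (off <= last.first + last.second && last.first <= off + len) {
        uint64_t end = std::max(last.first + last.second, off + len);
        last.first = std::min(last.first, off);
        last.second = end - last.first;
        return;
      }
    }
    hints.emplace_back(off, len);
  }
  // Called by the defragmenter thread; returns false when empty.
  bool pop(uint64_t* off, uint64_t* len) {
    std::lock_guard<std::mutex> g(lock);
    if (hints.empty())
      return false;
    *off = hints.front().first;
    *len = hints.front().second;
    hints.pop_front();
    return true;
  }
};
```

Losing this structure on OSD failure is harmless, per the argument
above: the store stays consistent and we only forgo some optimization.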
What do you think about this approach?
Thanks,
Igor
On 09.05.2016 21:31, Sage Weil wrote:
1. In 7fb649a3800a5653f5f7ddf942c53503f88ad3f1 I added an extent_ref_map_t
to the blob_t. This lets us keep track, for each blob, of references to
the logical blob extents (in addition to the raw num_refs that just counts
how many lextent_t's point to us). It will let us make decisions about
deallocating unused portions of the blob that are no longer referenced
(e.g., when we are uncompressed). It will also let us sanely reason
about whether we can write into the blob's allocated space that is not
referenced (e.g., past end of object/file, but within a min_alloc_size
chunk).
The downside is that it's a bit more metadata to maintain. OTOH, we need
it in many cases, and it would be slow/tedious to create it on the fly.
I think yes, though some minor changes to the current extent_ref_map_t are
needed, since it currently has weird assumptions about empty meaning a ref
count of 1.
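For illustration, a toy ref map with explicit counts and no "empty means
refcount 1" special case might look like this (fixed illustrative chunk
granularity; not the actual extent_ref_map_t code):

```cpp
#include <cstdint>
#include <map>

// Toy per-extent reference counting inside a blob, tracked at a fixed
// chunk granularity for simplicity. Counts are always explicit.
struct ToyExtentRefMap {
  static constexpr uint64_t chunk = 4096;  // illustrative granularity
  std::map<uint64_t, unsigned> refs;       // chunk offset -> refcount

  void get(uint64_t off, uint64_t len) {
    for (uint64_t o = off / chunk * chunk; o < off + len; o += chunk)
      refs[o] += 1;
  }
  void put(uint64_t off, uint64_t len) {
    for (uint64_t o = off / chunk * chunk; o < off + len; o += chunk) {
      auto it = refs.find(o);
      if (it != refs.end() && --it->second == 0)
        refs.erase(it);  // region no longer referenced at all
    }
  }
  // True if nothing references [off, off+len): that part of the blob
  // could be deallocated, or written into safely.
  bool is_unreferenced(uint64_t off, uint64_t len) const {
    for (uint64_t o = off / chunk * chunk; o < off + len; o += chunk)
      if (refs.count(o))
        return false;
    return true;
  }
};
```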
2. Allow lextent_t's to be byte-granularity.
For example, if we write 10 bytes into the object, we'd have a blob of
min_alloc_size, and an lextent_t that indicates [0,10) points to that
blob.
The upside here is that truncate and zero are trivial updates to the
lextent map and never need to do any IO--we just punch holes in our
mapping.
The downside is that we might get odd mappings like
0: 0~10->1
4000: 4000~96->1
after a hole (10~3990) has been punched, and we may need to piece the
mapping back together. I think we will need most of this complexity
(e.g., merging adjacent lextents that map to adjacent regions of the same
blob) anyway.
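For illustration, the hole punch above can be sketched against a
simplified byte-granularity lextent map (toy types, not the real code);
punching [10, 4000) out of a 4096-byte extent yields exactly the
0: 0~10 and 4000: 4000~96 pieces from the example:

```cpp
#include <cstdint>
#include <map>

// Toy byte-granularity lextent map: logical offset -> extent.
struct LExt { uint64_t blob, blob_off, len; };
using LMap = std::map<uint64_t, LExt>;

// Punch a hole [off, off+len) by trimming/splitting affected lextents;
// a metadata-only update, no IO.
void punch_hole(LMap& m, uint64_t off, uint64_t len) {
  uint64_t end = off + len;
  auto it = m.lower_bound(off);
  if (it != m.begin())
    --it;  // the previous extent may straddle 'off'
  while (it != m.end() && it->first < end) {
    uint64_t e_off = it->first, e_end = e_off + it->second.len;
    LExt e = it->second;
    if (e_end <= off) { ++it; continue; }  // entirely before the hole
    it = m.erase(it);
    if (e_off < off)   // keep the head piece before the hole
      m[e_off] = LExt{e.blob, e.blob_off, off - e_off};
    if (e_end > end)   // keep the tail piece after the hole
      m[end] = LExt{e.blob, e.blob_off + (end - e_off), e_end - end};
  }
}
```

The merge direction (re-joining adjacent lextents that point at adjacent
blob regions) is essentially the inverse walk over the same map.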
Hmm, there is probably some other downside but now I can't think of a good
reason not to do this. It'll basically put all of the onus on the write
code to do the right thing... which is probably a good thing.
Yes?
Also, one note on the WAL changes: we'll need to have any read portion of
a wal event include the raw pextents *and* the associated checksum(s).
This is because the events need to be idempotent and may overwrite the
read region, or interact with wal ops that come before/after.
sage