Re: 2 related bluestore questions

Hi Sage,
Please find my comments below.

WRT 1, there is an alternative approach that doesn't need a persistent refmap, though it works for non-shared bnodes only. One can build such a refmap from the onode's lextent map fairly easily. Any procedure that requires the refmap takes a logical offset as input, and that offset gives us a lextent referring to the blob we need the refmap for. To build the blob's refmap we then enumerate the lextents within +-max_blob_size of the original loffset. Since we try to avoid small lextent entries most of the time by merging them, that enumeration should be short. Most probably such a refmap build is needed only by the background wal procedure (or its replacement - see below), so it wouldn't affect primary write path performance. And that procedure will require some neighboring lextent enumeration to detect merge candidates anyway.
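To make the idea concrete, here is a minimal C++ sketch of the on-the-fly refmap build. The Lextent/LextentMap/RefMap types and the blob_id field are simplified stand-ins for illustration, not the real BlueStore structures:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified types; the real BlueStore structures differ.
struct Lextent {
  uint64_t blob_id;     // which blob this logical extent points at
  uint64_t blob_offset; // offset within the blob
  uint64_t length;
};

// onode's lextent map: logical offset -> lextent
using LextentMap = std::map<uint64_t, Lextent>;

// ref map for one blob: blob offset -> referenced length
using RefMap = std::map<uint64_t, uint64_t>;

// Build a ref map for the blob referenced at 'loffset' by enumerating
// lextents within +-max_blob_size of that offset.  Coarse sketch: it
// does not merge contiguous regions or handle extents straddling 'lo'.
RefMap build_refmap(const LextentMap& lmap, uint64_t loffset,
                    uint64_t max_blob_size) {
  RefMap refmap;
  auto it = lmap.upper_bound(loffset);
  if (it == lmap.begin()) return refmap;  // nothing at or before loffset
  --it;
  uint64_t target_blob = it->second.blob_id;
  uint64_t lo = loffset > max_blob_size ? loffset - max_blob_size : 0;
  uint64_t hi = loffset + max_blob_size;
  for (auto p = lmap.lower_bound(lo); p != lmap.end() && p->first < hi; ++p) {
    if (p->second.blob_id == target_blob)
      refmap[p->second.blob_offset] += p->second.length;
  }
  return refmap;
}
```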

Actually I don't have a strong opinion on which approach is better; just a minor point that tracking a persistent refmap is a bit more complex and space-consuming.

WRT 2, IMO single-byte granularity is OK. Initial write request handling can create lextents of any size depending on the input data blocks, but we will try to eliminate small ones during wal processing to get larger extents and better space usage.

WRT the WAL changes: my idea is to replace WAL with a somewhat different extent merge process (defragmenter, garbage collector, space optimizer - whatever name you prefer). The main difference is that the current WAL implementation tracks some user data and is thus part of the consistency model (i.e. one has to check whether a data block is in the WAL). In my approach data is always consistent without such a service: at the first write handling step we always write data to the store by allocating a new blob and modifying the lextent map, and apply the corresponding checksum by regular means if needed. Thus we always have consistent data in the lextent/blob structures, and the defragmenter is just a cleanup/optimization thread that merges sparse lextents to improve space utilization.

To avoid full lextent map enumeration during defragmentation, the ExtentManager (or whatever entity handles writes) may return some 'hints' on where space optimization should be applied. This happens during initial write processing. Such a hint is most probably just a logical offset, or some interval within the object's logical space. The write handler provides a hint if it detects (by lextent map inspection) that optimization is required, e.g. on a partial lextent overwrite, a big hole punch, sparse small lextents, etc. Pending optimization tasks (the list of hints) are maintained by BlueStore and passed to the EM (or another corresponding entity) for processing in the context of a specific thread. Based on such hints the defragmenter locates lextents to merge and does the job: read/modify/write of multiple lextents and/or blobs. Optionally this can be done with some delay to absorb write bursts within a specific object region.

Another point is that the hint list can potentially be tracked without the KV store (some in-memory data structure is enough), as there is no mandatory need to replay it on OSD failure - data is always consistent in the store, and a failure leads only to some local space inefficiency. That's a rare case though.
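As a rough illustration of the in-memory hint tracking, something like the following would do; all names here (DefragHint, DefragQueue) are hypothetical, just sketching the idea under the assumptions above:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>

// A hint is a logical interval within an object where the write path
// detected that lextent merging would pay off (partial overwrite,
// hole punch, sparse small lextents).
struct DefragHint {
  uint64_t object_id; // stand-in for a real object reference
  uint64_t loffset;
  uint64_t length;
};

// In-memory only: losing this queue on OSD failure costs some local
// space efficiency, never data consistency.
class DefragQueue {
  std::deque<DefragHint> hints;
  std::mutex lock;
public:
  // called from the write handler when it spots an optimization candidate
  void push(const DefragHint& h) {
    std::lock_guard<std::mutex> l(lock);
    hints.push_back(h);
  }
  // called from the defragmenter thread; empty optional means no work
  std::optional<DefragHint> pop() {
    std::lock_guard<std::mutex> l(lock);
    if (hints.empty()) return std::nullopt;
    DefragHint h = hints.front();
    hints.pop_front();
    return h;
  }
};
```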

What do you think about this approach?

Thanks,
Igor

On 09.05.2016 21:31, Sage Weil wrote:
1. In 7fb649a3800a5653f5f7ddf942c53503f88ad3f1 I added an extent_ref_map_t
to the blob_t.  This lets us keep track, for each blob, of references to
the logical blob extents (in addition to the raw num_refs that just counts
how many lextent_t's point to us).  It will let us make decisions about
deallocating unused portions of the blob that are no longer referenced
(e.g., when we are uncompressed).  It will also let us sanely reason
about whether we can write into the blob's allocated space that is not
referenced (e.g., past end of object/file, but within a min_alloc_size
chunk).
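For readers following along, the per-blob bookkeeping being described could look roughly like this. This is a deliberately simplified sketch, not the actual extent_ref_map_t (which splits and merges overlapping regions); here regions must match exactly:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Toy per-blob extent reference map: tracks which regions of the blob's
// logical extent space are referenced, and by how many lextents.
struct ExtentRefMap {
  // blob offset -> (length, refcount)
  std::map<uint64_t, std::pair<uint64_t, uint32_t>> ref_map;

  // take a reference on [offset, offset+length)
  void get(uint64_t offset, uint64_t length) {
    auto it = ref_map.find(offset);
    if (it != ref_map.end() && it->second.first == length)
      ++it->second.second;
    else
      ref_map[offset] = {length, 1};
  }

  // drop a reference; returns true if the region hit zero refs and its
  // backing space could now be deallocated
  bool put(uint64_t offset, uint64_t length) {
    auto it = ref_map.find(offset);
    assert(it != ref_map.end() && it->second.first == length);
    if (--it->second.second == 0) {
      ref_map.erase(it);
      return true;
    }
    return false;
  }

  // is any part of [offset, offset+length) still referenced?  If not,
  // writing into that allocated-but-unreferenced space is safe.
  bool intersects(uint64_t offset, uint64_t length) const {
    for (const auto& p : ref_map)
      if (p.first < offset + length && p.first + p.second.first > offset)
        return true;
    return false;
  }
};
```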

The downside is that it's a bit more metadata to maintain.  OTOH, we need
it in many cases, and it would be slow/tedious to create it on the fly.

I think yes, though some minor changes to the current extent_ref_map_t are
needed, since it currently has weird assumptions about empty meaning a ref
count of 1.

2. Allow lextent_t's to be byte-granularity.

For example, if we write 10 bytes into the object, we'd have a blob of
min_alloc_size, and an lextent_t that indicates [0,10) points to that
blob.

The upside here is that truncate and zero are trivial updates to the
lextent map and never need to do any IO--we just punch holes in our
mapping.

The downside is that we might get odd mappings like

  0: 0~10->1
  4000: 4000~96->1

after a hole (10~3990) has been punched, and we may need to piece the
mapping back together.  I think we will need most of this complexity
(e.g., merging adjacent lextents that map to adjacent regions of the same
blob) anyway.
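The adjacent-lextent merge described here could be sketched as follows; Lex/LMap are simplified hypothetical types, merging only entries that are contiguous both in logical space and within the same blob:

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical simplified lextent: (blob id, offset in blob, length).
struct Lex { uint64_t blob, boff, len; };
using LMap = std::map<uint64_t, Lex>; // logical offset -> lextent

// Coalesce neighbors that are adjacent in logical space AND map to
// adjacent regions of the same blob.
void merge_adjacent(LMap& m) {
  auto it = m.begin();
  while (it != m.end()) {
    auto next = std::next(it);
    if (next == m.end()) break;
    bool logical_adj = it->first + it->second.len == next->first;
    bool blob_adj = it->second.blob == next->second.blob &&
                    it->second.boff + it->second.len == next->second.boff;
    if (logical_adj && blob_adj) {
      it->second.len += next->second.len;  // absorb the neighbor
      m.erase(next);
    } else {
      it = next;
    }
  }
}
```

On the hole-punch example above (0: 0~10->1, 4000: 4000~96->1), nothing merges because the entries are not logically adjacent; once an overwrite fills the gap with data in the same blob, the pieces coalesce back into one lextent.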

Hmm, there is probably some other downside but now I can't think of a good
reason not to do this.  It'll basically put all of the onus on the write
code to do the right thing... which is probably a good thing.

Yes?


Also, one note on the WAL changes: we'll need to have any read portion of
a wal event include the raw pextents *and* the associated checksum(s).
This is because the events need to be idempotent and may overwrite the
read region, or interact with wal ops that come before/after.
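A sketch of the shape such a wal event's read portion might take; the names and the toy checksum are illustrative assumptions, not the real encoding (BlueStore would use crc32c or similar):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical: the read part of a wal event carries the raw physical
// extents *and* the checksums captured when the event was queued.
struct Pextent { uint64_t offset, length; };

struct WalReadPart {
  std::vector<Pextent> extents;
  std::vector<uint32_t> csums; // one checksum per extent, at queue time
};

// toy checksum, for illustration only
uint32_t toy_csum(const std::vector<uint8_t>& data) {
  uint32_t c = 0;
  for (uint8_t b : data) c = c * 131u + b;
  return c;
}

// On replay: the recorded checksum lets us verify the re-read bytes
// against what the event expected, keeping replay idempotent even when
// a neighboring wal op has already touched the region.
bool replay_read_matches(uint32_t recorded_csum,
                         const std::vector<uint8_t>& reread) {
  return toy_csum(reread) == recorded_csum;
}
```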

sage



