Re: Disperse volume : Sequential Writes

On Tue, Jul 4, 2017 at 1:39 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Pranith,

On 03/07/17 05:35, Pranith Kumar Karampuri wrote:
Ashish, Xavi,
       I think it is better to implement this change as a separate
read-after-write caching xlator which we can load between EC and the
client xlator. That way EC will not get a lot more functionality than
necessary, and maybe this xlator can be used somewhere else in the stack.

While this seems a good way to separate functionalities, it has a big problem: if we add a caching xlator between ec and *all* of its subvolumes, it will only be able to cache encoded data. So when ec needs the "cached" data, it will still need to issue a request to each of its subvolumes and compute the decoded data before being able to use it, so we don't avoid the decoding overhead.

Also, if we want to make the xlator generic, it will probably cache a lot more data than ec really needs, increasing the memory footprint considerably for no real benefit.

Additionally, this new xlator will need to guarantee that the cached data is current, so it will either need its own locking logic (which would be another copy and paste of the logic already present in one of the current xlators), which is slow and difficult to maintain, or it will need to intercept and reuse locking calls from parent xlators, which can be quite complex since we have multiple xlator levels where locks can be taken, not only ec.

This is a relatively simple change to make inside ec, but a very complex one (IMO) if we want to do it as a stand-alone xlator that is generic enough to be reused and to work safely in other places of the stack.

If we want to separate functionalities I think we should create a new concept of xlator which is transversal to the "traditional" xlator stack.

Current xlators are linear in the sense that each one operates at only one place (it can be moved by reconfiguration, but once instantiated, it always works at the same place) and passes data to the next one.

A transversal xlator (or maybe "service xlator" would be a better name) would be one not bound to any place of the stack, but usable by all other xlators to implement some service, like caching, multithreading, locking, etc. These are features that many xlators need but cannot use easily (nor efficiently) if they are implicitly implemented in some specific place of the stack outside their control.

The transaction framework we already talked about could be thought of as one of these service xlators. Multithreading could also benefit from this approach, because xlators would have more control over which operations can be processed by a background thread and which cannot. Probably there are other features that could benefit from this approach too.

In the case of brick multiplexing, if some xlators are removed from each stack and loaded as global services, the memory footprint will most probably be lower and resource usage better optimized.

I like the service xlator approach, but I don't think we have enough time to make it operational in the short term. Let us go with implementing this feature in EC for now. I didn't realize the extra cost of decoding when I thought about the separation, so I guess we will stick to the old idea.
 

Just an idea...

Xavi


On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspandey@xxxxxxxxxx> wrote:


    I think it should be done as we have agreement on basic design.

    ------------------------------------------------------------------------
    *From: *"Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
    *To: *"Xavier Hernandez" <xhernandez@xxxxxxxxxx>
    *Cc: *"Ashish Pandey" <aspandey@xxxxxxxxxx>, "Gluster Devel"
    <gluster-devel@xxxxxxxxxxx>
    *Sent: *Friday, June 16, 2017 3:50:09 PM
    *Subject: *Re: Disperse volume : Sequential Writes




    On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez
    <xhernandez@xxxxxxxxxx> wrote:

        On 16/06/17 10:51, Pranith Kumar Karampuri wrote:



            On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez
            <xhernandez@xxxxxxxxxx> wrote:

                On 15/06/17 11:50, Pranith Kumar Karampuri wrote:



                    On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey
                    <aspandey@xxxxxxxxxx> wrote:

                        Hi All,

                        We have been facing some issues in disperse (EC)
                        volumes. We know that EC is currently not good
                        for random IO, as it requires a READ-MODIFY-WRITE
                        fop cycle if offset and offset+length fall in the
                        middle of a stripe.

                        Unfortunately, it can also happen with sequential
                        writes. Consider an EC volume with configuration
                        4+2. The stripe size for this would be 512 * 4 =
                        2048, that is, 2048 bytes of user data stored in
                        one stripe. Let's say 2048 + 512 = 2560 bytes
                        have already been written on this volume; 512
                        bytes would be in the second stripe. Now, if
                        there is a sequential write at offset 2560 of
                        size 1 byte, we have to read the whole stripe,
                        encode it with the 1 new byte, and then write it
                        back again. The next write, at offset 2561 and
                        size 1 byte, will again READ-MODIFY-WRITE the
                        whole stripe. This causes bad performance.

                        There are tools and scenarios where this kind of
                        load shows up without users being aware of it.
                        Examples: fio and zip.
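
                        To make the cost concrete, here is a minimal,
                        hypothetical sketch (plain C, illustrative names,
                        not the actual ec xlator code) of how the 1-byte
                        write at offset 2560 expands to a full-stripe
                        read-modify-write in this 4+2 example:

                        #include <stdio.h>

                        #define FRAGMENT_SIZE 512ULL /* bytes per data brick */
                        #define DATA_BRICKS   4ULL   /* the "4" in 4+2 */
                        #define STRIPE_SIZE   (FRAGMENT_SIZE * DATA_BRICKS)

                        int main(void)
                        {
                            unsigned long long offset = 2560, length = 1;

                            /* Round the request out to stripe boundaries. */
                            unsigned long long start = offset - (offset % STRIPE_SIZE);
                            unsigned long long end = offset + length;
                            if (end % STRIPE_SIZE)
                                end += STRIPE_SIZE - (end % STRIPE_SIZE);

                            /* Prints: write [2560, 2561) -> stripe [2048, 4096):
                             * 2048 bytes read, re-encoded and written back. */
                            printf("write [%llu, %llu) -> stripe [%llu, %llu): "
                                   "%llu bytes read, re-encoded and written back\n",
                                   offset, offset + length, start, end, end - start);
                            return 0;
                        }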

                        Solution:
                        One possible solution to deal with this issue is
                        to keep the last stripe in memory. This way, we
                        do not need to read it again, and we save a READ
                        fop going over the network. Considering the above
                        example, we would have to keep the last 2048
                        bytes (maximum) in memory per file. This should
                        not be a big deal, as we already keep some data
                        like xattrs and size info in memory and take
                        decisions based on it.

                        Please provide your thoughts on this, and also
                        let us know if you have any other solution.
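
                        A minimal sketch of the intended check (made-up
                        names, not the real ec structures), assuming the
                        2048-byte stripe above: if a write starts exactly
                        where the cached tail ends, the stripe can be
                        completed in memory and the READ fop skipped.

                        #define STRIPE_SIZE 2048ULL  /* 4 x 512 for 4+2 */

                        /* Hypothetical cache of the partially filled last
                         * stripe, kept while the inode lock is held. */
                        struct stripe_cache {
                            unsigned long long offset;  /* stripe-aligned start  */
                            unsigned long long filled;  /* valid bytes in data[] */
                            char data[STRIPE_SIZE];     /* decoded user data     */
                        };

                        /* Returns 1 when the incoming write appends right
                         * after the cached bytes, so no READ is needed. */
                        static int can_skip_read(const struct stripe_cache *c,
                                                 unsigned long long offset)
                        {
                            return c->filled < STRIPE_SIZE &&
                                   offset == c->offset + c->filled;
                        }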


                    Just adding more details.
                    The stripe will be in memory only while the lock on
                    the inode is active.


                I think that's ok.

                    One thing we are yet to decide on is: do we want to
                    read the stripe every time we get the lock, or only
                    after an extending write is performed? I think
                    keeping the stripe in memory just after an extending
                    write is better, as it doesn't involve an extra
                    network operation.


                I wouldn't read the last stripe unconditionally every
                time we lock the inode. There's no benefit at all on
                random writes (in fact it's worse) and a sequential
                write will issue the read anyway when needed. The only
                difference is a small delay for the first operation
                after a lock.


            Yes, perfect.



                What I would do is to keep the last stripe of every
                write (we can consider doing it per fd), even if it's
                not the last stripe of the file (to also optimize
                sequential rewrites).


            Ah! Good point. But if we remember it per fd, one fd's
            cached data can be overwritten on disk by another fd, so we
            also need to do cache invalidation.


        We only cache data if we have the inodelk, so all related fds
        must be from the same client, and we control all their writes,
        so cache invalidation in this case is pretty easy.

        There is still the possibility of two fds from the same client
        writing to the same region. To handle this we would need some
        range checking on the writes, but all of this is local, so it's
        easy to control.

        Anyway, this is probably not a common case, so we could start by
        caching only the last stripe of the last write, ignoring the fd.

            Maybe the implementation should consider this possibility.
            I am yet to think about how to do this, but it is a good
            point. We should consider it.


        Maybe we could keep a list of cached stripes, sorted by offset,
        in the inode (if the maximum number of entries is small, we
        could keep the list unsorted). Each fd should store the offset
        of its last write. Cached stripes should have a ref counter
        just to account for the case where two fds point to the same
        offset.

        When a new write arrives, we check the offset stored in the fd
        and see if it corresponds to a sequential write. If so, we look
        at the inode list to find the cached stripe; otherwise we can
        release the cached stripe.

        We can limit the number of cached entries and release the least
        recently used one when we reach some maximum.
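
        As a rough sketch, the state this would need could look like
        the following (hypothetical names and layout, not actual ec
        code; a simple singly linked list stands in for whatever list
        type the code base already provides):

        #include <stddef.h>
        #include <stdint.h>

        #define EC_STRIPE_CACHE_MAX 8    /* example limit before LRU eviction */

        /* One cached, decoded stripe; lives on a per-inode list. */
        struct ec_stripe {
            struct ec_stripe *next;      /* per-inode list (small, so it can
                                            even be kept unsorted)           */
            uint64_t          offset;    /* stripe-aligned offset            */
            int               refs;      /* fds currently pointing at it     */
            char             *data;      /* decoded stripe contents          */
        };

        /* Per-inode cache, valid only while we hold the inodelk. */
        struct ec_inode_cache {
            struct ec_stripe *stripes;
            int               count;     /* evict LRU at EC_STRIPE_CACHE_MAX */
        };

        /* Per-fd state: where the last write on this fd ended, so the
         * next write can be recognized as sequential. */
        struct ec_fd_cache {
            uint64_t last_write_end;
        };

        static struct ec_stripe *
        ec_stripe_lookup(struct ec_inode_cache *cache, uint64_t offset)
        {
            struct ec_stripe *s;

            /* Linear scan is fine for a handful of entries. */
            for (s = cache->stripes; s != NULL; s = s->next)
                if (s->offset == offset)
                    return s;
            return NULL;
        }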


    Yeah, this works :-).
    Ashish,
        Can all of this be implemented by 3.12?






                One thing I've observed is that a 'dd' with a block size
                of 1MB gets split into multiple 128KB blocks that are
                sent in parallel and not necessarily processed in
                sequential order. This means that big block sizes won't
                benefit much from this optimization, since they will be
                seen as partially non-sequential writes. Anyway, the
                change won't hurt.


            In this case, as per the solution, we won't cache anything,
            right? Because we didn't request anything from the disk. We
            will only keep data in the cache if it is an unaligned
            write at the current EOF. At least that is what I had in
            mind.


        Suppose we are writing multiple 1MB blocks starting at offset 1.
        If each write is split into 8 blocks of 128KB, all writes will
        be unaligned and can be received in any order. Suppose that the
        first write happens to be at offset 128K + 1. We don't have
        anything cached, so we read the needed stripes and cache the
        last one. Now the next write is at offset 1. In this case we
        won't get any benefit from the previous write, since the stripe
        we need is not cached. However, from the user's point of view
        the write is sequential.

        It won't hurt, but it won't get the full benefit of the new
        caching mechanism.

        As a mitigating factor, we could consider extending the
        previous solution I've explained to allow caching multiple
        stripes per fd. A small number like 8 would be enough.
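
        To make the arithmetic concrete (a toy calculation, assuming
        the 2048-byte stripe from the earlier example and the 128KB
        split described above): each of the 8 parallel sub-writes of a
        1MB request at offset 1 ends 1 byte into a different stripe, so
        a per-fd cache of 8 stripes (8 x 2048 = 16KB) covers every tail
        that can be in flight at once.

        #include <stdio.h>

        #define STRIPE_SIZE 2048ULL          /* 4 x 512 from the example */
        #define SUB_WRITE   (128ULL << 10)   /* observed 128KB pieces    */

        int main(void)
        {
            unsigned long long base = 1;     /* 1MB write at offset 1    */
            unsigned long long i;

            /* Each sub-write ends 1 byte into its own stripe, so up to
             * 8 distinct tail stripes can be live at the same time. */
            for (i = 0; i < 8; i++) {
                unsigned long long off = base + i * SUB_WRITE;
                unsigned long long end = off + SUB_WRITE;
                printf("sub-write %llu: [%llu, %llu) -> tail stripe at %llu\n",
                       i, off, end, end - (end % STRIPE_SIZE));
            }
            return 0;
        }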

        Xavi




                Xavi





                        ---
                        Ashish



                        _______________________________________________
                        Gluster-devel mailing list
                        Gluster-devel@xxxxxxxxxxx
                        http://lists.gluster.org/mailman/listinfo/gluster-devel




                    --
                    Pranith





            --
            Pranith





    --
    Pranith




--
Pranith




--
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel
