Re: Disperse volume : Sequential Writes

On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
On 15/06/17 11:50, Pranith Kumar Karampuri wrote:


On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspandey@xxxxxxxxxx> wrote:

    Hi All,

    We have been facing some issues with disperse (EC) volumes.
    We know that EC currently performs poorly on random I/O, because a
    write requires a READ-MODIFY-WRITE fop cycle whenever its offset or
    offset+length falls in the middle of a stripe.

    Unfortunately, this can also happen with sequential writes.
    Consider an EC volume with a 4+2 configuration. The stripe size for
    this would be 512 * 4 = 2048. That is, 2048 bytes of user data are
    stored in one stripe.
    Let's say 2048 + 512 = 2560 bytes are already written on this
    volume, so 512 bytes are in the second stripe.
    Now, if there is a sequential write at offset 2560 with a size of
    1 byte, we have to read the whole stripe, encode it with the new
    byte, and then write it back.
    The next write, at offset 2561 with a size of 1 byte, will again
    READ-MODIFY-WRITE the whole stripe. This causes bad performance.
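
The arithmetic in this example can be checked with a small sketch. It is only an illustration of the alignment rule described above, not the EC translator's code; the names `needs_read_modify_write` and `touched_stripes` are made up for this example, and the check deliberately ignores special cases such as writes past EOF.

```python
FRAGMENT_SIZE = 512   # bytes per fragment, as in the example above
DATA_BRICKS = 4       # the "4" in a 4+2 configuration
STRIPE_SIZE = FRAGMENT_SIZE * DATA_BRICKS  # 2048 bytes of user data per stripe


def needs_read_modify_write(offset, length, stripe_size=STRIPE_SIZE):
    """Return True if the write starts or ends in the middle of a stripe.

    Simplified, hypothetical rule: any write that is not aligned on both
    ends is treated as needing a READ-MODIFY-WRITE cycle.
    """
    end = offset + length
    return offset % stripe_size != 0 or end % stripe_size != 0


def touched_stripes(offset, length, stripe_size=STRIPE_SIZE):
    """Return the zero-based indices of the stripes covered by the write."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return list(range(first, last + 1))


# The two 1-byte writes from the example both land in the second
# stripe (index 1) and both are unaligned.
for offset in (2560, 2561):
    print(offset, touched_stripes(offset, 1), needs_read_modify_write(offset, 1))
```

A write of exactly one stripe at offset 2048, by contrast, is aligned on both ends and would not need the read under this rule.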

    There are some tools and scenarios that generate this kind of load
    without users being aware of it.
    Examples: fio and zip

    Solution:
    One possible solution to this issue is to keep the last stripe
    in memory.
    That way we do not need to read it again, and we save a READ fop
    going over the network.
    In the above example, we would have to keep at most the last 2048
    bytes in memory per file. This should not be a big deal, as we
    already keep some data such as xattrs and size info in memory and
    make decisions based on it.
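
A minimal simulation of this idea, assuming a 2048-byte stripe and an in-memory stand-in for the bricks. `FakeBackend`, `CachedWriter` and `count_reads` are hypothetical names for this sketch and do not correspond to anything in the EC translator; locking is reduced to an `invalidate()` call that a real implementation would have to make when the inode lock is released.

```python
STRIPE_SIZE = 2048  # 512 * 4, as in the 4+2 example


class FakeBackend:
    """In-memory stand-in for the bricks; counts stripe reads."""

    def __init__(self):
        self.stripes = {}
        self.reads = 0

    def read_stripe(self, index):
        self.reads += 1  # each call models one READ fop over the network
        return bytearray(self.stripes.get(index, bytes(STRIPE_SIZE)))

    def write_stripe(self, index, data):
        self.stripes[index] = bytes(data)


class CachedWriter:
    """Writes through the backend, optionally remembering the last stripe."""

    def __init__(self, backend, use_cache=True):
        self.backend = backend
        self.use_cache = use_cache
        self.cached_index = None
        self.cached_data = None

    def invalidate(self):
        # Assumed hook: drop the cached stripe when the inode lock is released.
        self.cached_index = None
        self.cached_data = None

    def write(self, offset, data):
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, STRIPE_SIZE)
            chunk = data[pos:pos + STRIPE_SIZE - start]
            if len(chunk) == STRIPE_SIZE:
                stripe = bytearray(chunk)  # full stripe: no read needed
            elif self.use_cache and self.cached_index == index:
                stripe = self.cached_data  # cache hit: no read needed
                stripe[start:start + len(chunk)] = chunk
            else:
                stripe = self.backend.read_stripe(index)  # read-modify-write
                stripe[start:start + len(chunk)] = chunk
            self.backend.write_stripe(index, stripe)
            if self.use_cache:
                self.cached_index = index
                self.cached_data = bytearray(stripe)
            pos += len(chunk)


def count_reads(use_cache):
    """Issue 100 sequential 1-byte writes starting at offset 2560."""
    backend = FakeBackend()
    writer = CachedWriter(backend, use_cache)
    for i in range(100):
        writer.write(2560 + i, b"x")
    return backend.reads


print("reads without cache:", count_reads(False))
print("reads with cache:", count_reads(True))
```

In this toy model the 100 one-byte writes cost 100 stripe reads without the cache and a single read with it. The writes themselves still go to the backend every time; only the READ fop is saved.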

    Please share your thoughts on this, and any other solution you
    may have.


Just adding more details.
The stripe will be in memory only while the lock on the inode is active.

I think that's ok.

One thing we are yet to decide on is: do we want to read the stripe
every time we get the lock, or only after an extending write is performed?
I think keeping the stripe in memory only after an extending write
is better, as it doesn't involve an extra network operation.

I wouldn't read the last stripe unconditionally every time we lock the inode. There's no benefit at all on random writes (in fact it's worse) and a sequential write will issue the read anyway when needed. The only difference is a small delay for the first operation after a lock.

Yes, perfect.
 

What I would do is keep the last stripe of every write (we could consider doing it per fd), even if it's not the last stripe of the file (to also optimize sequential rewrites).

Ah! Good point. But if we remember it per fd, one fd's cached data can be overwritten on disk by another fd, so we also need cache invalidation. Maybe the implementation should consider this possibility. I have yet to think about how to do this, but it is a good point and we should consider it.
 

One thing I've observed is that a 'dd' with block size of 1MB gets split into multiple 128KB blocks that are sent in parallel and not necessarily processed in the sequential order. This means that big block sizes won't benefit much from this optimization since they will be seen as partially non-sequential writes. Anyway the change won't hurt.

In this case, as per the solution, we won't cache anything, right? Because we didn't request anything from the disk. We will only keep the data in cache for an unaligned write at the current EOF. At least that is what I had in mind.
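
One possible reading of that caching rule, as a sketch: cache the tail stripe only after an extending write whose end is not stripe-aligned. `should_cache_tail` and `split_write` are hypothetical helpers for this illustration, and the 2048-byte stripe is the one from the 4+2 example. Under this rule, the 128KB pieces of a 1MB 'dd' block never trigger caching, whatever order they arrive in, because every piece starts and ends on a stripe boundary.

```python
STRIPE_SIZE = 2048  # 512 * 4, as in the 4+2 example


def should_cache_tail(offset, length, file_size, stripe_size=STRIPE_SIZE):
    """Assumed policy: cache only after an unaligned extending write.

    The write must end past the current EOF (it extends the file) and must
    leave a partially filled last stripe behind.
    """
    end = offset + length
    extends_file = end > file_size
    return extends_file and end % stripe_size != 0


def split_write(offset, length, block=128 * 1024):
    """Split one large write into (offset, length) pieces of at most `block`."""
    stop = offset + length
    return [(o, min(block, stop - o)) for o in range(offset, stop, block)]


# The 1-byte append from the original example is cached ...
print(should_cache_tail(2560, 1, file_size=2560))

# ... but none of the eight 128KB pieces of a 1MB write is, even when
# they are processed in reverse order.
pieces = split_write(0, 1024 * 1024)
print(len(pieces), any(should_cache_tail(o, n, file_size=0) for o, n in reversed(pieces)))
```

Here 128KB = 131072 bytes is 64 whole stripes, so no piece needs a read in the first place and there is nothing to keep in memory.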
 

Xavi





    ---
    Ashish



    _______________________________________________
    Gluster-devel mailing list
    Gluster-devel@xxxxxxxxxxx <mailto:Gluster-devel@gluster.org>
    http://lists.gluster.org/mailman/listinfo/gluster-devel
    <http://lists.gluster.org/mailman/listinfo/gluster-devel>




--
Pranith



