Re: Sharding - Inode write fops - recoverability from failures - design

From: "Vijay Bellur" <vbellur@xxxxxxxxxx>
To: "Krutika Dhananjay" <kdhananj@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Tuesday, February 24, 2015 4:13:13 PM
Subject: Re: Sharding - Inode write fops - recoverability from failures - design

On 02/24/2015 01:53 PM, Krutika Dhananjay wrote:
>
>
> ------------------------------------------------------------------------
>
>     *From: *"Vijay Bellur" <vbellur@xxxxxxxxxx>
>     *To: *"Krutika Dhananjay" <kdhananj@xxxxxxxxxx>
>     *Cc: *"Gluster Devel" <gluster-devel@xxxxxxxxxxx>
>     *Sent: *Tuesday, February 24, 2015 12:26:58 PM
>     *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
>     recoverability from failures - design
>
>     On 02/24/2015 12:19 PM, Krutika Dhananjay wrote:
>      >
>      >
>      >
>     ------------------------------------------------------------------------
>      >
>      >     *From: *"Vijay Bellur" <vbellur@xxxxxxxxxx>
>      >     *To: *"Krutika Dhananjay" <kdhananj@xxxxxxxxxx>
>      >     *Cc: *"Gluster Devel" <gluster-devel@xxxxxxxxxxx>
>      >     *Sent: *Tuesday, February 24, 2015 11:35:28 AM
>      >     *Subject: *Re: Sharding - Inode write fops -
>      >     recoverability from failures - design
>      >
>      >     On 02/24/2015 10:36 AM, Krutika Dhananjay wrote:
>      >      >
>      >      >
>      >      >
>      >
>     ------------------------------------------------------------------------
>      >      >
>      >      >     *From: *"Vijay Bellur" <vbellur@xxxxxxxxxx>
>      >      >     *To: *"Krutika Dhananjay" <kdhananj@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
>      >      >     *Sent: *Monday, February 23, 2015 5:25:57 PM
>      >      >     *Subject: *Re: Sharding - Inode write fops - recoverability from failures - design
>      >      >
>      >      >     On 02/22/2015 06:08 PM, Krutika Dhananjay wrote:
>      >      >      > Hi,
>      >      >      >
>      >      >      > Please find the design doc for one of the problems in
>      >      >      > sharding which Pranith and I are trying to solve and its
>      >      >      > solution @ http://review.gluster.org/#/c/9723/1.
>      >      >      > Reviews and feedback are much appreciated.
>      >      >      >
>      >      >
>      >      >     Can this feature be made optional? I think there are use
>      >      >     cases like virtual machine image storage, hdfs etc. where
>      >      >     the number of metadata queries might not be very high. It
>      >      >     would be an acceptable tradeoff in such cases to not be
>      >      >     very efficient for answering metadata queries but be very
>      >      >     efficient for data operations.
>      >      >
>      >      >     IOW, can we have two possible modes of operation for the
>      >      >     sharding translator to answer metadata queries?
>      >      >
>      >      >     1. One that behaves like a regular filesystem where we
>      >      >     expect a mix of data and metadata operations. Your document
>      >      >     seems to cover that part well. We can look at optimizing
>      >      >     behavior for multi-threaded single writer use cases after
>      >      >     an initial implementation is in place. Techniques like
>      >      >     eager locking can be applied here.
>      >      >
>      >      >     2. Another mode where we do not expect a lot of metadata
>      >      >     queries. In this mode, we can visit all nodes where we have
>      >      >     shards to answer these queries.
>      >      >
>      >      > But for sharding translator to be able to visit all shards, it
>      >      > is required to know the last shard number. Without this, it
>      >      > will never know when to stop looking up the different shards.
>      >      > For this to happen, we still need to maintain the size
>      >      > attribute for each file.
>      >      >
>      >
>      >     Wouldn't maintaining the total number of shards in the metadata
>      >     shard be
>      >     sufficient?
>      >
>      > Maintaining the correctness of "total number of shards" would again
>      > incur the same cost as maintaining size or any other metadata
>      > attribute if a client/brick crashes in the middle of a write fop
>      > before the attribute is committed to disk.
>      > In other words, we will again need to maintain a "dirty" and
>      > "committed" copy of the shard_count to ensure its correctness.
>      >
>
>     I think the cost of maintaining "total number of shards" is not as
>     expensive as maintaining size or any other metadata attribute. The
>     shard
>     count needs to be updated only when an extending operation results in
>     the creation of a new shard or when a truncate operation results in the
>     removal of a shard. Maintaining other metadata attributes would need
>     a 5
>     phase transaction for every write operation. Isn't that the case?
>
> Even size attribute changes only in case of extending writes and
> truncates. In fact, Pranith and I had
> initially chosen to persist shard count as opposed to size in the first
> design for inode write fops.
> But the reason we decided to go with size in the end is to prevent extra
> lookup on the last shard to
> find the total size of the file (i.e., if N is the total number of
> shards, file size = (N-1)*shard_block_size + sizeof(last shard)).
>
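
To make the arithmetic concrete, here is a minimal sketch of that computation
(the names shard_count, shard_block_size and last_shard_size are illustrative
and not taken from the shard translator; obtaining last_shard_size is the
extra lookup on the last shard mentioned above):

    #include <stdint.h>

    /*
     * Illustrative only: total file size from the formula quoted above,
     *   file size = (N - 1) * shard_block_size + sizeof(last shard).
     */
    static uint64_t
    file_size_from_shards (uint64_t shard_count,       /* N */
                           uint64_t shard_block_size,  /* fixed shard size */
                           uint64_t last_shard_size)   /* size of shard N-1 */
    {
            if (shard_count == 0)
                    return 0;
            return (shard_count - 1) * shard_block_size + last_shard_size;
    }

For example, with 4 shards, a 4 MB shard block size and a 1 MB last shard,
this gives (4 - 1) * 4 MB + 1 MB = 13 MB.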

I am probably confused about the definition of size. By size, I mean the
total size of the file in bytes. For maintaining accurate size, wouldn't we
need to account for truncates and writes that happen within the scope of one
shard?

Correct. This particular increase/decrease in size can be deduced from the change in ia_size between postbuf and prebuf in the respective callback.
-Krutika
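
A simplified model of that calculation, for illustration only (the struct
below merely mimics the ia_size field of an iatt and is not the real
GlusterFS structure):

    #include <stdint.h>

    /* Simplified stand-in for the ia_size field of an iatt. */
    struct sized_buf {
            uint64_t ia_size;
    };

    /*
     * prebuf/postbuf are the attributes returned from before and after a
     * write or truncate on one shard; the change in size contributed by
     * that shard is their difference (negative when the shard shrinks).
     */
    static int64_t
    shard_size_delta (const struct sized_buf *prebuf,
                      const struct sized_buf *postbuf)
    {
            return (int64_t)postbuf->ia_size - (int64_t)prebuf->ia_size;
    }

A write that grows a shard yields a positive delta and a truncate that
shrinks it a negative one, so the file's size attribute only needs to be
adjusted by that amount.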


-Vijay




_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
