Re: Patch for "Striped" read from AFR volumes

Gareth Bult <gareth@xxxxxxxxxxxxx> · Mon, 31 Dec 2007 18:51:43 +0000 (GMT)

Hi,

If I understand what you're saying, you effectively want to tie a specific file to a specific node?

The proposed patch (at least in concept) might work well for large files, but if you tie files to nodes, a large file would not gain any of the benefit of these striped reads ...

???

----- Original Message -----
From: "Krishna Srinivas" <krishna@xxxxxxxxxxxxx>
To: "Csibra Gergo" <gergo@xxxxxxxxx>
Cc: gluster-devel@xxxxxxxxxx
Sent: Monday, December 31, 2007 6:14:24 PM (GMT) Europe/London
Subject: Re: Patch for "Striped" read from AFR volumes

Hi Csibra,

The patch contribution is really appreciated. I did not verify the
correctness of the code, but I can see that you are doing round-robin (RR)
of readv(). However, making read()s round-robin will (theoretically)
decrease performance, as we won't be taking advantage of the kernel's
read-ahead algorithm. A better approach would be to have a given file read
from the same child every time (even on the next open) but have different
files read from different children. A good way of deciding which child to
read from is (inode_number % child_count); this change is in the TLA
repository. Could you test how your patch performs against the TLA source?
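
To make the (inode_number % child_count) idea concrete, here is a minimal
sketch. It is not the code from the TLA repository: the function name, the
child_up array, and the fallback to the next available child are all
assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper, not GlusterFS code: map a file to a child by its
 * inode number so the same file is always read from the same child.
 * child_up[i] is nonzero when child i is available (an assumed flag array). */
static int pick_child_by_inode(uint64_t inode_number, const int *child_up,
                               int child_count)
{
    if (child_count <= 0)
        return -1;

    int preferred = (int)(inode_number % (uint64_t)child_count);

    /* Assumption of this sketch: if the preferred child is down, fall back
     * to the next available one, wrapping around. */
    for (int step = 0; step < child_count; step++) {
        int candidate = (preferred + step) % child_count;
        if (child_up[candidate])
            return candidate;
    }
    return -1; /* no child available */
}
```

Because the result depends only on the inode number and on which children
are up, repeated opens of one file land on the same child, while different
files spread across the children.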

Check doc/translator-option.txt for the AFR options (option read-subvolume).

A better way to define striped reads would be: if a read request comes in
for 1 MB, get 0.5 MB from the first child and 0.5 MB from the second child
and combine the reads. However, we are not sure about the performance gain
with this approach either.
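
A rough sketch of that split, using plain POSIX pread() on two file
descriptors that stand in for the two children. The function name is
hypothetical and this is not how AFR winds calls to its children. The two
reads here run one after the other, so any real gain would require issuing
them concurrently.

```c
#define _XOPEN_SOURCE 700 /* for pread() under strict compiler modes */
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: satisfy one read of `size` bytes at `offset` by
 * taking the first half from fd_a and the second half from fd_b, where both
 * descriptors refer to replicas of the same file. Returns the number of
 * bytes placed in buf, or -1 on error. */
static ssize_t split_pread(int fd_a, int fd_b, void *buf, size_t size,
                           off_t offset)
{
    size_t first = size / 2;
    size_t second = size - first;
    char *out = buf;

    ssize_t got_a = pread(fd_a, out, first, offset);
    if (got_a < 0)
        return -1;
    if ((size_t)got_a < first)
        return got_a; /* short read (for example EOF): nothing to append */

    ssize_t got_b = pread(fd_b, out + first, second, offset + (off_t)first);
    if (got_b < 0)
        return -1;

    return got_a + got_b;
}
```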

Thanks
Krishna

On Dec 31, 2007 9:44 PM, Csibra Gergo <gergo@xxxxxxxxx> wrote:
> Hi,
>
> Apply the following patch to read AFR volumes like RAID0 volumes. The
> current implementation of AFR reads every block from the first child
> if it is available. This simple patch cycles through all available
> children instead: every afr_readv call reads from the child after the
> one the previous call read from. So if you have 4 children, the first
> block is read from the 1st, the next from the 2nd, the next from the
> 3rd, the next from the 4th, and then it starts over with the 1st.
>
> to apply this patch
> cd xlators/cluster/afr/src
> patch -p0 <afr_striped_read_1.3.7.diff
> make
> make install
>
> patch also available here:
> http://www.csibra.hu/glusterfs/afr_striped_read_1.3.7.diff
>
> As you can see, this patch is against version 1.3.7.
>
> here's the patch:
> >>>>CUT HERE<<<<
> *** /root/afr.c 2007-10-17 17:40:37.000000000 +0200
> --- afr.c       2007-12-31 16:51:38.000000000 +0100
> ***************
> *** 2448,2453 ****
> --- 2448,2469 ----
>         if (afrfdp->fdstate[i])
>           break;
>         }
> +       if(i == pvt->child_count) {
> +         // if we reached the last child, check whether any children are still unread
> +         data_t *fr = dict_get(local->fd->ctx, "first_read");
> +       if(fr) {
> +         int32_t frd = data_to_int32(fr);
> +         // frd holds the index of the first child that was read
> +         if(frd > 0) {
> +           // if the first child read was not the first physical child, restart the child search
> +           i = 0;
> +           for (; i < pvt->child_count; i++) {
> +             if (afrfdp->fdstate[i])
> +               break;
> +           }
> +         }
> +       }
> +       }
>         if (i < pvt->child_count) {
>                 STACK_WIND (frame,
>                     afr_readv_cbk,
> ***************
> *** 2492,2501 ****
>     local->size = size;
>     local->fd = fd;
>
> !   for (i = 0; i < child_count; i++) {
>       if (afrfdp->fdstate[i] && pvt->state[i])
>         break;
>     }
>     if (i == child_count) {
>       STACK_UNWIND (frame, -1, ENOTCONN, NULL, 0, NULL);
>     } else {
> --- 2508,2548 ----
>     local->size = size;
>     local->fd = fd;
>
> !   int32_t next_child, first_read = 0;
> !   data_t *nxtc = dict_get(fd->ctx, "next_child");
> !   if(nxtc) {
> !     next_child = data_to_int32(nxtc);
> !   } else {
> !     next_child = -1;
> !     first_read = 1;
> !   }
> !   next_child++;
> !   if(next_child == child_count) {
> !     next_child = 0;
> !   }
> !
> !   for (i = next_child; i < child_count; i++) {
>       if (afrfdp->fdstate[i] && pvt->state[i])
>         break;
>     }
> +
> +   if(i == child_count) {
> +     i = 0;
> +     for (i = 0; i < child_count; i++) {
> +       if (afrfdp->fdstate[i] && pvt->state[i])
> +       break;
> +     }
> +     if(i == child_count) {
> +       next_child = 0;
> +     } else {
> +       next_child = i;
> +     }
> +   }
> +   dict_set(fd->ctx, "next_child", data_from_int32(next_child));
> +   if(first_read) {
> +       dict_set(fd->ctx, "first_read", data_from_int32(i));
> +   }
> +
>     if (i == child_count) {
>       STACK_UNWIND (frame, -1, ENOTCONN, NULL, 0, NULL);
>     } else {
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
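
For reference, the round-robin selection described in the quoted message can
be reduced to a small wrap-around search. This is a simplified sketch, not
the patch itself: the function name and the child_up array are hypothetical,
and the caller is assumed to store the returned index as the new last_child.

```c
/* Hypothetical helper: starting just after last_child, return the index of
 * the next available child, wrapping around to the start of the list.
 * Pass last_child = -1 for the first read. Returns -1 if no child is up. */
static int next_available_child(int last_child, const int *child_up,
                                int child_count)
{
    for (int step = 1; step <= child_count; step++) {
        int candidate = (last_child + step) % child_count;
        if (child_up[candidate])
            return candidate;
    }
    return -1; /* no child available */
}
```

With four children that are all up, successive calls yield 0, 1, 2, 3, 0,
and so on; a child that is down is simply skipped.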