Some problems with afr


 



Over the past few days I have tested afr as carefully as I can, and I have found some problems:
1. The mop stats problem. Afr does not have its own mop->stats function, so it falls back to the default stats function. This causes a problem when gluster is configured as unify + afr. I have fixed this bug and will share the patch later.

2. The first_up_child problem. From reading the source code and the afr webpage, I found that some of the fops (readdir, for example) use the first up child of afr to perform the operation. This obviously does not always give correct results. Consider an afr made of subvolumes client0 and client1, where client0 connects to brick0 on server0 and client1 connects to brick1 on server1. After hours of running normally, server0 goes down, so client0 loses its connection and afr has to use client1 as its first_up_child. When server0 is restored, afr uses client0 as the first_up_child again. If the user does not remember which files were created while server0 was down, some of those files will not be recovered by "ls -lsR" until the next time server0 stops.
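To make the scenario concrete, here is a toy model in Python. It is not GlusterFS code: each brick is just a set of file names, and the helper names (first_up_child, readdir, create) are made up for illustration. It only assumes what is described above, namely that readdir is answered by the first child that is up.

```python
# Toy model only: two replicas of one directory, each a set of file names.

def first_up_child(up):
    """Return the index of the first child that is up, or None."""
    for index, is_up in enumerate(up):
        if is_up:
            return index
    return None


def readdir(bricks, up):
    """List the directory from the first up child only."""
    child = first_up_child(up)
    if child is None:
        raise RuntimeError("no child is up")
    return sorted(bricks[child])


def create(bricks, up, name):
    """Create a file on every child that is currently up."""
    for index, is_up in enumerate(up):
        if is_up:
            bricks[index].add(name)


bricks = [set(), set()]       # brick0 (client0), brick1 (client1)
up = [True, True]

create(bricks, up, "a.txt")   # both children are up
up[0] = False                 # server0 goes down
create(bricks, up, "b.txt")   # lands on brick1 only
listing_while_down = readdir(bricks, up)     # answered by client1
up[0] = True                  # server0 is restored, nothing healed yet
listing_after_restore = readdir(bricks, up)  # answered by client0 again

print(listing_while_down)     # ['a.txt', 'b.txt']
print(listing_after_restore)  # ['a.txt']
```

In this model b.txt disappears from the listing as soon as server0 comes back, although it still exists on brick1.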

My opinion: 1) first_up_child could be replaced by a first_reference_child. For example, when afr first starts up, the first_reference_child is client0. When client0 stops, it becomes client1, and it stays client1 even after client0 comes back; only if client1 dies does it switch back to client0. First_reference_child may have problems of its own, but I think it would do a better job than first_up_child. 2) File recovery should be done automatically when a client is restored, so you may need some logs for this. This would solve the first_up_child problem completely.
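A minimal sketch of the first_reference_child idea from point 1), again in Python rather than in the translator's C. The class and method names are hypothetical. The only rule it implements is the one proposed above: keep the current reference child while it is up, and move to another child only when the reference itself goes down.

```python
class ReferenceChildSelector:
    """Hypothetical 'sticky' selector: the reference child changes only
    when the current reference goes down, never because an earlier
    child came back."""

    def __init__(self, child_count):
        self.up = [True] * child_count
        self.reference = 0            # start with the first child

    def set_up(self, child, is_up):
        """Record a child going up or down and fail over if needed."""
        self.up[child] = is_up
        if not self.up[self.reference]:
            self._fail_over()

    def _fail_over(self):
        # Try the other children in round-robin order after the reference.
        count = len(self.up)
        for offset in range(1, count):
            candidate = (self.reference + offset) % count
            if self.up[candidate]:
                self.reference = candidate
                return
        # No child is up: keep the old reference; current() reports None.

    def current(self):
        """Return the reference child, or None if every child is down."""
        return self.reference if self.up[self.reference] else None


selector = ReferenceChildSelector(2)
history = [selector.current()]        # client0 at startup
selector.set_up(0, False)             # server0 goes down
history.append(selector.current())    # client1 takes over
selector.set_up(0, True)              # server0 comes back
history.append(selector.current())    # still client1
selector.set_up(1, False)             # server1 goes down
history.append(selector.current())    # back to client0
print(history)                        # [0, 1, 1, 0]
```

As noted above, this does not remove the need for recovery: after the last step client0 is the reference again and may still be missing files unless it was healed in the meantime.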

3. Wrong file recovery. The problem is very simple to reproduce. Create a directory named DIR in the gluster root directory (say /mnt/gl), and create a file named FILE in DIR (/mnt/gl/DIR/FILE); do all of this while all the subvolumes of afr (say client0 and client1) are up. Then take server0 down and run "rm -rf DIR" under /mnt/gl. After that, restore server0 so that client0 is working again, and run "ls -lsR" under /mnt/gl. You will see that DIR and FILE are still there.
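The same steps can be replayed on a toy model. This is only a sketch under a strong assumption: that the sync triggered by "ls -lsR" copies any entry missing on one child from the other child. The real afr logic may differ, and naive_heal is a made-up name.

```python
def naive_heal(bricks):
    """Assumed behaviour: copy every entry found on any brick to all bricks."""
    merged = set().union(*bricks)
    for brick in bricks:
        brick |= merged           # in-place update of each brick


# Both children up: DIR and DIR/FILE exist on both bricks.
brick0 = {"DIR", "DIR/FILE"}
brick1 = {"DIR", "DIR/FILE"}

# server0 is down: "rm -rf DIR" reaches brick1 only.
brick1 -= {"DIR", "DIR/FILE"}

# server0 is restored and "ls -lsR" triggers the sync.
naive_heal([brick0, brick1])

print(sorted(brick1))             # ['DIR', 'DIR/FILE']
```

With nothing recording that DIR was deleted, this kind of sync cannot tell "deleted on brick1" from "never created on brick1", so the stale copy on brick0 wins.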

This problem exists because gluster fails to remove DIR during the data sync process when client0 comes back up, so it recovers the files in the reverse direction. I have written a routine that recursively removes the files in the directory and finally deletes the directory itself. Sometimes it prevents the wrong file recovery, but sometimes it fails. The reason seems complicated; I will post more later.
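For reference, the shape of such a routine (remove the contents first, then the directory itself) looks roughly like this in Python on a local filesystem. This is not my afr patch, which has to work through the translator on the stale child; the name remove_tree and the temporary directory are only for illustration.

```python
import os
import tempfile


def remove_tree(path):
    """Depth-first removal: delete the contents, then the directory itself."""
    with os.scandir(path) as it:
        entries = list(it)        # snapshot before deleting anything
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            remove_tree(entry.path)
        else:
            os.unlink(entry.path)
    os.rmdir(path)                # the directory is empty at this point


# Small demonstration in a scratch directory.
root = tempfile.mkdtemp()
target = os.path.join(root, "DIR")
os.makedirs(os.path.join(target, "SUBDIR"))
for name in ("FILE", os.path.join("SUBDIR", "FILE2")):
    with open(os.path.join(target, name), "w") as handle:
        handle.write("data")

remove_tree(target)
still_exists = os.path.exists(target)
os.rmdir(root)                    # clean up the scratch directory
print(still_exists)               # False
```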


Thanks for your attention. I hope I have described my views clearly, given how badly I needed sleep while writing this.
