On Thu, Jun 27, 2019 at 12:17:10PM +0530, Nithya Balachandran wrote:
> There are some edge cases that may prevent a file from being migrated
> during a remove-brick. Please do the following after this:
>
> 1. Check the remove-brick status for any failures. If there are any,
> check the rebalance log file for errors.
> 2. Even if there are no failures, check the removed bricks to see if any
> files have not been migrated. If there are any, please check that they are
> valid files on the brick and copy them to the volume from the brick to the
> mount point.

Well, it looks like I hit one of those edge cases, probably because of some issues around a reboot last September which left a handful of files in a state where self-heal identified them as needing to be healed, but was unable to actually heal them. (Check the list archives for "Kicking a stuck heal", posted on Sept 4, if you want more details.)

So I'm getting 9 failures on the arbiter (merlin), 8 on one data brick (gandalf), and 3 on the other (saruman). Looking in /var/log/gluster/palantir-rebalance.log, I see matching numbers of errors like:

migrate file failed: /.shard/291e9749-2d1b-47af-ad53-3a09ad4e64c6.229: failed to lock file on palantir-replicate-1 [Stale file handle]

Also, merlin has four errors, and gandalf has one, of the form:

Gfid mismatch detected for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/0f500288-ff62-4f0b-9574-53f510b4159f.2898>, 9f00c0fe-58c3-457e-a2e6-f6a006d1cfc6 on palantir-client-7 and 08bb7cdc-172b-4c21-916a-2a244c095a3e on palantir-client-1.

There are no gfid mismatches recorded on saruman. All of the gfid mismatches are for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806> and (on saruman) appear to correspond to 0-byte files (e.g., .shard/0f500288-ff62-4f0b-9574-53f510b4159f.2898, in the case of the gfid mismatch quoted above).

For both types of errors, all affected files are in .shard/ and have UUID-style names, so I have no idea which actual files they belong to. File sizes are generally either 0 bytes or exactly 4M, although one of them is slightly over 3M, so I'm assuming they're chunks of larger files (which would be almost all the files on the volume; it primarily holds disk image files for kvm servers).

Web searches generally seem to treat gfid mismatches as a form of split-brain, but `gluster volume heal palantir info split-brain` shows "Number of entries in split-brain: 0" for all bricks, including the ones reporting gfid mismatches.

Given all that, how do I proceed with cleaning up the stale file handle issues? I would guess that this will involve somehow converting each shard filename to a "real" filename, then shutting down the corresponding VM and maybe doing some additional cleanup.

And then there are the gfid mismatches. Since they're for 0-byte files, is it safe to just ignore them on the assumption that they only hold metadata? Or do I need to do some kind of split-brain resolution on them (even though gluster says no files are in split-brain)?
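On the question of mapping a shard name back to a "real" filename: in case it's useful, here's what I was planning to try. I'm assuming (but haven't confirmed) that the UUID prefix of a shard's name is the gfid of the base file, and that the base file is hardlinked under .glusterfs/ on each brick; the brick path below is saruman's and the gfid is just the example from above:

--- cut here ---
# run as root on a brick (saruman's brick path shown here)
BRICK=/var/local/brick0/data
GFID=0f500288-ff62-4f0b-9574-53f510b4159f

# the gfid's hardlink should live under .glusterfs/<first 2 hex>/<next 2 hex>/<gfid>
ls -li "$BRICK/.glusterfs/0f/50/$GFID"

# find the human-readable path on the brick that shares that inode
find "$BRICK" -samefile "$BRICK/.glusterfs/0f/50/$GFID" -not -path '*/.glusterfs/*'
--- cut here ---

If that's wrong, or if there's a cleaner way to get from a shard to the owning image file, I'd be glad to hear it.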
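As for the gfid mismatches, before deciding whether they can be ignored, my plan was to compare the trusted.gfid xattr on each brick's copy of the affected shard (again using saruman's brick path; the other bricks' paths will differ):

--- cut here ---
# run as root on each brick host, adjusting the brick path as needed
getfattr -n trusted.gfid -e hex \
    /var/local/brick0/data/.shard/0f500288-ff62-4f0b-9574-53f510b4159f.2898
--- cut here ---

If the bricks really do report different gfids for the same name, I assume some kind of manual resolution is still needed even though the split-brain count is 0, which is part of what I'm asking about.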
Finally, a listing of /var/local/brick0/data/.shard on saruman, in case any of the information it contains (like file sizes/permissions) might provide clues to resolving the errors:

--- cut here ---
root@saruman:/var/local/brick0/data/.shard# ls -l
total 63996
-rw-rw---- 2 root libvirt-qemu       0 Sep 17  2018 0f500288-ff62-4f0b-9574-53f510b4159f.2864
-rw-rw---- 2 root libvirt-qemu       0 Sep 17  2018 0f500288-ff62-4f0b-9574-53f510b4159f.2868
-rw-rw---- 2 root libvirt-qemu       0 Sep 17  2018 0f500288-ff62-4f0b-9574-53f510b4159f.2879
-rw-rw---- 2 root libvirt-qemu       0 Sep 17  2018 0f500288-ff62-4f0b-9574-53f510b4159f.2898
-rw------- 2 root libvirt-qemu 4194304 May 17 14:42 291e9749-2d1b-47af-ad53-3a09ad4e64c6.229
-rw------- 2 root libvirt-qemu 4194304 Jun 24 09:10 291e9749-2d1b-47af-ad53-3a09ad4e64c6.925
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 26 12:54 2df12cb0-6cf4-44ae-8b0a-4a554791187e.266
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 26 16:30 2df12cb0-6cf4-44ae-8b0a-4a554791187e.820
-rw-r--r-- 2 root libvirt-qemu 4194304 Jun 17 20:22 323186b1-6296-4cbe-8275-b940cc9d65cf.27466
-rw-r--r-- 2 root libvirt-qemu 4194304 Jun 27 05:01 323186b1-6296-4cbe-8275-b940cc9d65cf.32575
-rw-r--r-- 2 root libvirt-qemu 3145728 Jun 11 13:23 323186b1-6296-4cbe-8275-b940cc9d65cf.3448
---------T 2 root libvirt-qemu       0 Jun 28 14:26 4cd094f4-0344-4660-98b0-83249d5bd659.22998
-rw------- 2 root libvirt-qemu 4194304 Mar 13  2018 6cdd2e5c-f49e-492b-8039-239e71577836.1302
---------T 2 root libvirt-qemu       0 Jun 28 13:22 7530a2d1-d6ec-4a04-95a2-da1f337ac1ad.47131
---------T 2 root libvirt-qemu       0 Jun 28 13:22 7530a2d1-d6ec-4a04-95a2-da1f337ac1ad.52615
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 27 08:56 8fefae99-ed2a-4a8f-ab87-aa94c6bb6e68.100
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 27 11:29 8fefae99-ed2a-4a8f-ab87-aa94c6bb6e68.106
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 28 02:35 8fefae99-ed2a-4a8f-ab87-aa94c6bb6e68.137
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov  4  2018 9544617c-901c-4613-a94b-ccfad4e38af1.165
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov  4  2018 9544617c-901c-4613-a94b-ccfad4e38af1.168
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov  5  2018 9544617c-901c-4613-a94b-ccfad4e38af1.193
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov  6  2018 9544617c-901c-4613-a94b-ccfad4e38af1.3800
---------T 2 root libvirt-qemu       0 Jun 28 15:02 b48a5934-5e5b-4918-8193-6ff36f685f70.46559
-rw-rw---- 2 root libvirt-qemu       0 Oct 12  2018 c5bde2f2-3361-4d1a-9c88-28751ef74ce6.3568
-rw-r--r-- 2 root libvirt-qemu 4194304 Apr 13  2018 c953c676-152d-4826-80ff-bd307fa7f6e5.10724
-rw-r--r-- 2 root libvirt-qemu 4194304 Apr 11  2018 c953c676-152d-4826-80ff-bd307fa7f6e5.3101
--- cut here ---

--
Dave Sherohman
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users