Re: Debugging georeplication failures

Aravinda <avishwan@xxxxxxxxxx> · Wed, 25 Nov 2015 11:04:33 +0530



    One more thing,

    
    You need not worry too much about the SKIPPED_GFIDs list. When an entry
    operation fails, Geo-rep is unable to create the entry, and the
    subsequent rsync fails for that file. However, all the GFIDs that were
    in the same batch are logged as failures, which is not accurate: rsync
    does a partial sync, skipping the failed GFIDs and syncing the rest of
    the files. I am working on fixing the logging issue.
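
For anyone who wants to collect the GFIDs named in the SKIPPED GFID log lines, here is a minimal Python sketch. It assumes each log record sits on a single line in the master's geo-replication log file. The function name `skipped_gfids` and the sample text are illustrative only and are not part of gsyncd.

```python
import re

# Canonical 8-4-4-4-12 lowercase hex GFID, as printed in the geo-rep log.
GFID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def skipped_gfids(log_text):
    """Return the unique GFIDs from 'SKIPPED GFID =' lines, in first-seen order."""
    seen = {}
    for line in log_text.splitlines():
        # Only look at the text that follows the marker on the same line.
        _, marker, tail = line.partition("SKIPPED GFID =")
        if not marker:
            continue
        for gfid in GFID_RE.findall(tail):
            seen.setdefault(gfid, None)
    return list(seen)


# Hypothetical sample built from one record quoted in this thread.
sample = (
    "[2015-11-20 11:41:35.907850] W [master(/data/media):1014:process] "
    "_GMaster: SKIPPED GFID = 03069c7f-8eaa-45b0-92ed-50cb648cd912,"
    "788f5ed1-923e-4b86-9696-2a6de07ebb2e,"
    "43d12b40-b6e2-43c4-8883-85e89dc81321\n"
    "[2015-11-20 11:40:14.93090] W [master(/data/media):803:log_failures] "
    "_GMaster: ENTRY FAILED: (...)\n"
)

for gfid in skipped_gfids(sample):
    print(gfid)
```

Keep in mind that a GFID appearing in this list was not necessarily left unsynced; only the entries that also show up in ENTRY FAILED lines are known failures.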

    regards
Aravinda
    On 11/25/2015 10:51 AM, Aravinda wrote:

    
    Hi,
      

      This looks like a GFID conflict on the Slave: the same filename
      exists on the Slave with a different GFID, left undeleted possibly
      because of an unlink failure or some other failure.
      

      We need to identify the cause of the GFID conflict. Please share the
      workload details, or share the changelogs from the brick
      backend (/data/media/.glusterfs/changelogs).
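
To inspect a particular GFID on a brick backend, the usual layout is `.glusterfs/<first two hex characters>/<next two>/<gfid>` under the brick root. The sketch below shows that mapping. The helper name `gfid_backend_path` is hypothetical, and `/data/media` is the master brick path mentioned in this thread, so substitute the Slave's own brick path when checking the Slave.

```python
import posixpath
import uuid


def gfid_backend_path(brick_root, gfid):
    """Build the .glusterfs handle path for a GFID under a brick root."""
    # uuid.UUID validates the string; str() yields the lowercase canonical form.
    canonical = str(uuid.UUID(gfid))
    return posixpath.join(
        brick_root, ".glusterfs", canonical[:2], canonical[2:4], canonical
    )


# GFID taken from the ENTRY FAILED record in this thread.
print(gfid_backend_path("/data/media", "31d66429-c700-4a10-bb32-35e1b36a479f"))
```

For a regular file that path is normally a hard link to the file itself, so `ls -li` or `find -samefile` on the brick can show which name it belongs to.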
      

      The "ENTRY FAILED" log shows a "file exists" error (errno 17), but
      with a different GFID:

      [2015-11-20 11:40:14.93090] W [master(/data/media):803:log_failures] _GMaster: ENTRY FAILED: ({'uid': 33, 'gfid': '31d66429-c700-4a10-bb32-35e1b36a479f', 'gid': 33, 'mode': 33206, 'entry': '.gfid/b1dc6c6d-dac7-4da9-9577-4614942a72a0/official-nightmare-before-christmas-vampire-teddy-girls-dress-body-web.jpg', 'op': 'CREATE'}, *17, 'df0e67f5-f2ce-45c3-b4f1-224aa3059ec7'*)
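
The tuple printed after "ENTRY FAILED:" is a Python literal, so it can be decoded mechanically. The following is only a sketch: `decode_failure` is a made-up helper, and treating the third element as the GFID that conflicts with the incoming one is an interpretation of the record above, not something taken from the gsyncd source. The emphasis asterisks must be removed before parsing.

```python
import ast
import errno
import stat


def decode_failure(payload):
    """Decode the tuple that follows 'ENTRY FAILED: ' or 'META FAILED: '."""
    record = ast.literal_eval(payload)
    details, err = record[0], record[1]
    decoded = {
        "op": details.get("op"),
        "gfid": details.get("gfid"),
        # Symbolic errno name, e.g. 17 -> EEXIST and 2 -> ENOENT on Linux.
        "errno_name": errno.errorcode.get(err, str(err)),
        # Assumed meaning: the GFID reported alongside the error, if any.
        "other_gfid": record[2] if len(record) > 2 else None,
    }
    if "mode" in details:
        # Render the numeric st_mode the way ls -l would.
        decoded["mode"] = stat.filemode(details["mode"])
    return decoded


# Payload copied from the ENTRY FAILED record, without the emphasis markers.
payload = (
    "({'uid': 33, 'gfid': '31d66429-c700-4a10-bb32-35e1b36a479f', 'gid': 33, "
    "'mode': 33206, 'entry': '.gfid/b1dc6c6d-dac7-4da9-9577-4614942a72a0/"
    "official-nightmare-before-christmas-vampire-teddy-girls-dress-body-web.jpg', "
    "'op': 'CREATE'}, 17, 'df0e67f5-f2ce-45c3-b4f1-224aa3059ec7')"
)

info = decode_failure(payload)
print(info["errno_name"], info["mode"], info["other_gfid"])
```

Comparing `gfid` with `other_gfid` makes the mismatch explicit: the CREATE carries one GFID while the error reports another.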
      

      It also looks like there are split-brain issues on the Slave. Refer to
      this document to resolve them:
      

https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md
      

      regards
      

      Aravinda
      

      On 11/25/2015 03:08 AM, Audrius Butkevicius wrote:
      

        So the version of rsync is 3.1.0, but the bug mentioned only applies to
        large files, whereas in my case the files are less than a MB.

        I've started digging through the logs and found a bunch of these on the
        slave:
        

        [2015-11-20 11:40:46.730805] W [fuse-bridge.c:1978:fuse_create_cbk] 0-glusterfs-fuse: 1882288: /.gfid/31d66429-c700-4a10-bb32-35e1b36a479f => -1 (Operation not permitted)
        [2015-11-20 12:39:59.269844] W [fuse-bridge.c:1978:fuse_create_cbk] 0-glusterfs-fuse: 1918306: /.gfid/6802a0c6-1f62-4213-a70d-7b46d9ff8f3a => -1 (Operation not permitted)
        

        So something funky was happening for an hour 4 days ago. Given that the
        volume is on EBS, maybe there was some glitch there.

        I can also find the corresponding failures on the master:
        

        [2015-11-20 11:40:14.93090] W [master(/data/media):803:log_failures] _GMaster: ENTRY FAILED: ({'uid': 33, 'gfid': '31d66429-c700-4a10-bb32-35e1b36a479f', 'gid': 33, 'mode': 33206, 'entry': '.gfid/b1dc6c6d-dac7-4da9-9577-4614942a72a0/official-nightmare-before-christmas-vampire-teddy-girls-dress-body-web.jpg', 'op': 'CREATE'}, 17, 'df0e67f5-f2ce-45c3-b4f1-224aa3059ec7')
        [2015-11-20 11:40:14.265054] W [master(/data/media):803:log_failures] _GMaster: META FAILED: ({'go': '.gfid/31d66429-c700-4a10-bb32-35e1b36a479f', 'stat': {'atime': 1448019600.232466, 'gid': 33, 'mtime': 1448019600.316466, 'mode': 33279, 'uid': 33}, 'op': 'META'}, 2)
        

        If I grep for SKIPPED GFID I get the following:
        

        [2015-11-20 11:40:40.704817] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID = 192632af-28c5-4e03-a62d-458fe7f3b5f9,7ea8d7a8-524b-4dd0-b97a-dc7d3481f341,204f6112-0e8d-4f6d-855b-bf10f9c63b62,7e626e8f-edad-4f39-a6c6-547a1da34aa1,1f0d0208-1962-4eb1-91d4-cf7ed297d8e3,95d389c4-3258-4ca0-8fc4-26b8427b1eaf,425cedc6-6343-4326-8540-996d2d56dc9c,5955928b-2b8f-4cc9-a336-3eac4382789b,8932efcd-ba90-46ec-84c8-5e9e51cc84e9,2530275d-5f03-4143-9abf-d07cc79bf80a,73574466-86f3-4ab2-b5da-c31ac28c27c1,776e5e8f-5c6a-46b1-ad54-733e157d2097,008a69f3-217c-4dbc-a469-5a5bc8ecd589,dca8d8d9-03cf-4793-92e4-bfcfddd262f6,c85b7a29-73af-4f44-a07e-a44082d7a93a,6c1f56d6-4ea6-4910-9677-ea33edd35d28,0ea56588-87fa-4355-9403-e311525454fc,c8ce76c9-e21d-46ce-a2b5-14dfd0070f64,db9e6484-0e5e-4f6e-815b-3c2b273deee5,35d10752-43b5-4398-be5f-17cb9de73a6b,396e5faf-74a1-4849-97e3-009dbfb22836,d148e7d5-c2f3-4d06-8cd6-8588e6aac196,404d20c5-1c6c-4aad-98be-2c23930173b3,f1fae11c-db8e-4cd5-8e47-a3870316f89c,d8daa413-e57f-44fb-b907-b1a497f2dcfa,5f6ee8c2-84fb-432e-95cd-e428ab256e83,6bf54dcd-c3b4-4187-a390-eca841e46570,335c07ca-d339-4d3a-aa88-3b5753d24fbf,8fdbac00-6628-4f22-8fb4-b7a6524cae49,31d66429-c700-4a10-bb32-35e1b36a479f
        

        [2015-11-20 11:41:35.907850] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID = 03069c7f-8eaa-45b0-92ed-50cb648cd912,788f5ed1-923e-4b86-9696-2a6de07ebb2e,43d12b40-b6e2-43c4-8883-85e89dc81321
        

        [2015-11-20 12:11:55.492068] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID = eb02369f-7ca8-480a-b00c-768964410ed8,17045ac9-27dd-4bf9-9f90-d7b146070dd5,265e3d9c-1657-45cb-bbf6-db439eb18ccf,553c420f-b3cc-47f2-8d5f-cfc2ffdd1a92
        

        [2015-11-20 12:12:53.372432] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID = 66c5878e-8c00-4f7d-a3ad-4adec84a5e22,f4dc086d-9c2b-449c-9e31-bbae9ebcdea7,f99317b2-72e8-49e3-b676-647abad508b1
        

        [2015-11-20 12:37:55.773813] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID = 4af54f1c-e8e1-4915-9328-a458d5d35d5d,acbe1f12-87e8-4192-b864-d90030269bba,7d27a795-da63-4742-9e91-abd8fa543612,8d4e642d-fd40-44d6-8419-8d3459df7ce3
        

        [2015-11-20 12:39:28.852575] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID = d90dc121-02e7-4a79-bc03-1bd8fddd9f48,54bb563f-ab44-4e91-a46b-764a122ce7fa,088141de-7545-40f9-b776-751738a89740,2dab3faf-4a6c-407a-88cd-cddef6f55299,d887806f-23b4-4389-a4dc-f9027702a2df,fc5a9bc8-ea62-4677-baed-16510541373a,33136ad2-c5b4-448c-991d-1e72fefef021,cf3e2675-e41b-4782-9478-91773eb0a4aa,6412d878-e0f1-4700-84df-05f4af35962f,ec3cf6e1-7f27-4650-b978-8a5a7f620389,d3651bb9-cd2d-4c5f-93e6-fe4fb1cdf5db,ecb0415e-1524-40f4-870e-1fd0f8371b1d,a118aaae-bd3e-4b19-a0e0-891aa9edb09a,7642d3f3-f1e5-4aca-bcfe-bdb3c44779a9,2e29f3f8-c460-48eb-9db5-b281b67cc2bf,e61db54b-3979-488a-8789-a5d0615c5a97,4212d840-9c22-4d9e-b61b-5e35271dfe80,dad1c60b-9da6-4e57-b014-daa1aca73ce3,93699a3d-40b8-4bbd-b78f-aabf965df57f,4fad7468-91f2-4deb-aaf7-6401068c9e6d,c9738295-46cc-4fe7-b359-dc94f5815ce9,91853c5c-4877-4c9e-9481-c86368942f78,59deed8e-d3d0-4ab7-854e-53a8dd455de0,20b86c13-7df1-4d13-bac1-7d628a00d6ce,b7b86a2d-7963-41a4-a423-14e25d1e78c4,3c17d7fe-bb7f-489c-a525-5c8b7bb93c3e,e230d207-7c68-4983-a958-f2dcfc1ce694,fa8bf3c0-abae-446c-83c5-45ef8bcaa4b8,14089102-8106-45d9-a3f1-d1446b568f4e,6802a0c6-1f62-4213-a70d-7b46d9ff8f3a,0a253bbc-ef98-4da0-951f-e17c5a7f5858,ef054b76-986b-4a89-b8e6-b4988221aaa2,48c0a153-708c-44ee-b186-cf255936a02b,fa2646a6-807c-4e9d-8f2b-a9cdf2674e0c,1ed4a563-4f6a-4b5a-9866-89025fe7afd5,0f293cf7-bc32-4f8a-87d5-388a4bffb4af,f4126726-667b-451d-8214-a18bb3f468cd,e23dc8b3-da1c-4d18-aec9-22e0aa174d81,40b9f10d-7304-4c0b-8498-bef23b305d03,15c25d1e-2a62-495e-887f-14d0cb0527b1,67371804-9084-4801-b664-44e88bea8ac3,4750fa3f-d1a4-4472-b10d-3f75d0b451dc
        

        [2015-11-23 09:18:10.43391] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID = 228843f3-62f0-4687-b5eb-6d1e21257ad0,b0078359-fbf0-4709-8f40-8383a11d7875,60cff4d5-8b5d-4f7f-8bc1-27081a011458,bedb6ac4-208d-47e1-812c-5547c84ab841,da6810d9-4883-45e1-b73e-55a7ff17b5e7,e03b5c03-b25c-49ba-86f0-8a709a9c2658,053673a0-c1cc-4057-83fa-f97740cb5d4f,dbd6ea84-8f24-4a47-ac41-22c3fd788ecf,43caa3e7-ca04-47ab-b950-105606b313a4,62d8b1d0-fc89-4fb1-a41a-957dcb34d325,4e8fe1fa-60cd-47fa-bad6-f617c312f53b,6c3d6cf3-62ae-4ab8-9dc3-7815552401fe,f79be814-7e78-4985-bcdd-688da23d1808,c4186455-0f06-4b5d-89be-3c5ccbdeb6f0,f9c4ccdb-2337-479d-845d-ee4d85b69ece,bcd14726-1bab-4d97-8915-ec8bbe8faf8c,cca82341-a430-4a59-a900-1af66dcf7bb8,b7043a8e-4286-4831-91ec-c146e40bc6be,995ffeb6-a906-4078-88c6-404a2b38aad4,227f9987-5057-4133-848a-2b22aca5dde1,90b35242-32db-4570-8070-cf9dd49322a5,c6863c8f-1914-4a2d-814b-6e5853134faf,e2d19b1a-fc07-441c-b110-ca816b46fc40,9a3d0c0b-7d84-416f-9f3e-21b32a11ba1d,d8163f6b-8c40-418c-9c06-b3743af24e4e,522d7247-a75b-4af9-acb2-52a99eeced89,4b56ea9d-413a-4e24-b44e-433f7603ad6d
        

        There are also the following lines on the master, which might have some
        impact:

        E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-media-replicate-0: Failing READ on gfid abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed. [Input/output error]
        E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-media-replicate-0: Failing GETXATTR on gfid abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed. [Input/output error]
        E [mem-pool.c:417:mem_get0] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x809a2) [0x7f79e436b9a2] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f) [0x7f79e430cb1f] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81) [0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid argument]
        E [mem-pool.c:417:mem_get0] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f79e4329b32] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f) [0x7f79e430cb1f] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81) [0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid argument]
        

        E [resource(/data/media):222:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-dpY5cI/8216bb7da58a00926f369bb7ac8c7e03.sock root@xxxxxxxxxxxxxxxxxxxxxxxxxx /usr/lib/x86_64-linux-gnu/glusterfs/gsyncd --session-owner 6922055e-49a1-4afd-a3a0-a47960d6ba54 -N --listen --timeout 120 gluster://localhost:media" returned with 143, saying:
        E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.772896] I [cli.c:721:main] 0-cli: Started running /usr/sbin/gluster with version 3.7.5
        E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.772955] I [cli.c:608:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost
        E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.871930] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
        E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.872018] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now
        E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.872898] I [cli-rpc-ops.c:6348:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
        E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.872963] I [input.c:36:cli_batch] 0-: Exiting with: 0
        

        Status detail shows the following:
        

        root@eu-gluster-1:/var/log/glusterfs/geo-replication/media#
        gluster volume
        

        geo-replication media
        root@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx::media
        

        status detail
        

        MASTER NODE                            MASTER VOL    MASTER
        BRICK    SLAVE
        

        USER    SLAVE                                            SLAVE
        NODE
        

                                STATUS     CRAWL STATUS      
        LAST_SYNCED
        

          ENTRY    DATA    META    FAILURES    CHECKPOINT TIME   
        CHECKPOINT
        

        COMPLETED    CHECKPOINT COMPLETION TIME
        

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        

        eu-gluster-1.websitewebsitewebs.com    media        
        /data/media     root
        

                us-west-gluster.websitewebsitewebs.com::media
        

        us-west-gluster.websitewebsitewebs.com    Active     Changelog
        Crawl
        

          2015-11-24 20:59:25    0        0       0       633        
        N/A
        

              N/A                     N/A
        

        eu-gluster-2.websitewebsitewebs.com    media        
        /data/media     root
        

                us-west-gluster.websitewebsitewebs.com::media
        

        us-west-gluster.websitewebsitewebs.com    Passive   
        N/A                N/A
        

                            N/A      N/A     N/A     N/A         N/A
        

          N/A                     N/A
        

        What is the right way to retry failed items?

        Can I get a list of them somehow, so that I could touch them in the
        hope of fixing this?

        I wonder why it does not retry the items automatically.
        

        On Tue, Nov 24, 2015 at 6:11 AM, Venky Shankar <vshankar@xxxxxxxxxx> wrote:

          On Tue, Nov 24, 2015 at 1:23 AM, Audrius Butkevicius <audrius.butkevicius@xxxxxxxxx> wrote:
          

            Hi,

            I've got a geo-replicated gluster volume with a few hundred thousand
            images, which get generated on demand.

            I started getting replication failures in the status detail view, but
            it's not obvious to me where to find the actual errors or how to
            actually fix them.

          Chris here[1] mentioned a bug in rsync (thanks!). Could that be the
          issue here? Mind checking the rsync version used?

          [1]: http://www.gluster.org/pipermail/gluster-users/2015-November/024423.html

            The docs seem to be secretive about this as well. It seems that if I
            tear the geo-replication down and do a force create from scratch, it
            goes back in sync again, but as the files get generated, it starts
            getting failures again at some point.

            Can someone provide me with information on how to check which files
            are causing failures, and what the actual failures are? Or point me
            to the relevant part in the docs?

            Version 3.7.5-ubuntu1~trusty1

            Related SO question:
            http://stackoverflow.com/questions/33839056/gluster-geo-replication-debugging-failures

            Thanks,
            Audrius.
            

            _______________________________________________
            

            Gluster-users mailing list
            

            Gluster-users@xxxxxxxxxxx
            

            http://www.gluster.org/mailman/listinfo/gluster-users
            
