On Tue, Aug 16, 2016 at 01:34:36PM +0800, qingwei wei wrote:
> Hi,
>
> I am currently trying to test the distributed replica (3 replicas)
> reliability when 1 brick is down. I tried both a software unplug
> method, by issuing echo offline > /sys/block/sdx/device/state, and
> physically unplugging the HDD, and I encountered 2 different
> outcomes. With the software unplug, the FIO workload continues to
> run, but after physically unplugging the HDD, the FIO workload
> cannot continue and fails with the following error:
>
> [2016-08-12 10:33:41.854283] E [MSGID: 108008]
> [afr-transaction.c:1989:afr_transaction] 0-ad17hwssd7-replicate-0:
> Failing WRITE on gfid 665a43df-1ece-4c9a-a6ee-fcfa960d95bf:
> split-brain observed. [Input/output error]
>
> On the server where I unplugged the disk, I can see the following:
>
> [2016-08-12 10:33:41.916456] D [MSGID: 0]
> [io-threads.c:351:iot_schedule] 0-ad17hwssd7-io-threads: LOOKUP
> scheduled as fast fop
> [2016-08-12 10:33:41.916666] D [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-ad17hwssd7-server: 8127:
> LOOKUP /.shard/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90
> (be318638-e8a0-4c6d-977d-7a937aa84806/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90)
> ==> (No such file or directory) [No such file or directory]
> [2016-08-12 10:33:41.916804] D [MSGID: 101171]
> [client_t.c:417:gf_client_unref] 0-client_t:
> hp.dctopenstack.org-25780-2016/08/12-10:33:07:589960-ad17hwssd7-client-0-0-0:
> ref-count 1
> [2016-08-12 10:33:41.917098] D [MSGID: 101171]
> [client_t.c:333:gf_client_ref] 0-client_t:
> hp.dctopenstack.org-25780-2016/08/12-10:33:07:589960-ad17hwssd7-client-0-0-0:
> ref-count 2
> [2016-08-12 10:33:41.917145] W [MSGID: 115009]
> [server-resolve.c:571:server_resolve] 0-ad17hwssd7-server: no
> resolution type for (null) (LOOKUP)
> [2016-08-12 10:33:41.917182] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-ad17hwssd7-server: 8128:
> LOOKUP (null) (00000000-0000-0000-0000-000000000000/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90)
> ==> (Invalid argument) [Invalid argument]
>
> I am using gluster 3.7.10 and the configuration is as follows:
>
> diagnostics.brick-log-level: DEBUG
> diagnostics.client-log-level: DEBUG
> performance.io-thread-count: 16
> client.event-threads: 2
> server.event-threads: 2
> features.shard-block-size: 16MB
> features.shard: on
> server.allow-insecure: on
> storage.owner-uid: 165
> storage.owner-gid: 165
> nfs.disable: true
> performance.quick-read: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.stat-prefetch: off
> cluster.lookup-optimize: on
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> transport.address-family: inet
> performance.readdir-ahead: on
>
> This error only occurs with the sharding configuration. Have you
> guys performed this type of test before? Or do you think physically
> unplugging the HDD is a valid test case?

If you use replica-3, things should settle down again. The kernel and
the brick process need a little time to find out that the filesystem
on the disk that you pulled out is not responding anymore. The output
of "gluster volume status" should show that the brick process is
offline. As long as you have quorum, things should continue after a
small delay while waiting to mark the brick offline.

People actually should test this scenario; it can happen that power to
disks fails, or even (connections to) RAID-controllers. Hot-unplugging
is definitely a scenario that can emulate real-world problems.

Niels
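
[Editor's note: a minimal sketch of how such a test might be driven and
verified. The volume name ad17hwssd7 is taken from the logs above; sdX
is a placeholder for the affected disk. Writing to
/sys/block/sdX/device/delete removes the SCSI device from the system,
which is closer to a physical hot-unplug than marking it offline.]

    # Soft failure: mark the device offline (what the original test did)
    echo offline > /sys/block/sdX/device/state

    # Harder failure: delete the SCSI device, closer to pulling the disk
    echo 1 > /sys/block/sdX/device/delete

    # After the failure, confirm the brick process is reported offline
    gluster volume status ad17hwssd7

    # Check pending heals and whether any files are flagged split-brain
    gluster volume heal ad17hwssd7 info
    gluster volume heal ad17hwssd7 info split-brain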