<div dir="ltr"><div>Should be OK. But running the same version on both clients and servers is always the safest bet.<br><br></div>-Krutika<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Aug 22, 2016 at 10:39 AM, qingwei wei <span dir="ltr">&lt;<a href="mailto:tchengwee@gmail.com" target="_blank">tchengwee@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

I updated my client to 3.7.14 and I no longer experience the split-brain<br>

error. It seems there are some changes on the client side that fix<br>

this error. Please note that, for this test, I still used the 3.7.10<br>

server. Will it be an issue if the client and server use different<br>

versions?<br>

<br>

Cw<br>

<div class="HOEnZb"><div class="h5"><br>

On Tue, Aug 16, 2016 at 5:46 PM, Krutika Dhananjay &lt;<a href="mailto:kdhananj@redhat.com">kdhananj@redhat.com</a>&gt; wrote:<br>

&gt; 3.7.11 had quite a few bugs in afr and sharding+afr interop that were fixed<br>

&gt; in 3.7.12.<br>

&gt; Some of them were about files being reported as being in split-brain.<br>

&gt; Chances are that some of them existed in 3.7.10 as well - which is what<br>

&gt; you&#39;re using.<br>

&gt;<br>

&gt; Do you mind trying the same test with 3.7.12 or a later version?<br>

&gt;<br>

&gt; -Krutika<br>

&gt;<br>

&gt; On Tue, Aug 16, 2016 at 2:46 PM, qingwei wei &lt;<a href="mailto:tchengwee@gmail.com">tchengwee@gmail.com</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; Hi Niels,<br>

&gt;&gt;<br>

&gt;&gt; My situation is that when I physically unplug the HDD, the FIO<br>

&gt;&gt; application exits with an Input/Output error. However, when I echo<br>

&gt;&gt; offline to the disk, the FIO application freezes briefly but still<br>

&gt;&gt; manages to resume the IO workload after the freeze.<br>

&gt;&gt;<br>

&gt;&gt; From what I can see in the client log, the error is split-brain,<br>

&gt;&gt; which does not make sense as I still have 2 working replicas.<br>

&gt;&gt;<br>

&gt;&gt; [2016-08-12 10:33:41.854283] E [MSGID:<br>

&gt;&gt; 108008][afr-transaction.c:<wbr>1989:afr_transaction]<br>

&gt;&gt; 0-ad17hwssd7-replicate-0:<br>

&gt;&gt; Failing WRITE on gfid 665a43df-1ece-4c9a-a6ee-<wbr>fcfa960d95bf: split-brain<br>

&gt;&gt; observed. [Input/output error]<br>

&gt;&gt;<br>

&gt;&gt; Can anyone share their testing experience with this type of disruptive<br>

&gt;&gt; test on a sharded volume? Thanks!<br>

&gt;&gt;<br>

&gt;&gt; Regards,<br>

&gt;&gt;<br>

&gt;&gt; Cheng Wee<br>

&gt;&gt;<br>

&gt;&gt; On Tue, Aug 16, 2016 at 4:45 PM, Niels de Vos &lt;<a href="mailto:ndevos@redhat.com">ndevos@redhat.com</a>&gt; wrote:<br>

&gt;&gt; &gt; On Tue, Aug 16, 2016 at 01:34:36PM +0800, qingwei wei wrote:<br>

&gt;&gt; &gt;&gt; Hi,<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; I am currently testing the reliability of a distributed replicated<br>

&gt;&gt; &gt;&gt; volume (3 replicas) when 1 brick is down. I tried both a software unplug,<br>

&gt;&gt; &gt;&gt; by issuing echo offline &gt; /sys/block/sdx/device/state, and<br>

&gt;&gt; &gt;&gt; a physical unplug of the HDD, and I encountered 2 different outcomes.<br>

&gt;&gt; &gt;&gt; With the software unplug, the FIO workload continues to run, but with<br>

&gt;&gt; &gt;&gt; the physical unplug, the FIO workload cannot continue and fails with the<br>

&gt;&gt; &gt;&gt; following error:<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; [2016-08-12 10:33:41.854283] E [MSGID: 108008]<br>

&gt;&gt; &gt;&gt; [afr-transaction.c:1989:afr_<wbr>transaction] 0-ad17hwssd7-replicate-0:<br>

&gt;&gt; &gt;&gt; Failing WRITE on gfid 665a43df-1ece-4c9a-a6ee-<wbr>fcfa960d95bf:<br>

&gt;&gt; &gt;&gt; split-brain observed. [Input/output error]<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; From the server where I unplugged the disk, I can see the following:<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; [2016-08-12 10:33:41.916456] D [MSGID: 0]<br>

&gt;&gt; &gt;&gt; [io-threads.c:351:iot_<wbr>schedule] 0-ad17hwssd7-io-threads: LOOKUP<br>

&gt;&gt; &gt;&gt; scheduled as fast fop<br>

&gt;&gt; &gt;&gt; [2016-08-12 10:33:41.916666] D [MSGID: 115050]<br>

&gt;&gt; &gt;&gt; [server-rpc-fops.c:179:server_<wbr>lookup_cbk] 0-ad17hwssd7-server: 8127:<br>

&gt;&gt; &gt;&gt; LOOKUP /.shard/150e99ee-ce3b-4b57-<wbr>8c40-99b4ecdf3822.90<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; (be318638-e8a0-4c6d-977d-<wbr>7a937aa84806/150e99ee-ce3b-<wbr>4b57-8c40-99b4ecdf3822.90)<br>

&gt;&gt; &gt;&gt; ==&gt; (No such file or directory) [No such file or directory]<br>

&gt;&gt; &gt;&gt; [2016-08-12 10:33:41.916804] D [MSGID: 101171]<br>

&gt;&gt; &gt;&gt; [client_t.c:417:gf_client_<wbr>unref] 0-client_t:<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; hp.dctopenstack.org-25780-<wbr>2016/08/12-10:33:07:589960-<wbr>ad17hwssd7-client-0-0-0:<br>

&gt;&gt; &gt;&gt; ref-count 1<br>

&gt;&gt; &gt;&gt; [2016-08-12 10:33:41.917098] D [MSGID: 101171]<br>

&gt;&gt; &gt;&gt; [client_t.c:333:gf_client_ref] 0-client_t:<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; hp.dctopenstack.org-25780-<wbr>2016/08/12-10:33:07:589960-<wbr>ad17hwssd7-client-0-0-0:<br>

&gt;&gt; &gt;&gt; ref-count 2<br>

&gt;&gt; &gt;&gt; [2016-08-12 10:33:41.917145] W [MSGID: 115009]<br>

&gt;&gt; &gt;&gt; [server-resolve.c:571:server_<wbr>resolve] 0-ad17hwssd7-server: no<br>

&gt;&gt; &gt;&gt; resolution type for (null) (LOOKUP)<br>

&gt;&gt; &gt;&gt; [2016-08-12 10:33:41.917182] E [MSGID: 115050]<br>

&gt;&gt; &gt;&gt; [server-rpc-fops.c:179:server_<wbr>lookup_cbk] 0-ad17hwssd7-server: 8128:<br>

&gt;&gt; &gt;&gt; LOOKUP (null)<br>

&gt;&gt; &gt;&gt; (00000000-0000-0000-0000-<wbr>000000000000/150e99ee-ce3b-<wbr>4b57-8c40-99b4ecdf3822.90)<br>

&gt;&gt; &gt;&gt; ==&gt; (Invalid argument) [Invalid argument]<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; I am using gluster 3.7.10 and the configuration is as follows:<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; diagnostics.brick-log-level: DEBUG<br>

&gt;&gt; &gt;&gt; diagnostics.client-log-level: DEBUG<br>

&gt;&gt; &gt;&gt; performance.io-thread-count: 16<br>

&gt;&gt; &gt;&gt; client.event-threads: 2<br>

&gt;&gt; &gt;&gt; server.event-threads: 2<br>

&gt;&gt; &gt;&gt; features.shard-block-size: 16MB<br>

&gt;&gt; &gt;&gt; features.shard: on<br>

&gt;&gt; &gt;&gt; server.allow-insecure: on<br>

&gt;&gt; &gt;&gt; storage.owner-uid: 165<br>

&gt;&gt; &gt;&gt; storage.owner-gid: 165<br>

&gt;&gt; &gt;&gt; nfs.disable: true<br>

&gt;&gt; &gt;&gt; performance.quick-read: off<br>

&gt;&gt; &gt;&gt; performance.io-cache: off<br>

&gt;&gt; &gt;&gt; performance.read-ahead: off<br>

&gt;&gt; &gt;&gt; performance.stat-prefetch: off<br>

&gt;&gt; &gt;&gt; cluster.lookup-optimize: on<br>

&gt;&gt; &gt;&gt; cluster.quorum-type: auto<br>

&gt;&gt; &gt;&gt; cluster.server-quorum-type: server<br>

&gt;&gt; &gt;&gt; transport.address-family: inet<br>

&gt;&gt; &gt;&gt; performance.readdir-ahead: on<br>

&gt;&gt; &gt;&gt;<br>

&gt;&gt; &gt;&gt; This error only occurs with the sharding configuration. Have you performed<br>

&gt;&gt; &gt;&gt; this type of test before? And do you think physically unplugging the HDD is<br>

&gt;&gt; &gt;&gt; a valid test case?<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; If you use replica-3, things should settle down again. The kernel and<br>

&gt;&gt; &gt; the brick process need a little time to find out that the filesystem on<br>

&gt;&gt; &gt; the disk that you pulled out is not responding anymore. The output of<br>

&gt;&gt; &gt; &quot;gluster volume status&quot; should show that the brick process is offline.<br>

&gt;&gt; &gt; As long as you have quorum, things should continue after a small delay<br>

&gt;&gt; &gt; while waiting to mark the brick offline.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; People actually should test this scenario; power to disks can<br>

&gt;&gt; &gt; fail, or even (connections to) RAID-controllers. Hot-unplugging is<br>

&gt;&gt; &gt; definitely a scenario that can emulate real-world problems.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Niels<br>

&gt;&gt; ______________________________<wbr>_________________<br>

&gt;&gt; Gluster-devel mailing list<br>

&gt;&gt; <a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>

&gt;&gt; <a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/<wbr>mailman/listinfo/gluster-devel</a><br>

&gt;<br>

&gt;<br>

</div></div></blockquote></div><br></div>