<div dir="ltr">I changed quota-version=1 on the two new nodes, and was able to join the cluster. I also rebooted the two new nodes and everything came up correctly.<div><br></div><div>Then I triggered a rebalance fix-layout and one of the original cluster members (node gluster03) glusterd crashed. I restarted glusterd and was connected but after a few minutes I'm left with:</div><div><br></div><div><div># gluster peer status</div><div>Number of Peers: 5</div><div><br></div><div>Hostname: 10.0.231.51</div><div>Uuid: b01de59a-4428-486b-af49-cb486ab44a07</div><div>State: Peer in Cluster (Connected)</div><div><br></div><div>Hostname: 10.0.231.52</div><div>Uuid: 75143760-52a3-4583-82bb-a9920b283dac</div><div><b>State: Peer Rejected (Connected)</b></div><div><br></div><div>Hostname: 10.0.231.53</div><div>Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411</div><div>State: Peer in Cluster (Connected)</div><div><br></div><div>Hostname: 10.0.231.54</div><div>Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c</div><div>State: Peer in Cluster (Connected)</div><div><br></div><div>Hostname: 10.0.231.55</div><div>Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3</div><div>State: Peer in Cluster (Connected)</div></div><div><br></div><div>I see in the logs (attached) there is now a cksum error:</div><div><br></div><div><div>[2016-02-29 19:16:42.082256] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.55</div><div>[2016-02-29 19:16:42.082298] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.55 (0), ret: 0</div><div>[2016-02-29 19:16:42.092535] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411, host: 10.0.231.53, port: 0</div><div>[2016-02-29 19:16:42.096036] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-export-domain-storage/export-domain-storage on port 49153</div><div>[2016-02-29 19:16:42.097296] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-vm-storage/vm-storage on port 49155</div><div>[2016-02-29 19:16:42.100727] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700</div><div>[2016-02-29 19:16:42.108495] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411</div><div>[2016-02-29 19:16:42.109295] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. 
local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.53</div><div>[2016-02-29 19:16:42.109338] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.53 (0), ret: 0</div><div>[2016-02-29 19:16:42.119521] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/lv-env-modules/env-modules on port 49157</div><div>[2016-02-29 19:16:42.122856] I [MSGID: 106143] [glusterd-pmap.c:229:pmap_registry_bind] 0-pmap: adding brick /mnt/raid6-storage/storage on port 49156</div><div>[2016-02-29 19:16:42.508104] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: b01de59a-4428-486b-af49-cb486ab44a07, host: 10.0.231.51, port: 0</div><div>[2016-02-29 19:16:42.519403] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700</div><div>[2016-02-29 19:16:42.524353] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b01de59a-4428-486b-af49-cb486ab44a07</div><div>[2016-02-29 19:16:42.524999] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.51</div><div>[2016-02-29 19:16:42.525038] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.51 (0), ret: 0</div><div>[2016-02-29 19:16:42.592523] I [MSGID: 106493] [glusterd-rpc-ops.c:480:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c, host: 10.0.231.54, port: 0</div><div>[2016-02-29 19:16:42.599518] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700</div><div>[2016-02-29 19:16:42.604821] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c</div><div>[2016-02-29 19:16:42.605458] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.54</div><div>[2016-02-29 19:16:42.605492] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.0.231.54 (0), ret: 0</div><div>[2016-02-29 19:16:42.621943] I [MSGID: 106163] [glusterd-handshake.c:1193:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30700</div><div>[2016-02-29 19:16:42.628443] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a965e782-39e2-41cc-a0d1-b32ecccdcd2f</div><div>[2016-02-29 19:16:42.629079] E [MSGID: 106010] [glusterd-utils.c:2717:glusterd_compare_friend_volume] 0-management: Version of Cksums storage differ. local cksum = 50348222, remote cksum = 50348735 on peer 10.0.231.50</div></div><div><br></div><div>On gluster01/02/04/05</div><div>/var/lib/glusterd/vols/storage/cksum info=998305000<br></div><div><br></div><div>On gluster03</div><div><div>/var/lib/glusterd/vols/storage/cksum info=998305001</div></div><div><br></div><div>How do I recover from this? 
How do I recover from this? Can I just stop glusterd on gluster03 and change the cksum value?
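
(A possible alternative to hand-editing the cksum file, based on the 'Resolving Peer Rejected' write-up linked further down this thread, would be to resync the volume configuration on the out-of-sync node. This is a rough sketch only; the backup path and the peer probed below are examples, and everything under /var/lib/glusterd should be backed up before touching a node that hosts bricks:)

# On the out-of-sync node (gluster03 here), keep its UUID but drop the
# local volume definitions so they are re-fetched from the cluster.
cp -a /var/lib/glusterd /root/glusterd.bak.$(date +%F)
systemctl stop glusterd
find /var/lib/glusterd -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf {} +
systemctl start glusterd
gluster peer probe 10.0.231.50    # any peer that is still "in Cluster"
systemctl restart glusterd
gluster peer status               # should settle back to "Peer in Cluster"

(It might also be worth checking whether 'gluster volume sync <good-peer> all', run on the out-of-sync node, is enough here, since that avoids wiping the local configuration.)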
<div bgcolor="#FFFFFF" text="#000000"><div><div class="h5">
<br>
<br>
<div>On 02/26/2016 01:53 AM, Mohammed Rafi K
C wrote:<br>
</div>
<blockquote type="cite">
<br>
<br>
<div>On 02/26/2016 01:32 AM, Steve Dainard
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>I haven't done anything more than peer thus far, so I'm a
bit confused as to how the volume info fits in, can you
expand on this a bit?<br>
</div>
<div><br>
</div>
<div>Failed commits? Is this split brain on the replica
volumes? I don't get any return from 'gluster volume heal
<volname> info' on all the replica volumes, but if I
try a gluster volume heal <volname> full I get:
'Launching heal operation to perform full self heal on
volume <volname> has been unsuccessful'.</div>
</div>
</blockquote>
>>
>> Forget about this; it is not for metadata self-heal.
>>
<blockquote type="cite">
<div dir="ltr">
<div><br>
</div>
<div>I have 5 volumes total.</div>
<div><br>
</div>
<div>'Replica 3' volumes running on gluster01/02/03:</div>
<div>vm-storage</div>
<div>iso-storage</div>
<div>export-domain-storage</div>
<div>env-modules</div>
<div><br>
</div>
<div>And one distributed only volume 'storage' info shown
below:<br>
</div>
<div>
<div><br>
</div>
<div><b>From existing host gluster01/02:</b></div>
<div>
<div>type=0</div>
<div>count=4</div>
<div>status=1</div>
<div>sub_count=0</div>
<div>stripe_count=1</div>
<div>replica_count=1</div>
<div>disperse_count=0</div>
<div>redundancy_count=0</div>
<div>version=25</div>
<div>transport-type=0</div>
<div>volume-id=26d355cb-c486-481f-ac16-e25390e73775</div>
<div>username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c</div>
<div>password=</div>
<div>op-version=3</div>
<div>client-op-version=3</div>
<div>quota-version=1</div>
<div>parent_volname=N/A</div>
<div>restored_from_snap=00000000-0000-0000-0000-000000000000</div>
<div>snap-max-hard-limit=256</div>
<div>features.quota-deem-statfs=on</div>
<div>features.inode-quota=on</div>
<div>diagnostics.brick-log-level=WARNING</div>
<div>features.quota=on</div>
<div>performance.readdir-ahead=on</div>
<div>performance.cache-size=1GB</div>
<div>performance.stat-prefetch=on</div>
<div>brick-0=10.0.231.50:-mnt-raid6-storage-storage</div>
<div>brick-1=10.0.231.51:-mnt-raid6-storage-storage</div>
<div>brick-2=10.0.231.52:-mnt-raid6-storage-storage</div>
<div>brick-3=10.0.231.53:-mnt-raid6-storage-storage</div>
</div>
<div><br>
</div>
<div>
<div><b>From existing host gluster03/04:</b><br>
</div>
<div>
<div>type=0</div>
<div>count=4</div>
<div>status=1</div>
<div>sub_count=0</div>
<div>stripe_count=1</div>
<div>replica_count=1</div>
<div>disperse_count=0</div>
<div>redundancy_count=0</div>
<div>version=25</div>
<div>transport-type=0</div>
<div>volume-id=26d355cb-c486-481f-ac16-e25390e73775</div>
<div>username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c</div>
<div>password=</div>
<div>op-version=3</div>
<div>client-op-version=3</div>
<div>quota-version=1</div>
<div>parent_volname=N/A</div>
<div>restored_from_snap=00000000-0000-0000-0000-000000000000</div>
<div>snap-max-hard-limit=256</div>
<div>features.quota-deem-statfs=on</div>
<div>features.inode-quota=on</div>
<div>performance.stat-prefetch=on</div>
<div>performance.cache-size=1GB</div>
<div>performance.readdir-ahead=on</div>
<div>features.quota=on</div>
<div>diagnostics.brick-log-level=WARNING</div>
<div>brick-0=10.0.231.50:-mnt-raid6-storage-storage</div>
<div>brick-1=10.0.231.51:-mnt-raid6-storage-storage</div>
<div>brick-2=10.0.231.52:-mnt-raid6-storage-storage</div>
<div>brick-3=10.0.231.53:-mnt-raid6-storage-storage</div>
</div>
<div><br>
</div>
<div>So far between gluster01/02 and gluster03/04 the
configs are the same, although the ordering is different
for some of the features.</div>
<div><br>
</div>
<div>On gluster05/06 the ordering is different again, and
the quota-version=0 instead of 1.</div>
</div>
</div>
</div>
</blockquote>
>>
>> This is why the peer shows as rejected. Can you check the op-version of
>> all the glusterd instances, including the one which is in the rejected
>> state? You can find the op-version in /var/lib/glusterd/glusterd.info.
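
(For completeness, a quick way to compare that value across all six nodes; this is just an illustration, assuming ssh access to each host, and that the op-version is stored in the operating-version field of glusterd.info on 3.7.x:)

for h in 10.0.231.{50..55}; do
    echo -n "$h: "
    ssh "$h" grep operating-version /var/lib/glusterd/glusterd.info
done
# All nodes should report the same operating-version.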
>
> If all the op-versions are the same and you are on 3.7.6, then to work
> around the issue you can manually set quota-version=1, and restarting
> glusterd will solve the problem. But I would strongly recommend that you
> figure out the RCA. Maybe you can file a bug for this.
>
> Rafi
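
(This is the work-around referred to at the top of this mail; roughly, on each node whose info file still showed quota-version=0, something along these lines, with the file backed up first and the volume name adjusted as needed:)

systemctl stop glusterd
cp -a /var/lib/glusterd/vols/storage/info /var/lib/glusterd/vols/storage/info.bak
sed -i 's/^quota-version=0$/quota-version=1/' /var/lib/glusterd/vols/storage/info
systemctl start glusterd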
>>
>> Rafi KC
>>
>>> From new hosts gluster05/gluster06:
>>> type=0
>>> count=4
>>> status=1
>>> sub_count=0
>>> stripe_count=1
>>> replica_count=1
>>> disperse_count=0
>>> redundancy_count=0
>>> version=25
>>> transport-type=0
>>> volume-id=26d355cb-c486-481f-ac16-e25390e73775
>>> username=eb9e2063-6ba8-4d16-a54f-2c7cf7740c4c
>>> password=
>>> op-version=3
>>> client-op-version=3
>>> quota-version=0
>>> parent_volname=N/A
>>> restored_from_snap=00000000-0000-0000-0000-000000000000
>>> snap-max-hard-limit=256
>>> performance.stat-prefetch=on
>>> performance.cache-size=1GB
>>> performance.readdir-ahead=on
>>> features.quota=on
>>> diagnostics.brick-log-level=WARNING
>>> features.inode-quota=on
>>> features.quota-deem-statfs=on
>>> brick-0=10.0.231.50:-mnt-raid6-storage-storage
>>> brick-1=10.0.231.51:-mnt-raid6-storage-storage
>>> brick-2=10.0.231.52:-mnt-raid6-storage-storage
>>> brick-3=10.0.231.53:-mnt-raid6-storage-storage
>>>
>>> Also, I forgot to mention that when I initially peered the two new
>>> hosts, glusterd crashed on gluster03 and had to be restarted (log
>>> attached), but it has been fine since.
>>>
>>> Thanks,
>>> Steve
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, Feb 25, 2016 at 11:27 AM,
Mohammed Rafi K C <span dir="ltr"><<a href="mailto:rkavunga@redhat.com" target="_blank">rkavunga@redhat.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span> <br>
<br>
<div>On 02/25/2016 11:45 PM, Steve Dainard wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hello,<br>
<br>
I upgraded from 3.6.6 to 3.7.6 a couple weeks ago.
I just peered 2 new nodes to a 4 node cluster and
gluster peer status is:<br>
<br>
# gluster peer status <b><-- from node
gluster01</b><br>
Number of Peers: 5<br>
<br>
Hostname: 10.0.231.51<br>
Uuid: b01de59a-4428-486b-af49-cb486ab44a07<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: 10.0.231.52<br>
Uuid: 75143760-52a3-4583-82bb-a9920b283dac<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: 10.0.231.53<br>
Uuid: 2c0b8bb6-825a-4ddd-9958-d8b46e9a2411<br>
State: Peer in Cluster (Connected)<br>
<br>
Hostname: 10.0.231.54 <b><-- new node
gluster05</b><br>
Uuid: 408d88d6-0448-41e8-94a3-bf9f98255d9c<br>
<b>State: Peer Rejected (Connected)</b><br>
<br>
Hostname: 10.0.231.55 <b><-- new node gluster06</b><br>
Uuid: 9c155c8e-2cd1-4cfc-83af-47129b582fd3<br>
<b>State: Peer Rejected (Connected)</b><br>
</div>
</blockquote>
>>>>
>>>> It looks like your configuration files are mismatched, i.e. the
>>>> checksum calculation on these two nodes differs from the others.
>>>>
>>>> Did you have any failed commits?
>>>>
>>>> Compare /var/lib/glusterd/vols/<volname>/info on the failed node
>>>> against a good one; most likely you will see some difference.
>>>>
>>>> Can you paste /var/lib/glusterd/vols/<volname>/info?
>>>>
>>>> Regards,
>>>> Rafi KC
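
(A quick way to do that comparison from the rejected node; a sketch only, where 10.0.231.50 stands in for any known-good peer and 'storage' for the volume name:)

# Diff the local info file against the copy on a known-good peer.
diff <(ssh 10.0.231.50 cat /var/lib/glusterd/vols/storage/info) \
     /var/lib/glusterd/vols/storage/info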
>>>>>
>>>>> I followed the write-up here:
>>>>> http://www.gluster.org/community/documentation/index.php/Resolving_Peer_Rejected
>>>>> and the two new nodes peered properly, but after a reboot of the two
>>>>> new nodes I'm seeing the same Peer Rejected (Connected) state.
>>>>>
>>>>> I've attached logs from an existing node and the two new nodes.
>>>>>
>>>>> Thanks for any suggestions,
>>>>> Steve
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users@gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-users