<p dir="ltr">I have had multiple issues with geo-replication.  It seems to work OK initially, the replica gets up to date, and not long after (e.g. a couple of days), the replication goes into a faulty state and won&#39;t get out of it.<br></p>
<p dir="ltr">I have tried a few times now, and last attempt I re-created the slave volume and setup the replication again.  Same symptoms again.</p>
<p dir="ltr"> </p>
<p dir="ltr">I use Gluster 3.7.3, and you will find my setup and log messages at the bottom of the email.</p>
<p dir="ltr"> </p>
<p dir="ltr">Any idea what could cause this and how to fix it?</p>
<p dir="ltr">Thanks,<br>
Thibault.</p>
<p dir="ltr">ps: my setup and log messages:</p>
<p dir="ltr"> </p>
<p dir="ltr">Master:</p>
<p dir="ltr"> </p>
<p dir="ltr">Volume Name: home<br>
Type: Replicate<br>
Volume ID: 2299a204-a1dc-449d-8556-bc65197373c7<br>
Status: Started<br>
Number of Bricks: 1 x 2 = 2<br>
Transport-type: tcp<br>
Bricks:<br>
Brick1: server4.uberit.net:/gluster/home-brick-1<br>
Brick2: server5.uberit.net:/gluster/home-brick-1<br>
Options Reconfigured:<br>
performance.readdir-ahead: on<br>
geo-replication.indexing: on<br>
geo-replication.ignore-pid-check: on<br>
changelog.changelog: on</p>
<p dir="ltr"> </p>
<p dir="ltr">Slave:</p>
<p dir="ltr">Volume Name: homegs<br>
Type: Distribute<br>
Volume ID: 746dfdc3-650d-4468-9fdd-d621dd215b94<br>
Status: Started<br>
Number of Bricks: 1<br>
Transport-type: tcp<br>
Bricks:<br>
Brick1: remoteserver1.uberit.net:/gluster/homegs-brick-1/brick<br>
Options Reconfigured:<br>
performance.readdir-ahead: on</p>
<p dir="ltr"> </p>
<p dir="ltr">The geo-replication status and config (I think I ended up with only defaults values) are:</p>
<p dir="ltr"> </p>
<p dir="ltr"># gluster volume geo-replication home ssh://remoteserver1::homegs status</p>
<p dir="ltr">MASTER NODE       MASTER VOL    MASTER BRICK             SLAVE USER    SLAVE                           SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED<br>
---------------------------------------------------------------------------------------------------------------------------------------------------------<br>
server5          home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs      N/A           Faulty    N/A             N/A<br>
server4          home          /gluster/home-brick-1    root          ssh://remoteserver1::homegs      N/A           Faulty    N/A             N/A</p>
<p dir="ltr"> </p>
<p dir="ltr"># gluster volume geo-replication home ssh://remoteserver1::homegs config<br>
special_sync_mode: partial<br>
state_socket_unencoded: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.socket<br>
gluster_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.gluster.log<br>
ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem<br>
ignore_deletes: false<br>
change_detector: changelog<br>
gluster_command_dir: /usr/sbin/<br>
georep_session_working_dir: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/<br>
state_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.status<br>
remote_gsyncd: /nonexistent/gsyncd<br>
session_owner: 2299a204-a1dc-449d-8556-bc65197373c7<br>
changelog_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-changes.log<br>
socketdir: /var/run/gluster<br>
working_dir: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs<br>
state_detail_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-detail.status<br>
ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem<br>
pid_file: /var/lib/glusterd/geo-replication/home_remoteserver_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.pid<br>
log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.log<br>
gluster_params: aux-gfid-mount acl<br>
volume_id: 2299a204-a1dc-449d-8556-bc65197373c7</p>
<p dir="ltr"> </p>
<p dir="ltr">The logs look like on the master on server1:</p>
<p dir="ltr"> </p>
<p dir="ltr">[2015-08-24 15:21:07.955600] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------<br>
[2015-08-24 15:21:07.955883] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker<br>
[2015-08-24 15:21:08.69528] I [gsyncd(/gluster/home-brick-1):649:main_i] &lt;top&gt;: syncing: gluster://localhost:home -&gt; ssh://root@vivlinuxinfra1.uberit.net:gluster://localhost:homegs<br>
[2015-08-24 15:21:08.70938] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...<br>
[2015-08-24 15:21:11.255237] I [master(/gluster/home-brick-1):83:gmaster_builder] &lt;top&gt;: setting up xsync change detection mode<br>
[2015-08-24 15:21:11.255532] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using &#39;rsync&#39; as the sync engine<br>
[2015-08-24 15:21:11.256570] I [master(/gluster/home-brick-1):83:gmaster_builder] &lt;top&gt;: setting up changelog change detection mode<br>
[2015-08-24 15:21:11.256726] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using &#39;rsync&#39; as the sync engine<br>
[2015-08-24 15:21:11.257345] I [master(/gluster/home-brick-1):83:gmaster_builder] &lt;top&gt;: setting up changeloghistory change detection mode<br>
[2015-08-24 15:21:11.257534] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using &#39;rsync&#39; as the sync engine<br>
[2015-08-24 15:21:13.333628] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40172.18.0.169%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync<br>
[2015-08-24 15:21:13.333870] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073<br>
[2015-08-24 15:21:13.401132] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...<br>
[2015-08-24 15:21:13.412795] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds<br>
[2015-08-24 15:21:13.427340] I [master(/gluster/home-brick-1):1127:crawl] _GMaster: starting history crawl... turns: 1, stime: (1440411353, 0)<br>
[2015-08-24 15:21:14.432327] I [master(/gluster/home-brick-1):1156:crawl] _GMaster: slave&#39;s time: (1440411353, 0)<br>
[2015-08-24 15:21:14.890889] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 20960:140215190427392:1440426074.56 (entry_ops) failed on peer with OSError<br>
[2015-08-24 15:21:14.891124] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] &lt;top&gt;: FAIL:<br>
Traceback (most recent call last):<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py&quot;, line 165, in main<br>
    main_i()<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py&quot;, line 659, in main_i<br>
    local.service_loop(*[r for r in [remote] if r])<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/resource.py&quot;, line 1438, in service_loop<br>
    g3.crawlwrap(oneshot=True)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/master.py&quot;, line 584, in crawlwrap<br>
    self.crawl()<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/master.py&quot;, line 1165, in crawl<br>
    self.changelogs_batch_process(changes)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/master.py&quot;, line 1074, in changelogs_batch_process<br>
    self.process(batch)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/master.py&quot;, line 952, in process<br>
    self.process_change(change, done, retry)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/master.py&quot;, line 907, in process_change<br>
    failures = self.slave.server.entry_ops(entries)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/repce.py&quot;, line 226, in __call__<br>
    return self.ins(self.meth, *a)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/repce.py&quot;, line 208, in __call__<br>
    raise res<br>
OSError: [Errno 5] Input/output error<br>
[2015-08-24 15:21:14.892291] I [syncdutils(/gluster/home-brick-1):220:finalize] &lt;top&gt;: exiting.<br>
[2015-08-24 15:21:14.893665] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.<br>
[2015-08-24 15:21:14.893879] I [syncdutils(agent):220:finalize] &lt;top&gt;: exiting.<br>
[2015-08-24 15:21:15.259360] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase</p>
<p dir="ltr"> </p>
<p dir="ltr">and on master server2:</p>
<p dir="ltr"> </p>
<p dir="ltr">[2015-08-24 15:21:07.650707] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------<br>
[2015-08-24 15:21:07.651144] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker<br>
[2015-08-24 15:21:07.764817] I [gsyncd(/gluster/home-brick-1):649:main_i] &lt;top&gt;: syncing: gluster://localhost:home -&gt; ssh://root@remoteserver1.uberit.net:gluster://localhost:homegs<br>
[2015-08-24 15:21:07.768552] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...<br>
[2015-08-24 15:21:11.9820] I [master(/gluster/home-brick-1):83:gmaster_builder] &lt;top&gt;: setting up xsync change detection mode<br>
[2015-08-24 15:21:11.10199] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using &#39;rsync&#39; as the sync engine<br>
[2015-08-24 15:21:11.10946] I [master(/gluster/home-brick-1):83:gmaster_builder] &lt;top&gt;: setting up changelog change detection mode<br>
[2015-08-24 15:21:11.11115] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using &#39;rsync&#39; as the sync engine<br>
[2015-08-24 15:21:11.11744] I [master(/gluster/home-brick-1):83:gmaster_builder] &lt;top&gt;: setting up changeloghistory change detection mode<br>
[2015-08-24 15:21:11.11933] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using &#39;rsync&#39; as the sync engine<br>
[2015-08-24 15:21:13.59192] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync<br>
[2015-08-24 15:21:13.59454] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073<br>
[2015-08-24 15:21:13.113203] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...<br>
[2015-08-24 15:21:13.122018] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds<br>
[2015-08-24 15:21:23.209912] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 1561:140164806457088:1440426083.11 (keep_alive) failed on peer with OSError<br>
[2015-08-24 15:21:23.210119] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] &lt;top&gt;: FAIL:<br>
Traceback (most recent call last):<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py&quot;, line 306, in twrap<br>
    tf(*aa)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/master.py&quot;, line 438, in keep_alive<br>
    cls.slave.server.keep_alive(vi)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/repce.py&quot;, line 226, in __call__<br>
    return self.ins(self.meth, *a)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/repce.py&quot;, line 208, in __call__<br>
    raise res<br>
OSError: [Errno 22] Invalid argument<br>
[2015-08-24 15:21:23.210975] I [syncdutils(/gluster/home-brick-1):220:finalize] &lt;top&gt;: exiting.<br>
[2015-08-24 15:21:23.212455] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.<br>
[2015-08-24 15:21:23.212707] I [syncdutils(agent):220:finalize] &lt;top&gt;: exiting.<br>
[2015-08-24 15:21:24.23336] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase</p>
<p dir="ltr"> </p>
<p dir="ltr">and on the slave (in a different timezone, one hour behind):</p>
<p dir="ltr"> </p>
<p dir="ltr">[2015-08-24 14:22:02.923098] I [resource(slave):844:service_loop] GLUSTER: slave listening<br>
[2015-08-24 14:22:07.606774] E [repce(slave):117:worker] &lt;top&gt;: call failed:<br>
Traceback (most recent call last):<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/repce.py&quot;, line 113, in worker<br>
    res = getattr(self.obj, rmeth)(*in_data[2:])<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/resource.py&quot;, line 731, in entry_ops<br>
    [ESTALE, EINVAL])<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py&quot;, line 475, in errno_wrap<br>
    return call(*arg)<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py&quot;, line 78, in lsetxattr<br>
    cls.raise_oserr()<br>
  File &quot;/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py&quot;, line 37, in raise_oserr<br>
    raise OSError(errn, os.strerror(errn))<br>
OSError: [Errno 5] Input/output error<br>
[2015-08-24 14:22:07.652092] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.<br>
[2015-08-24 14:22:07.652364] I [syncdutils(slave):220:finalize] &lt;top&gt;: exiting.<br><br><br><br></p>