<div dir="ltr">So I did a recursive touch yesterday hoping that it will fix the georep failures, but heal info shows no issues today, potentially because of what I did yesterday.<div><br><div>There is 144MB worth of changelogs, which I've zipped up uploaded to OneDrive. I'll email you the link off the mailing list.</div><div>I am not sure if you need the changelogs from the past or from the future to understand the issue better.</div><div><br></div><div>I've also zipped up all logs from the master grepping for ' (E|W) '</div><div><br></div><div>If needed, I can also zip up and upload the logs from the slave side too.</div><div><br></div><div>Everything worked fine while we've had a static set of files. There was around 26k images, all under a mb.</div><div>Recently we've moved thumbnail cache on to the same volume, which over time generated another 260k images in different resolutions, but once they are generated, they are more or less static again, unless we upload some new images (which we do rarely), so I wouldn't say that the workload is phenomenal, though after doing the recursive touch yesterday from ~650 to ~700, the number of failed items has increased.<br></div><div><br></div><div><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Nov 25, 2015 at 5:34 AM, Aravinda <span dir="ltr"><<a href="mailto:avishwan@redhat.com" target="_blank">avishwan@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
One more thing,<br>
<br>
No need to worry too much about the SKIPPED_GFIDs list. Due to the entry
failure, Geo-rep is unable to create the entry and the subsequent rsync
fails for that file, but all the GFIDs that were in the same batch
are logged as failures. That is misleading: rsync does a partial sync,
skipping the failed GFIDs and syncing the rest of the files. I am working on
fixing the logging issue.<br>
<pre cols="72">regards
Aravinda</pre><span class="">
<div>On 11/25/2015 10:51 AM, Aravinda wrote:<br>
</div>
</span><blockquote type="cite"><span class="">Hi,
<br>
<br>
Looks like a GFID conflict on the Slave. (A file with the same name but a
different GFID exists on the Slave and was left undeleted, possibly due to an
unlink failure or some other failure.)
<br>
We need to identify the cause of the GFID conflict. Please share the
workload details, or share the changelogs from the brick
backend (/data/media/.glusterfs/changelogs).
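<br>
For example, something like the following on a master brick node should collect them (the brick path is taken from the logs in this thread; the archive name is only an example):
<pre># Changelog files live under the brick and are named CHANGELOG.TIMESTAMP,
# one file per rollover interval
ls /data/media/.glusterfs/changelogs/ | head

# Pack the whole directory up for sharing
tar czf /tmp/media-changelogs.tar.gz -C /data/media/.glusterfs changelogs</pre>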
<br>
<br>
"ENTRY FAILED" shows file exists error but shows different GFID
<br>
<br>
[2015-11-20 11:40:14.93090] W
[master(/data/media):803:log_failures]
<br>
_GMaster: ENTRY FAILED: ({'uid': 33, 'gfid':
<br>
'31d66429-c700-4a10-bb32-35e1b36a479f', 'gid': 33, 'mode': 33206,
'entry':
<br>
'.gfid/b1dc6c6d-dac7-4da9-9577-4614942a72a0/official-nightmare-before-christmas-vampire-teddy-girls-dress-body-web.jpg',
<br></span>
'op': 'CREATE'},*17, 'df0e67f5-f2ce-45c3-b4f1-224aa3059ec7'*)
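<br>
If you have direct access to the brick directories (/data/media on both sides, per this thread), one way to confirm the conflict is to compare the on-disk GFID of the same file on the master and slave bricks. A rough sketch, where $PARENT_DIR stands in for the directory that the .gfid parent above resolves to:
<pre># Run as root on a master brick node (getfattr is from the attr package);
# trusted.gfid is the on-disk GFID of the file
getfattr -n trusted.gfid -e hex \
  "/data/media/$PARENT_DIR/official-nightmare-before-christmas-vampire-teddy-girls-dress-body-web.jpg"

# Repeat the same command on the slave brick. If the two hex values differ, the slave
# holds a stale copy with a different GFID, which is exactly the conflict logged above.</pre>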
<br><div><div class="h5">
<br>
Also, it looks like there are split-brain issues on the Slave. Refer to this document
to resolve them:
<br>
<br>
<a href="https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md" target="_blank">https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md</a>
<br>
<br>
regards
<br>
Aravinda
<br>
<br>
On 11/25/2015 03:08 AM, Audrius Butkevicius wrote:
<br>
<blockquote type="cite">So the version of rsync is 3.1.0, but the
bug mentioned only applies to
<br>
large files, whereas in my case the files are less than a MB.
<br>
<br>
I've started digging through the logs and found a bunch of these
on the
<br>
slave:
<br>
<br>
[2015-11-20 11:40:46.730805] W
[fuse-bridge.c:1978:fuse_create_cbk]
<br>
0-glusterfs-fuse: 1882288:
/.gfid/31d66429-c700-4a10-bb32-35e1b36a479f =>
<br>
-1 (Operation not permitted)
<br>
[2015-11-20 12:39:59.269844] W
[fuse-bridge.c:1978:fuse_create_cbk]
<br>
0-glusterfs-fuse: 1918306:
/.gfid/6802a0c6-1f62-4213-a70d-7b46d9ff8f3a =>
<br>
-1 (Operation not permitted)
<br>
<br>
So something funky was happening for an hour 4 days ago. Given
the volume
<br>
is on EBS, maybe there was some glitch there.
<br>
<br>
I can also find the corresponding failures on the master:
<br>
<br>
[2015-11-20 11:40:14.93090] W
[master(/data/media):803:log_failures]
<br>
_GMaster: ENTRY FAILED: ({'uid': 33, 'gfid':
<br>
'31d66429-c700-4a10-bb32-35e1b36a479f', 'gid': 33, 'mode':
33206, 'entry':
<br>
'.gfid/b1dc6c6d-dac7-4da9-9577-4614942a72a0/official-nightmare-before-christmas-vampire-teddy-girls-dress-body-web.jpg',
<br>
'op': 'CREATE'}, 17, 'df0e67f5-f2ce-45c3-b4f1-224aa3059ec7')
<br>
[2015-11-20 11:40:14.265054] W
[master(/data/media):803:log_failures]
<br>
_GMaster: META FAILED: ({'go':
<br>
'.gfid/31d66429-c700-4a10-bb32-35e1b36a479f', 'stat': {'atime':
<br>
1448019600.232466, 'gid': 33, 'mtime': 1448019600.316466,
'mode': 33279,
<br>
'uid': 33}, 'op': 'META'}, 2)
<br>
<br>
If I grep for SKIPPED GFID I get the following:
<br>
<br>
[2015-11-20 11:40:40.704817] W
[master(/data/media):1014:process] _GMaster:
<br>
SKIPPED GFID =
<br>
192632af-28c5-4e03-a62d-458fe7f3b5f9,7ea8d7a8-524b-4dd0-b97a-dc7d3481f341,204f6112-0e8d-4f6d-855b-bf10f9c63b62,7e626e8f-edad-4f39-a6c6-547a1da34aa1,1f0d0208-1962-4eb1-91d4-cf7ed297d8e3,95d389c4-3258-4ca0-8fc4-26b8427b1eaf,425cedc6-6343-4326-8540-996d2d56dc9c,5955928b-2b8f-4cc9-a336-3eac4382789b,8932efcd-ba90-46ec-84c8-5e9e51cc84e9,2530275d-5f03-4143-9abf-d07cc79bf80a,73574466-86f3-4ab2-b5da-c31ac28c27c1,776e5e8f-5c6a-46b1-ad54-733e157d2097,008a69f3-217c-4dbc-a469-5a5bc8ecd589,dca8d8d9-03cf-4793-92e4-bfcfddd262f6,c85b7a29-73af-4f44-a07e-a44082d7a93a,6c1f56d6-4ea6-4910-9677-ea33edd35d28,0ea56588-87fa-4355-9403-e311525454fc,c8ce76c9-e21d-46ce-a2b5-14dfd0070f64,db9e6484-0e5e-4f6e-815b-3c2b273deee5,35d10752-43b5-4398-be5f-17cb9de73a6b,396e5faf-74a1-4849-97e3-009dbfb22836,d148e7d5-c2f3-4d06-8cd6-8588e6aac196,404d20c5-1c6c-4aad-98be-2c23930173b3,f1fae11c-db8e-4cd5-8e47-a3870316f89c,d8daa413-e57f-44fb-b907-b1a497f2dcfa,5f6ee8c2-84fb-432e-95cd-e428ab256e83,6bf54dcd-c3b4-4187-a390-eca!
841e46570,
335c07ca-d339-4d3a-aa88-3b5753d24fbf,8fdbac00-6628-4f22-8fb4-b7a6524cae49,31d66429-c700-4a10-bb32-35e1b36a479f
<br>
[2015-11-20 11:41:35.907850] W
[master(/data/media):1014:process] _GMaster:
<br>
SKIPPED GFID =
<br>
03069c7f-8eaa-45b0-92ed-50cb648cd912,788f5ed1-923e-4b86-9696-2a6de07ebb2e,43d12b40-b6e2-43c4-8883-85e89dc81321
<br>
[2015-11-20 12:11:55.492068] W
[master(/data/media):1014:process] _GMaster:
<br>
SKIPPED GFID =
<br>
eb02369f-7ca8-480a-b00c-768964410ed8,17045ac9-27dd-4bf9-9f90-d7b146070dd5,265e3d9c-1657-45cb-bbf6-db439eb18ccf,553c420f-b3cc-47f2-8d5f-cfc2ffdd1a92
<br>
[2015-11-20 12:12:53.372432] W
[master(/data/media):1014:process] _GMaster:
<br>
SKIPPED GFID =
<br>
66c5878e-8c00-4f7d-a3ad-4adec84a5e22,f4dc086d-9c2b-449c-9e31-bbae9ebcdea7,f99317b2-72e8-49e3-b676-647abad508b1
<br>
[2015-11-20 12:37:55.773813] W
[master(/data/media):1014:process] _GMaster:
<br>
SKIPPED GFID =
<br>
4af54f1c-e8e1-4915-9328-a458d5d35d5d,acbe1f12-87e8-4192-b864-d90030269bba,7d27a795-da63-4742-9e91-abd8fa543612,8d4e642d-fd40-44d6-8419-8d3459df7ce3
<br>
[2015-11-20 12:39:28.852575] W
[master(/data/media):1014:process] _GMaster:
<br>
SKIPPED GFID =
<br>
d90dc121-02e7-4a79-bc03-1bd8fddd9f48,54bb563f-ab44-4e91-a46b-764a122ce7fa,088141de-7545-40f9-b776-751738a89740,2dab3faf-4a6c-407a-88cd-cddef6f55299,d887806f-23b4-4389-a4dc-f9027702a2df,fc5a9bc8-ea62-4677-baed-16510541373a,33136ad2-c5b4-448c-991d-1e72fefef021,cf3e2675-e41b-4782-9478-91773eb0a4aa,6412d878-e0f1-4700-84df-05f4af35962f,ec3cf6e1-7f27-4650-b978-8a5a7f620389,d3651bb9-cd2d-4c5f-93e6-fe4fb1cdf5db,ecb0415e-1524-40f4-870e-1fd0f8371b1d,a118aaae-bd3e-4b19-a0e0-891aa9edb09a,7642d3f3-f1e5-4aca-bcfe-bdb3c44779a9,2e29f3f8-c460-48eb-9db5-b281b67cc2bf,e61db54b-3979-488a-8789-a5d0615c5a97,4212d840-9c22-4d9e-b61b-5e35271dfe80,dad1c60b-9da6-4e57-b014-daa1aca73ce3,93699a3d-40b8-4bbd-b78f-aabf965df57f,4fad7468-91f2-4deb-aaf7-6401068c9e6d,c9738295-46cc-4fe7-b359-dc94f5815ce9,91853c5c-4877-4c9e-9481-c86368942f78,59deed8e-d3d0-4ab7-854e-53a8dd455de0,20b86c13-7df1-4d13-bac1-7d628a00d6ce,b7b86a2d-7963-41a4-a423-14e25d1e78c4,3c17d7fe-bb7f-489c-a525-5c8b7bb93c3e,e230d207-7c68-4983-a958-f2d!
cfc1ce694,
fa8bf3c0-abae-446c-83c5-45ef8bcaa4b8,14089102-8106-45d9-a3f1-d1446b568f4e,6802a0c6-1f62-4213-a70d-7b46d9ff8f3a,0a253bbc-ef98-4da0-951f-e17c5a7f5858,ef054b76-986b-4a89-b8e6-b4988221aaa2,48c0a153-708c-44ee-b186-cf255936a02b,fa2646a6-807c-4e9d-8f2b-a9cdf2674e0c,1ed4a563-4f6a-4b5a-9866-89025fe7afd5,0f293cf7-bc32-4f8a-87d5-388a4bffb4af,f4126726-667b-451d-8214-a18bb3f468cd,e23dc8b3-da1c-4d18-aec9-22e0aa174d81,40b9f10d-7304-4c0b-8498-bef23b305d03,15c25d1e-2a62-495e-887f-14d0cb0527b1,67371804-9084-4801-b664-44e88bea8ac3,4750fa3f-d1a4-4472-b10d-3f75d0b451dc
<br>
[2015-11-23 09:18:10.43391] W [master(/data/media):1014:process]
_GMaster:
<br>
SKIPPED GFID =
<br>
228843f3-62f0-4687-b5eb-6d1e21257ad0,b0078359-fbf0-4709-8f40-8383a11d7875,60cff4d5-8b5d-4f7f-8bc1-27081a011458,bedb6ac4-208d-47e1-812c-5547c84ab841,da6810d9-4883-45e1-b73e-55a7ff17b5e7,e03b5c03-b25c-49ba-86f0-8a709a9c2658,053673a0-c1cc-4057-83fa-f97740cb5d4f,dbd6ea84-8f24-4a47-ac41-22c3fd788ecf,43caa3e7-ca04-47ab-b950-105606b313a4,62d8b1d0-fc89-4fb1-a41a-957dcb34d325,4e8fe1fa-60cd-47fa-bad6-f617c312f53b,6c3d6cf3-62ae-4ab8-9dc3-7815552401fe,f79be814-7e78-4985-bcdd-688da23d1808,c4186455-0f06-4b5d-89be-3c5ccbdeb6f0,f9c4ccdb-2337-479d-845d-ee4d85b69ece,bcd14726-1bab-4d97-8915-ec8bbe8faf8c,cca82341-a430-4a59-a900-1af66dcf7bb8,b7043a8e-4286-4831-91ec-c146e40bc6be,995ffeb6-a906-4078-88c6-404a2b38aad4,227f9987-5057-4133-848a-2b22aca5dde1,90b35242-32db-4570-8070-cf9dd49322a5,c6863c8f-1914-4a2d-814b-6e5853134faf,e2d19b1a-fc07-441c-b110-ca816b46fc40,9a3d0c0b-7d84-416f-9f3e-21b32a11ba1d,d8163f6b-8c40-418c-9c06-b3743af24e4e,522d7247-a75b-4af9-acb2-52a99eeced89,4b56ea9d-413a-4e24-b44e-433!
f7603ad6d
<br>
<br>
There are also the following lines on the master, which might
have some
<br>
impact:
<br>
<br>
E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done]
<br>
0-media-replicate-0: Failing READ on gfid
<br>
abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed.
[Input/output
<br>
error]
<br>
<br>
E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done]
<br>
0-media-replicate-0: Failing GETXATTR on gfid
<br>
abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed.
[Input/output
<br>
error]
<br>
<br>
E [mem-pool.c:417:mem_get0]
<br>
(-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x809a2)
[0x7f79e436b9a2]
<br>
-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f)
<br>
[0x7f79e430cb1f]
<br>
-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81)
<br>
[0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid
argument]
<br>
<br>
E [mem-pool.c:417:mem_get0]
<br>
(-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(recursive_rmdir+0x192)
<br>
[0x7f79e4329b32]
<br>
-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f)
<br>
[0x7f79e430cb1f]
<br>
-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81)
<br>
[0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid
argument]
<br>
<br>
E [resource(/data/media):222:errlog] Popen: command "ssh
<br>
-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
<br>
/var/lib/glusterd/geo-replication/secret.pem
-oControlMaster=auto -S
<br>
/tmp/gsyncd-aux-ssh-dpY5cI/8216bb7da58a00926f369bb7ac8c7e03.sock
<br>
<a href="mailto:root@us-west-gluster.server.com" target="_blank">root@us-west-gluster.server.com</a>
/usr/lib/x86_64-linux-gnu/glusterfs/gsyncd
<br>
--session-owner 6922055e-49a1-4afd-a3a0-a47960d6ba54 -N --listen
--timeout
<br>
120 gluster://localhost:media" returned with 143, saying:
<br>
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
<br>
21:57:19.772896] I [cli.c:721:main] 0-cli: Started running
<br>
/usr/sbin/gluster with version 3.7.5
<br>
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
<br>
21:57:19.772955] I [cli.c:608:cli_rpc_init] 0-cli: Connecting to
remote
<br>
glusterd at localhost
<br>
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
<br>
21:57:19.871930] I [MSGID: 101190]
<br>
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started
thread
<br>
with index 1
<br>
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
<br>
21:57:19.872018] I [socket.c:2355:socket_event_handler]
0-transport:
<br>
disconnecting now
<br>
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
<br>
21:57:19.872898] I [cli-rpc-ops.c:6348:gf_cli_getwd_cbk] 0-cli:
Received
<br>
resp to getwd
<br>
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
<br>
21:57:19.872963] I [input.c:36:cli_batch] 0-: Exiting with: 0
<br>
<br>
Status detail shows the following:
<br>
<br>
root@eu-gluster-1:/var/log/glusterfs/geo-replication/media# gluster volume geo-replication media root@us-west-gluster.websitewebsitewebs.com::media status detail
<br>
<br>
MASTER NODE: eu-gluster-1.websitewebsitewebs.com | MASTER VOL: media | MASTER BRICK: /data/media | SLAVE USER: root
<br>
SLAVE: us-west-gluster.websitewebsitewebs.com::media | SLAVE NODE: us-west-gluster.websitewebsitewebs.com
<br>
STATUS: Active | CRAWL STATUS: Changelog Crawl | LAST_SYNCED: 2015-11-24 20:59:25
<br>
ENTRY: 0 | DATA: 0 | META: 0 | FAILURES: 633
<br>
CHECKPOINT TIME: N/A | CHECKPOINT COMPLETED: N/A | CHECKPOINT COMPLETION TIME: N/A
<br>
<br>
MASTER NODE: eu-gluster-2.websitewebsitewebs.com | MASTER VOL: media | MASTER BRICK: /data/media | SLAVE USER: root
<br>
SLAVE: us-west-gluster.websitewebsitewebs.com::media | SLAVE NODE: us-west-gluster.websitewebsitewebs.com
<br>
STATUS: Passive | CRAWL STATUS: N/A | LAST_SYNCED: N/A
<br>
ENTRY: N/A | DATA: N/A | META: N/A | FAILURES: N/A
<br>
CHECKPOINT TIME: N/A | CHECKPOINT COMPLETED: N/A | CHECKPOINT COMPLETION TIME: N/A
<br>
<br>
<br>
<br>
<br>
What is the right way to retry failed items?
<br>
Can I get a list of them somehow so that I could touch them in the hope of
fixing this?
<br>
I also wonder why it does not retry the items automatically.
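<br>
In case it helps clarify what I mean, here is a rough, untested sketch of pulling the skipped GFIDs out of the geo-replication log and mapping them back to paths on the brick so they can be touched (log and brick paths as seen earlier in this thread; bash):
<pre># Collect the GFIDs reported in SKIPPED GFID lines, then resolve each one through its
# .glusterfs hard link on the brick and touch the corresponding file
grep -h 'SKIPPED GFID' /var/log/glusterfs/geo-replication/media/*.log |
  sed 's/.*SKIPPED GFID = //' | tr ',' '\n' | sort -u |
while read -r gfid; do
  glpath="/data/media/.glusterfs/${gfid:0:2}/${gfid:2:2}/${gfid}"
  if [ -f "$glpath" ]; then
    find /data/media -samefile "$glpath" -not -path '*/.glusterfs/*' -exec touch {} +
  fi
done</pre>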
<br>
<br>
<br>
On Tue, Nov 24, 2015 at 6:11 AM, Venky Shankar
<a href="mailto:vshankar@redhat.com" target="_blank"><vshankar@redhat.com></a> wrote:
<br>
<br>
<blockquote type="cite">On Tue, Nov 24, 2015 at 1:23 AM, Audrius
Butkevicius
<br>
<a href="mailto:audrius.butkevicius@gmail.com" target="_blank"><audrius.butkevicius@gmail.com></a> wrote:
<br>
<blockquote type="cite">Hi,
<br>
<br>
I've got a geo-replicated gluster volume, with a few hundred
thousand
<br>
images, which get generated on demand.
<br>
<br>
I started getting replication failures in the status detail
view, but
<br>
</blockquote>
it's
<br>
<blockquote type="cite">not obvious to me where to find the
actual errors or how to actually fix
<br>
them.
<br>
</blockquote>
Chris here[1] mentioned a bug in rsync (thanks!). Could
that be
<br>
the issue here?
<br>
<br>
Mind checking the rsync version used?
<br>
<br>
[1]:
<br>
<a href="http://www.gluster.org/pipermail/gluster-users/2015-November/024423.html" target="_blank">http://www.gluster.org/pipermail/gluster-users/2015-November/024423.html</a>
<br>
<br>
<blockquote type="cite">The docs seem to be secretive about
this as well. It seems if I tear the
<br>
geo-replication down, and do a force create from scratch, it
goes back in
<br>
sync again, but as the files get generated, it starts
getting failures
<br>
</blockquote>
again
<br>
<blockquote type="cite">at some point.
<br>
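(For reference, the tear-down and force re-create sequence I mean is roughly the following, using the master volume and slave from this thread; exact options may differ:)
<pre># Stop and delete the existing session, then re-create and start it again
gluster volume geo-replication media us-west-gluster.websitewebsitewebs.com::media stop
gluster volume geo-replication media us-west-gluster.websitewebsitewebs.com::media delete
gluster volume geo-replication media us-west-gluster.websitewebsitewebs.com::media create push-pem force
gluster volume geo-replication media us-west-gluster.websitewebsitewebs.com::media start</pre>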
<br>
Can someone provide me with information on how to check
which files are
<br>
causing failures, and what are the actual failures? Or point
me to the
<br>
relevant part in the docs?
<br>
<br>
Version 3.7.5-ubuntu1~trusty1
<br>
<br>
Related SO question:
<br>
<br>
<a href="http://stackoverflow.com/questions/33839056/gluster-geo-replication-debugging-failures" target="_blank">http://stackoverflow.com/questions/33839056/gluster-geo-replication-debugging-failures</a>
<br>
</blockquote>
<blockquote type="cite">Thanks,
<br>
<br>
Audrius.
<br>
<br>
<br>
_______________________________________________
<br>
Gluster-users mailing list
<br>
<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a>
<br>
<a href="http://www.gluster.org/mailman/listinfo/gluster-users" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-users</a>
<br>
</blockquote>
</blockquote>
<br>
<br>
_______________________________________________
<br>
Gluster-users mailing list
<br>
<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a>
<br>
<a href="http://www.gluster.org/mailman/listinfo/gluster-users" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-users</a>
<br>
</blockquote>
<br>
<br>
<br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
Gluster-users mailing list
<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a>
<a href="http://www.gluster.org/mailman/listinfo/gluster-users" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-users</a></pre>
</div></div></blockquote>
<br>
</div>
</blockquote></div><br></div>