<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:#0563C1;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:#954F72;

        text-decoration:underline;}

span.EmailStyle17

        {mso-style-type:personal-compose;

        font-family:"Calibri",sans-serif;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]--></head><body lang=EN-US link="#0563C1" vlink="#954F72"><div class=WordSection1><p class=MsoNormal>I have been running gluster as a storage backend to OpenNebula for about a year and it has been running great. I have had an intermittent problem that has gotten worse over the last couple of days and I could use some help. <o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Setup<o:p></o:p></p><p class=MsoNormal>=====<o:p></o:p></p><p class=MsoNormal>Gluster: 3.7.11<o:p></o:p></p><p class=MsoNormal>Hyper Converged Setup - Gluster with KVM&#8217;s on the same machines with Gluster in a Slice on each server.<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Four Servers - Each with 4 Bricks<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Type: Distributed-Replicate<o:p></o:p></p><p class=MsoNormal>Number of Bricks: 4 x 3 = 12<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Bricks are 1TB SSD's<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Gluster Status:&nbsp; http://pastebin.com/Nux7VB4b<o:p></o:p></p><p class=MsoNormal>Gluster Info:&nbsp; http://pastebin.com/G5qR0kZq<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Gluster is supporting qcow2 images that the KVM&#8217;s are using.&nbsp; Image Sizes:&nbsp; 10GB up to 300GB images.<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>The volume is mounted on each node with glusterfs as a shared file system.&nbsp; The KVM's using the images are using libgfapi ( i.e. 
file=gluster://shchst01:24007/shchst01/d8fcfdb97bc462aca502d5fe703afc66 )<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Issue<o:p></o:p></p><p class=MsoNormal>=====<o:p></o:p></p><p class=MsoNormal>This setup has been running well, with the exception of this intermittent problem.&nbsp; This only happens on one node.&nbsp; It has happened on other bricks (all on the same node) but more frequently on Node 2: Brick 4<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>It starts here:&nbsp; http://pastebin.com/YgeJ5VA9<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Dec 18 02:08:54 shchhv02 kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>This continues until:<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Dec 18 02:11:10 shchhv02 storage-shchst01[14728]: [2016-12-18 08:11:10.428138] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 4-shchst01-client-11: server xxx.xx.xx.11:49155 has not responded in the last 42 seconds, disconnecting.<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>storage log:&nbsp; http://pastebin.com/vxCdRnEg<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>[2016-12-18 08:11:10.435927] E [MSGID: 114031] [client-rpc-fops.c:2886:client3_3_opendir_cbk] 4-shchst01-client-11: remote operation failed. 
Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:11:10.436240] E [rpc-clnt.c:362:saved_frames_unwind] (--&gt; /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f06efbaeae2] (--&gt; /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f06ef97990e] (--&gt; /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f06ef979a1e] (--&gt; /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f06ef97b40a] (--&gt; /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f06ef97bc38] ))))) 4-shchst01-client-11: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-12-18 08:10:28.424311 (xid=0x36883d)<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:11:10.436255] W [rpc-clnt-ping.c:208:rpc_clnt_ping_cbk] 4-shchst01-client-11: socket disconnected<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:11:10.436369] E [rpc-clnt.c:362:saved_frames_unwind] (--&gt; /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f06efbaeae2] (--&gt; /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f06ef97990e] (--&gt; /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f06ef979a1e] (--&gt; /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f06ef97b40a] (--&gt; /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f06ef97bc38] ))))) 4-shchst01-client-11: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-12-18 08:10:38.370507 (xid=0x36883e)<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:11:10.436388] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 4-shchst01-client-11: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:11:10.436488] I [MSGID: 114018] [client.c:2030:client_rpc_notify] 4-shchst01-client-11: disconnected from shchst01-client-11. 
Client process will keep trying to connect to glusterd until brick's port is available<o:p></o:p></p><p class=MsoNormal>The message &quot;W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk] 4-shchst01-client-11: remote operation failed [Transport endpoint is not connected]&quot; repeated 3 times between [2016-12-18 08:11:10.432640] and [2016-12-18 08:11:10.433530]<o:p></o:p></p><p class=MsoNormal>The message &quot;W [MSGID: 114031] [client-rpc-fops.c:2669:client3_3_readdirp_cbk] 4-shchst01-client-11: remote operation failed [Transport endpoint is not connected]&quot; repeated 15 times between [2016-12-18 08:11:10.428844] and [2016-12-18 08:11:10.435727]<o:p></o:p></p><p class=MsoNormal>The message &quot;W [MSGID: 114061] [client-rpc-fops.c:4560:client3_3_fstat] 4-shchst01-client-11:&nbsp; (00000000-0000-0000-0000-000000000001) remote_fd is -1. EBADFD [File descriptor in bad state]&quot; repeated 11 times between [2016-12-18 08:11:10.433598] and [2016-12-18 08:11:10.435742]<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>brick 4 log:&nbsp; http://pastebin.com/kQcNyGk2<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>[2016-12-18 08:08:33.000483] I [dict.c:473:dict_get] (--&gt;/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac) [0x7f8504feccbc] --&gt;/usr/lib64/glusterfs/3.7.11/xlator/features/marker.so(marker_getxattr_cbk+0xa7) [0x7f84f5734917] --&gt;/lib64/libglusterfs.so.0(dict_get+0xac) [0x7f8504fdd0fc] ) 0-dict: !this || key=() [Invalid argument]<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:08:33.003178] I [dict.c:473:dict_get] (--&gt;/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac) [0x7f8504feccbc] --&gt;/usr/lib64/glusterfs/3.7.11/xlator/features/marker.so(marker_getxattr_cbk+0xa7) [0x7f84f5734917] --&gt;/lib64/libglusterfs.so.0(dict_get+0xac) [0x7f8504fdd0fc] ) 0-dict: !this || key=() [Invalid argument]<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:08:34.021937] I [dict.c:473:dict_get] 
(--&gt;/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac) [0x7f8504feccbc] --&gt;/usr/lib64/glusterfs/3.7.11/xlator/features/marker.so(marker_getxattr_cbk+0xa7) [0x7f84f5734917] --&gt;/lib64/libglusterfs.so.0(dict_get+0xac) [0x7f8504fdd0fc] ) 0-dict: !this || key=() [Invalid argument]<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:10:11.671642] E [server-helpers.c:390:server_alloc_frame] (--&gt;/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x2fb) [0x7f8504dad73b] --&gt;/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(server3_3_fxattrop+0x86) [0x7f84f48a9a76] --&gt;/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(get_frame_from_request+0x2fb) [0x7f84f489eedb] ) 0-server: invalid argument: client [Invalid argument]<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:10:11.671689] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:10:11.671808] I [login.c:81:gf_auth] 0-auth/login: allowed user names: b7391aaa-d0cb-4db6-9e4c-999310c97eb6<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:10:11.671820] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-shchst01-server: accepted client from shchhv03-13679-2016/12/17-22:57:24:920194-shchst01-client-11-0-2 (version: 3.7.11)<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.526854] W [socket.c:589:__socket_rwv] 0-tcp.shchst01-server: writev on xxx.xxx.xx.12:65527 failed (Broken pipe)<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.526909] I [socket.c:2356:socket_event_handler] 0-transport: disconnecting now<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.526935] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-shchst01-server: disconnecting connection from shchhv03-10686-2016/12/16-06:08:16:797591-shchst01-client-11-0-6<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.526976] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-shchst01-server: fd cleanup on 
/b40877dae051c076b95c160f2f639e45<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.527008] W [socket.c:589:__socket_rwv] 0-tcp.shchst01-server: writev on xxx.xxx.xx.12:65525 failed (Broken pipe)<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.527009] I [socket.c:3378:socket_submit_reply] 0-tcp.shchst01-server: not connected (priv-&gt;connected = -1)<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.527040] E [rpcsvc.c:1314:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x309470, Program: GlusterFS 3.3, ProgVers: 330, Proc: 16) to rpc-transport (tcp.shchst01-server)<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.527099] I [socket.c:2356:socket_event_handler] 0-transport: disconnecting now<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.527114] E [server.c:205:server_submit_reply] (--&gt;/usr/lib64/glusterfs/3.7.11/xlator/debug/io-stats.so(io_stats_fsync_cbk+0xc8) [0x7f84f4ada308] --&gt;/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(server_fsync_cbk+0x384) [0x7f84f48b0444] --&gt;/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(server_submit_reply+0x2f6) [0x7f84f489b086] ) 0-: Reply submission failed<o:p></o:p></p><p class=MsoNormal>[2016-12-18 08:13:31.527121] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-shchst01-server: disconnecting connection from shchhv02-15410-2016/12/17-06:07:39:376627-shchst01-client-11-0-6<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>statedump (brick 4), taken later in the day:&nbsp; http://pastebin.com/DEE3RbT8<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Temporary Workaround<o:p></o:p></p><p class=MsoNormal>====================<o:p></o:p></p><p class=MsoNormal>Load rises on the node, as well as on one particular KVM (on another node).&nbsp; If we catch the load rise and drop the page cache, the problem seems to clear.&nbsp; I have not caught it often enough to provide more details.<o:p></o:p></p><p 
class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>echo 1 &gt; /proc/sys/vm/drop_caches<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>There is something that I am missing.&nbsp; I would appreciate any help to get me to root cause and resolution.<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Thanks,<o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>Gustave<o:p></o:p></p></div></body></html>