<div dir="ltr">Added Rafi, Raghavendra who work on RDMA<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Aug 8, 2016 at 7:58 AM, Dan Lavu <span dir="ltr">&lt;<a href="mailto:dan@redhat.com" target="_blank">dan@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div>Hello,<br><br></div>I&#39;m having some major problems with Gluster and oVirt, I&#39;ve been ripping my hair out with this, so if anybody can provide insight, that will be fantastic. I&#39;ve tried both transports TCP and RDMA... both are having instability problems. <br><br>So the first thing I&#39;m running into, intermittently, on one specific node, will get spammed with the following message;<br><br>&quot;[2016-08-08 00:42:50.837992] E [rpc-clnt.c:357:saved_frames_<wbr>unwind] (--&gt; /lib64/libglusterfs.so.0(_gf_<wbr>log_callingfn+0x1a3)[<wbr>0x7fb728b0f293] (--&gt; /lib64/libgfrpc.so.0(saved_<wbr>frames_unwind+0x1d1)[<wbr>0x7fb7288d73d1] (--&gt; /lib64/libgfrpc.so.0(saved_<wbr>frames_destroy+0xe)[<wbr>0x7fb7288d74ee] (--&gt; /lib64/libgfrpc.so.0(rpc_clnt_<wbr>connection_cleanup+0x7e)[<wbr>0x7fb7288d8d0e] (--&gt; /lib64/libgfrpc.so.0(rpc_clnt_<wbr>notify+0x88)[0x7fb7288d9528] ))))) 0-vmdata1-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-08 00:42:43.620710 (xid=0x6800b)&quot;<br><br></div>Then the infiniband device will get bounced and VMs will get stuck. <br><br></div><div>Another problem I&#39;m seeing, once a day, or every two days, an oVirt node will hang on gluster mounts. Issuing a df to check the mounts will just stall, this occurs hourly if RDMA is used. I can log into the hypervisor remount the gluster volumes most of the time.<br></div><div><br></div><div></div>This is on Fedora 23; Gluster 3.8.1-1, the Infiniband gear is 40Gb/s QDR Qlogic, using the ib_qib module, this configuration was working with our old infinihost III. I couldn&#39;t get OFED to compile so all the infiniband modules are Fedora installed. 
So a volume looks like the following (please say if there is anything I need to adjust; the settings were pulled from several examples):

Volume Name: vmdata_ha
Type: Replicate
Volume ID: 325a5fda-a491-4c40-8502-f89776a3c642
Status: Started
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp,rdma
Bricks:
Brick1: deadpool.ib.runlevelone.lan:/gluster/vmdata_ha
Brick2: spidey.ib.runlevelone.lan:/gluster/vmdata_ha
Brick3: groot.ib.runlevelone.lan:/gluster/vmdata_ha (arbiter)
Options Reconfigured:
performance.least-prio-threads: 4
performance.low-prio-threads: 16
performance.normal-prio-threads: 24
performance.high-prio-threads: 24
cluster.self-heal-window-size: 32
cluster.self-heal-daemon: on
performance.md-cache-timeout: 1
performance.cache-max-file-size: 2MB
performance.io-thread-count: 32
network.ping-timeout: 5
performance.write-behind-window-size: 4MB
performance.cache-size: 256MB
performance.cache-refresh-timeout: 10
server.allow-insecure: on
network.remote-dio: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-gid: 36
storage.owner-uid: 36
performance.readdir-ahead: on
nfs.disable: on
config.transport: tcp,rdma
performance.stat-prefetch: off
cluster.eager-lock: enable

Volume Name: vmdata1
Type: Distribute
Volume ID: 3afefcb3-887c-4315-b9dc-f4e890f786eb
Status: Started
Number of Bricks: 2
Transport-type: tcp,rdma
Bricks:
Brick1: spidey.ib.runlevelone.lan:/gluster/vmdata1
Brick2: deadpool.ib.runlevelone.lan:/gluster/vmdata1
Options Reconfigured:
config.transport: tcp,rdma
network.remote-dio: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
nfs.disable: on
storage.owner-gid: 36
storage.owner-uid: 36
performance.readdir-ahead: on
server.allow-insecure: on
performance.stat-prefetch: off
performance.cache-refresh-timeout: 10
performance.cache-size: 256MB
performance.write-behind-window-size: 4MB
network.ping-timeout: 5
performance.io-thread-count: 32
performance.cache-max-file-size: 2MB
performance.md-cache-timeout: 1
performance.high-prio-threads: 24
performance.normal-prio-threads: 24
performance.low-prio-threads: 16
performance.least-prio-threads: 4

/etc/glusterfs/glusterd.vol:

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,tcp
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 0
    option event-threads 1
#    option rpc-auth-allow-insecure on
    option transport.socket.bind-address 0.0.0.0
#   option transport.address-family inet6
#   option base-port 49152
end-volume

I think that's a good start. Thank you so much for taking the time to look at this. You can find me on freenode, nick side_control, if you want to chat; I'm GMT-5.

Cheers,

Dan
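P.S. For anyone wanting to reproduce the layout, the volumes were created and tuned with commands along these lines. This is reconstructed from the volume info above rather than copied from shell history, so treat it as a sketch:

# replicated volume with arbiter (1 x (2 + 1) = 3)
gluster volume create vmdata_ha replica 3 arbiter 1 transport tcp,rdma \
    deadpool.ib.runlevelone.lan:/gluster/vmdata_ha \
    spidey.ib.runlevelone.lan:/gluster/vmdata_ha \
    groot.ib.runlevelone.lan:/gluster/vmdata_ha

# plain distribute volume across two bricks
gluster volume create vmdata1 transport tcp,rdma \
    spidey.ib.runlevelone.lan:/gluster/vmdata1 \
    deadpool.ib.runlevelone.lan:/gluster/vmdata1

# options set one at a time, e.g.:
gluster volume set vmdata_ha network.ping-timeout 5
gluster volume set vmdata_ha network.remote-dio enable

gluster volume start vmdata_ha
gluster volume start vmdata1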
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

-- 
Pranith