<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>
</head>
<body ocsi="0" fpstyle="1">
<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">Are there any typical reasons for glusterfsd falsely reporting a memory allocation failure when attempting to create a new IB QP? I'm seeing a high rate of similar cases but can't
surface any hardware or non-gluster software error. Recovering the volume after a crash is not a problem; what self-heal doesn't automagically handle, rebalancing takes care of just fine.<br>
<br>
Below is the glusterfs-glusterd log snippet for a typical crash. It happens with no particular pattern on any gluster server except the first in the series (which is also the one the clients specify in their mounts and thus go to for the vol info file).
The crash may occur during a 'hello world' run at one process per node across the cluster yet not during the final and most aggressive rank of an OpenMPI All-to-All benchmark, or vice versa; there's no particular correlation with MPI traffic load, IB/RDMA traffic
pattern, client population and/or activity, etc.<br>
<br>
In all failure cases, all IPoIB, Ethernet, RDMA, and IBCV tests completed without issue and returned the appropriate bandwidth/latency/pathing figures. All servers are running auditd and gmond, neither of which shows any indication of memory pressure or other failure. All servers
have run Pandora repeatedly without triggering any hardware faults. There are no complaints from the global OpenSM instances for either IB fabric at the management points, or from the PTP SMD GUID-locked instances running on the gluster servers and talking
to the backing storage controllers.<br>
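<br>
For completeness, these are the resource-limit checks I run to rule out the usual causes of ENOMEM from ibv_create_qp (a sketch only; `ibv_devinfo` is assumed to be installed from libibverbs-utils, and the relevant field names vary slightly by HCA):<br>
<br>
```shell
# Pinned (locked) memory limit for the process -- QP and CQ buffers
# must be registered with the HCA, so a low memlock cap can surface
# as ENOMEM from ibv_create_qp rather than an obvious OOM.
ulimit -l

# HCA-wide QP/CQ/MR capacity, if the libibverbs utilities are present
# (look for max_qp, max_cq, max_mr in the verbose output and compare
# against the number of connections the node is carrying).
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo -v | grep -Ei 'max_(qp|cq|mr):'
else
    echo "ibv_devinfo not installed; skipping HCA capacity check"
fi
```
<br>
Neither check turns up anything out of the ordinary here, which is part of why I suspect the report is false.<br>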
<br>
Any ideas?<br>
<br>
---------<br>
[2015-05-08 23:19:26.660870] C [rdma.c:2951:gf_rdma_create_qp] 0-rdma.management: rdma.management: could not create QP (Cannot allocate memory)<br>
[2015-05-08 23:19:26.660966] W [rdma.c:818:gf_rdma_cm_handle_connect_request] 0-rdma.management: could not create QP (peer:10.149.0.63:1013 me:10.149.1.142:24008)<br>
pending frames:<br>
patchset: git://git.gluster.com/glusterfs.git<br>
signal received: 11<br>
time of crash:<br>
2015-05-08 23:19:26<br>
configuration details:<br>
argp 1<br>
backtrace 1<br>
dlfcn 1<br>
libpthread 1<br>
llistxattr 1<br>
setfsid 1<br>
spinlock 1<br>
epoll.h 1<br>
xattr.h 1<br>
st_atim.tv_nsec 1<br>
package-string: glusterfs 3.6.2<br>
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x39e3a20136]<br>
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x39e3a3abbf]<br>
/lib64/libc.so.6[0x39e1a326a0]<br>
/usr/lib64/glusterfs/3.6.2/xlator/mgmt/glusterd.so(glusterd_rpcsvc_notify+0x69)[0x7fefd149ec59]<br>
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x105)[0x39e32081d5]<br>
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x1a0)[0x39e3209cd0]<br>
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x39e320b6d8]<br>
/usr/lib64/glusterfs/3.6.2/rpc-transport/rdma.so(+0x5941)[0x7fefd0251941]<br>
/usr/lib64/glusterfs/3.6.2/rpc-transport/rdma.so(+0xb6d9)[0x7fefd02576d9]<br>
/lib64/libpthread.so.0[0x39e26079d1]<br>
/lib64/libc.so.6(clone+0x6d)[0x39e1ae88fd]<br>
---------<br>
<br>
##<br>
## Soft bits are:<br>
##<br>
<br>
RHEL 6.6<br>
kernel 2.6.32-528.el6.bz1159925.x86_64<br>
(this is the 6.7 pre-release kernel with the latest ib_sm updates<br>
for the occasional mgroup bcast/join issues, see RH BZ)<br>
glibc-2.12-1.149.el6_6.7.x86_64<br>
compat-opensm-libs-3.3.5-3.el6.x86_64<br>
opensm-3.3.17-1.el6.x86_64<br>
opensm-libs-3.3.17-1.el6.x86_64<br>
opensm-multifabric-0.1-sgi710r3.rhel6.x86_64<br>
(these are vendor stubs that run IB subnet_id- and GUID-specific opensm<br>
master/standby instances integrated with cluster management)<br>
glusterfs-3.6.2-1.el6.x86_64<br>
glusterfs-debuginfo-3.6.2-1.el6.x86_64<br>
glusterfs-devel-3.6.2-1.el6.x86_64<br>
glusterfs-libs-3.6.2-1.el6.x86_64<br>
glusterfs-extra-xlators-3.6.2-1.el6.x86_64<br>
glusterfs-api-devel-3.6.2-1.el6.x86_64<br>
glusterfs-fuse-3.6.2-1.el6.x86_64<br>
glusterfs-server-3.6.2-1.el6.x86_64<br>
glusterfs-cli-3.6.2-1.el6.x86_64<br>
glusterfs-api-3.6.2-1.el6.x86_64<br>
glusterfs-rdma-3.6.2-1.el6.x86_64<br>
<br>
Volume in question:<br>
<br>
[root@phoenix-smc ~]# ssh service4 gluster vol info home<br>
<br>
Volume Name: home<br>
Type: Distribute<br>
Volume ID: f03fcaf0-3889-45ac-a06a-a4d60d5a673d<br>
Status: Started<br>
Number of Bricks: 28<br>
Transport-type: rdma<br>
Bricks:<br>
Brick1: service4-ib1:/mnt/l1_s4_ost0000_0000/brick<br>
Brick2: service4-ib1:/mnt/l1_s4_ost0001_0001/brick<br>
Brick3: service4-ib1:/mnt/l1_s4_ost0002_0002/brick<br>
Brick4: service5-ib1:/mnt/l1_s5_ost0003_0003/brick<br>
Brick5: service5-ib1:/mnt/l1_s5_ost0004_0004/brick<br>
Brick6: service5-ib1:/mnt/l1_s5_ost0005_0005/brick<br>
Brick7: service5-ib1:/mnt/l1_s5_ost0006_0006/brick<br>
Brick8: service6-ib1:/mnt/l1_s6_ost0007_0007/brick<br>
Brick9: service6-ib1:/mnt/l1_s6_ost0008_0008/brick<br>
Brick10: service6-ib1:/mnt/l1_s6_ost0009_0009/brick<br>
Brick11: service7-ib1:/mnt/l1_s7_ost000a_0010/brick<br>
Brick12: service7-ib1:/mnt/l1_s7_ost000b_0011/brick<br>
Brick13: service7-ib1:/mnt/l1_s7_ost000c_0012/brick<br>
Brick14: service7-ib1:/mnt/l1_s7_ost000d_0013/brick<br>
Brick15: service10-ib1:/mnt/l1_s10_ost000e_0014/brick<br>
Brick16: service10-ib1:/mnt/l1_s10_ost000f_0015/brick<br>
Brick17: service10-ib1:/mnt/l1_s10_ost0010_0016/brick<br>
Brick18: service11-ib1:/mnt/l1_s11_ost0011_0017/brick<br>
Brick19: service11-ib1:/mnt/l1_s11_ost0012_0018/brick<br>
Brick20: service11-ib1:/mnt/l1_s11_ost0013_0019/brick<br>
Brick21: service11-ib1:/mnt/l1_s11_ost0014_0020/brick<br>
Brick22: service12-ib1:/mnt/l1_s12_ost0015_0021/brick<br>
Brick23: service12-ib1:/mnt/l1_s12_ost0016_0022/brick<br>
Brick24: service12-ib1:/mnt/l1_s12_ost0017_0023/brick<br>
Brick25: service13-ib1:/mnt/l1_s13_ost0018_0024/brick<br>
Brick26: service13-ib1:/mnt/l1_s13_ost0019_0025/brick<br>
Brick27: service13-ib1:/mnt/l1_s13_ost001a_0026/brick<br>
Brick28: service13-ib1:/mnt/l1_s13_ost001b_0027/brick<br>
Options Reconfigured:<br>
performance.stat-prefetch: off<br>
[root@phoenix-smc ~]# ssh service4 gluster vol status home<br>
Status of volume: home<br>
Gluster process Port Online Pid<br>
------------------------------------------------------------------------------<br>
Brick service4-ib1:/mnt/l1_s4_ost0000_0000/brick 49156 Y 8028<br>
Brick service4-ib1:/mnt/l1_s4_ost0001_0001/brick 49157 Y 8040<br>
Brick service4-ib1:/mnt/l1_s4_ost0002_0002/brick 49158 Y 8052<br>
Brick service5-ib1:/mnt/l1_s5_ost0003_0003/brick 49163 Y 6526<br>
Brick service5-ib1:/mnt/l1_s5_ost0004_0004/brick 49164 Y 6533<br>
Brick service5-ib1:/mnt/l1_s5_ost0005_0005/brick 49165 Y 6540<br>
Brick service5-ib1:/mnt/l1_s5_ost0006_0006/brick 49166 Y 6547<br>
Brick service6-ib1:/mnt/l1_s6_ost0007_0007/brick 49155 Y 8027<br>
Brick service6-ib1:/mnt/l1_s6_ost0008_0008/brick 49156 Y 8039<br>
Brick service6-ib1:/mnt/l1_s6_ost0009_0009/brick 49157 Y 8051<br>
Brick service7-ib1:/mnt/l1_s7_ost000a_0010/brick 49160 Y 9067<br>
Brick service7-ib1:/mnt/l1_s7_ost000b_0011/brick 49161 Y 9074<br>
Brick service7-ib1:/mnt/l1_s7_ost000c_0012/brick 49162 Y 9081<br>
Brick service7-ib1:/mnt/l1_s7_ost000d_0013/brick 49163 Y 9088<br>
Brick service10-ib1:/mnt/l1_s10_ost000e_0014/brick 49155 Y 8108<br>
Brick service10-ib1:/mnt/l1_s10_ost000f_0015/brick 49156 Y 8120<br>
Brick service10-ib1:/mnt/l1_s10_ost0010_0016/brick 49157 Y 8132<br>
Brick service11-ib1:/mnt/l1_s11_ost0011_0017/brick 49160 Y 8070<br>
Brick service11-ib1:/mnt/l1_s11_ost0012_0018/brick 49161 Y 8082<br>
Brick service11-ib1:/mnt/l1_s11_ost0013_0019/brick 49162 Y 8094<br>
Brick service11-ib1:/mnt/l1_s11_ost0014_0020/brick 49163 Y 8106<br>
Brick service12-ib1:/mnt/l1_s12_ost0015_0021/brick 49155 Y 8072<br>
Brick service12-ib1:/mnt/l1_s12_ost0016_0022/brick 49156 Y 8084<br>
Brick service12-ib1:/mnt/l1_s12_ost0017_0023/brick 49157 Y 8096<br>
Brick service13-ib1:/mnt/l1_s13_ost0018_0024/brick 49156 Y 8156<br>
Brick service13-ib1:/mnt/l1_s13_ost0019_0025/brick 49157 Y 8168<br>
Brick service13-ib1:/mnt/l1_s13_ost001a_0026/brick 49158 Y 8180<br>
Brick service13-ib1:/mnt/l1_s13_ost001b_0027/brick 49159 Y 8192<br>
NFS Server on localhost 2049 Y 8065<br>
NFS Server on service6-ib1 2049 Y 8064<br>
NFS Server on service13-ib1 2049 Y 8205<br>
NFS Server on service11-ib1 2049 Y 11833<br>
NFS Server on service12-ib1 2049 Y 8109<br>
NFS Server on service10-ib1 2049 Y 8145<br>
NFS Server on service5-ib1 2049 Y 6554<br>
NFS Server on service7-ib1 2049 Y 15140<br>
<br>
Task Status of Volume home<br>
------------------------------------------------------------------------------<br>
Task : Rebalance <br>
ID : 88f1e627-c7cc-40fc-b4a8-7672a6151712<br>
Status : completed <br>
<br>
[root@phoenix-smc ~]#<br>
<br>
</div>
</body>
</html>