[Gluster-users] GlusterFS 3.3 span volume, op(RENAME(8)) fails, "Transport endpoint is not connected"

Anand Avati anand.avati at gmail.com
Wed Jun 13 17:37:09 UTC 2012


Can you get a process state dump of the brick process hosting the
'0-vol_home-client-2' subvolume? That should give some clues about what
happened to the missing rename call.
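
Off the top of my head, something like this should do it (defaults from
memory, so adjust paths if your build differs). To make every brick process
in the volume dump its state, run on one of the server nodes:

# gluster volume statedump vol_home

Or, to dump just the suspect brick: vol_home-client-2 should be the third
brick listed in 'gluster volume info', i.e. storage0-dev.cssd.pitt.edu:/brick/1.
On storage0-dev, find that glusterfsd PID and send it SIGUSR1, which writes a
state dump without disturbing the process:

# ps ax | grep 'glusterfsd.*brick/1'
# kill -USR1 <pid of that glusterfsd>

The dump files should land in the statedump directory on that storage node
(/tmp by default, unless server.statedump-path was changed). Send the one for
that brick and we can see whether the RENAME frame is still sitting in one of
the translators.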

Avati

On Wed, Jun 13, 2012 at 7:02 AM, Jeff White <jaw171 at pitt.edu> wrote:

>  I recently upgraded my dev cluster to 3.3.  To do this I copied the data
> out of the old volume onto a bare disk, wiped out everything related to
> Gluster, installed the 3.3 packages, created a new volume (I wanted to
> change my brick layout), and then copied the data back into the new volume.
> Previously everything worked fine, but now my users are complaining of
> random errors when compiling software.
>
> I enabled debug logging for the clients and I see this:
>
> x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)
> [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:12:02.783526] D [mem-pool.c:457:mem_get]
> (-->/usr/lib64/libglusterfs.so.0(dict_unserialize+0x28d) [0x36e361413d]
> (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3]
> (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b])))
> 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:12:02.783584] D [mem-pool.c:457:mem_get]
> (-->/usr/lib64/libglusterfs.so.0(dict_unserialize+0x28d) [0x36e361413d]
> (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3]
> (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b])))
> 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:12:45.726083] D [client-handshake.c:184:client_start_ping]
> 0-vol_home-client-0: returning as transport is already disconnected OR
> there are no frames (0 || 0)
> [2012-06-12 17:12:45.726154] D [client-handshake.c:184:client_start_ping]
> 0-vol_home-client-3: returning as transport is already disconnected OR
> there are no frames (0 || 0)
> [2012-06-12 17:12:45.726171] D [client-handshake.c:184:client_start_ping]
> 0-vol_home-client-1: returning as transport is already disconnected OR
> there are no frames (0 || 0)
> *[2012-06-12 17:15:35.888437] E [rpc-clnt.c:208:call_bail]
> 0-vol_home-client-2: bailing out frame type(GlusterFS 3.1) op(RENAME(8))
> xid = 0x2015421x sent = 2012-06-12 16:45:26.237621. timeout = 1800*
> [2012-06-12 17:15:35.888507] W
> [client3_1-fops.c:2385:client3_1_rename_cbk] 0-vol_home-client-2: remote
> operation failed: Transport endpoint is not connected
> [2012-06-12 17:15:35.888529] W [dht-rename.c:478:dht_rename_cbk]
> 0-vol_home-dht:
> /sam/senthil/genboree/SupportingPkgs/gcc-3.4.6/x86_64-unknown-linux-gnu/32/libjava/java/net/SocketException.class.tmp:
> rename on vol_home-client-2 failed (Transport endpoint is not connected)
> [2012-06-12 17:15:35.889803] W [fuse-bridge.c:1516:fuse_rename_cbk]
> 0-glusterfs-fuse: 2776710:
> /sam/senthil/genboree/SupportingPkgs/gcc-3.4.6/x86_64-unknown-linux-gnu/32/libjava/java/net/SocketException.class.tmp
> ->
> /sam/senthil/genboree/SupportingPkgs/gcc-3.4.6/x86_64-unknown-linux-gnu/32/libjava/java/net/SocketException.class
> => -1 (Transport endpoint is not connected)
> [2012-06-12 17:15:35.890002] D [mem-pool.c:457:mem_get]
> (-->/usr/lib64/libglusterfs.so.0(dict_new+0xb) [0x36e3613d6b]
> (-->/usr/lib64/libglusterfs.so.0(get_new_dict_full+0x27) [0x36e3613c67]
> (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b])))
> 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:15:35.890167] D [mem-pool.c:457:mem_get]
> (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d)
> [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163)
> [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)
> [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:15:35.890258] D [mem-pool.c:457:mem_get]
> (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d)
> [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163)
> [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)
> [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:15:35.890311] D [mem-pool.c:457:mem_get]
> (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d)
> [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163)
> [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)
> [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> [2012-06-12 17:15:35.890363] D [mem-pool.c:457:mem_get]
> (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d)
> [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163)
> [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)
> [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
> ** and so on, more of the same...
>
> If I enable debug logging on the bricks I see thousands of these lines
> every minute and I'm forced to disable the logging:
>
> [2012-06-12 15:32:45.760598] D [io-threads.c:268:iot_schedule]
> 0-vol_home-io-threads: LOOKUP scheduled as fast fop
>
> Here's my config:
>
> # gluster volume info
> Volume Name: vol_home
> Type: Distribute
> Volume ID: 07ec60be-ec0c-4579-a675-069bb34c12ab
> Status: Started
> Number of Bricks: 4
> Transport-type: tcp
> Bricks:
> Brick1: storage0-dev.cssd.pitt.edu:/brick/0
> Brick2: storage1-dev.cssd.pitt.edu:/brick/2
> Brick3: storage0-dev.cssd.pitt.edu:/brick/1
> Brick4: storage1-dev.cssd.pitt.edu:/brick/3
> Options Reconfigured:
> diagnostics.brick-log-level: INFO
> diagnostics.client-log-level: INFO
> features.limit-usage: /home/cssd/jaw171:50GB,/cssd:200GB,/cssd/jaw171:75GB
> nfs.rpc-auth-allow: 10.54.50.*,127.*
> auth.allow: 10.54.50.*,127.*
> performance.io-cache: off
> cluster.min-free-disk: 5
> performance.cache-size: 128000000
> features.quota: on
> nfs.disable: on
>
> # rpm -qa | grep gluster
> glusterfs-fuse-3.3.0-1.el6.x86_64
> glusterfs-server-3.3.0-1.el6.x86_64
> glusterfs-3.3.0-1.el6.x86_64
>
> Name resolution works on every node, every node can ping every other node
> by name, no firewalls are running anywhere, and there are no disk errors on
> the storage nodes.
>
> Did the way I copied data out of one volume and back into another cause
> this (some xattr problem)?  What else could be causing it?  I'm looking to
> go to production with GlusterFS on a 242-node (soon to grow) HPC cluster at
> the end of this month.
>
> Also, one of my co-workers improved upon an existing remote quota viewer
> written in Python.  I'll post the code soon for those interested.
>
> --
> Jeff White - Linux/Unix Systems Engineer
> University of Pittsburgh - CSSD
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>
>
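
Regarding the xattr question: if the data went back in through a client mount
of the new volume, the bricks should have picked up fresh gfid/layout xattrs
and the copy itself shouldn't be the problem.  It would only get ugly if
anything was copied straight onto the bricks, or with a tool that carried the
old trusted.* attributes along.  A quick sanity check on a storage node (the
path below is just an example, use one of the files from your client log):

# getfattr -d -m . -e hex /brick/1/<path as seen from the client>

Every file should show a trusted.gfid, directories should also show a
trusted.glusterfs.dht layout, and the brick root should carry the
trusted.glusterfs.volume-id of the new volume.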
