From manu at netbsd.org Tue May 1 02:18:53 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Tue, 1 May 2012 04:18:53 +0200
Subject: [Gluster-devel] Fwd: Re: Rejected NetBSD patches
In-Reply-To: <4F9EED0C.2080203@redhat.com>
Message-ID: <1kjeekq.1nkt3n11wtalkgM%manu@netbsd.org>

Kaleb S. KEITHLEY wrote:

> I haven't seen anything so far that needs to discriminate between NetBSD
> and FreeBSD, but if we come across one, we can use __NetBSD__ and
> __FreeBSD__ inside GF_BSD_HOST_OS.

If you look at the code, NetBSD makes its way using GF_BSD_HOST_OS or
GF_LINUX_HOST_OS, depending on the situation. NetBSD and FreeBSD forked
19 years ago; they have had time to diverge.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From manu at netbsd.org Tue May 1 03:21:28 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Tue, 1 May 2012 05:21:28 +0200
Subject: [Gluster-devel] qa39 crash
In-Reply-To: <1kjdvf9.1o294sj12c16nlM%manu@netbsd.org>
Message-ID: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org>

Emmanuel Dreyfus wrote:

> I got a crash client-side. It happens in pthread_spin_lock() and I
> recall fixing that kind of issue for an uninitialized lock.

I added printf, and inode is NULL in mdc_inode_pre(), therefore this is
not an uninitialized lock problem.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From manu at netbsd.org Tue May 1 05:31:57 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Tue, 1 May 2012 07:31:57 +0200
Subject: [Gluster-devel] qa39 crash
In-Reply-To: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org>
Message-ID: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org>

Emmanuel Dreyfus wrote:

> I added printf, and inode is NULL in mdc_inode_pre(), therefore this is
> not an uninitialized lock problem.

Indeed, it is the mdc_local_t structure that seems uninitialized:

(gdb) frame 3
#3  0xbaa0ecb5 in mdc_setattr_cbk (frame=0xbb7e32a0, cookie=0xbb7a4380,
    this=0xba3e3000, op_ret=0, op_errno=0, prebuf=0xb940a57c,
    postbuf=0xb940a5e4, xdata=0x0) at md-cache.c:1365
1365         mdc_inode_iatt_set (this, local->loc.inode, postbuf);

(gdb) print *(mdc_local_t *)frame->local
$6 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0,
    gfid = '\000', pargfid = '\000'}, loc2 = {path = 0x0, name = 0x0,
    inode = 0x0, parent = 0x0, gfid = '\000', pargfid = '\000'},
    fd = 0xb8f9d230, linkname = 0x0, xattr = 0x0}

And indeed local->loc is not initialized in mdc_fsetattr(). I suspect
there is a way of obtaining it from fd, but this is getting beyond my
knowledge of glusterfs internals.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From vbellur at redhat.com Wed May 2 04:21:08 2012
From: vbellur at redhat.com (Vijay Bellur)
Date: Wed, 02 May 2012 09:51:08 +0530
Subject: [Gluster-devel] qa39 crash
In-Reply-To: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org>
References: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org>
Message-ID: <4FA0B634.5090605@redhat.com>

On 05/01/2012 11:01 AM, Emmanuel Dreyfus wrote:
> Emmanuel Dreyfus wrote:
>
>> I added printf, and inode is NULL in mdc_inode_pre(), therefore this is
>> not an uninitialized lock problem.
>
> Indeed, it is the mdc_local_t structure that seems uninitialized:
>
> (gdb) frame 3
> #3  0xbaa0ecb5 in mdc_setattr_cbk (frame=0xbb7e32a0, cookie=0xbb7a4380,
>     this=0xba3e3000, op_ret=0, op_errno=0, prebuf=0xb940a57c,
>     postbuf=0xb940a5e4, xdata=0x0) at md-cache.c:1365
> 1365         mdc_inode_iatt_set (this, local->loc.inode, postbuf);
>
> (gdb) print *(mdc_local_t *)frame->local
> $6 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0,
>     gfid = '\000', pargfid = '\000'}, loc2 = {path = 0x0, name = 0x0,
>     inode = 0x0, parent = 0x0, gfid = '\000', pargfid = '\000'},
>     fd = 0xb8f9d230, linkname = 0x0, xattr = 0x0}
>
> And indeed local->loc is not initialized in mdc_fsetattr(). I suspect
> there is a way of obtaining it from fd, but this is getting beyond my
> knowledge of glusterfs internals.
>

Do you have a test case that causes this crash?

Vijay

From anand.avati at gmail.com Wed May 2 05:29:22 2012
From: anand.avati at gmail.com (Anand Avati)
Date: Tue, 1 May 2012 22:29:22 -0700
Subject: [Gluster-devel] qa39 crash
In-Reply-To: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org>
References: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org>
	<1kjen07.9rmdeg1akdng7M%manu@netbsd.org>
Message-ID:

Can you confirm if this fixes (obvious bug) -

diff --git a/xlators/performance/md-cache/src/md-cache.c b/xlators/performance/md-cache/src/md-cache.c
index 9ef599a..66c0bf3 100644
--- a/xlators/performance/md-cache/src/md-cache.c
+++ b/xlators/performance/md-cache/src/md-cache.c
@@ -1423,7 +1423,7 @@ mdc_fsetattr (call_frame_t *frame, xlator_t *this, fd_t *fd,

         local->fd = fd_ref (fd);

-        STACK_WIND (frame, mdc_setattr_cbk,
+        STACK_WIND (frame, mdc_fsetattr_cbk,
                     FIRST_CHILD(this), FIRST_CHILD(this)->fops->fsetattr,
                     fd, stbuf, valid, xdata);

On Mon, Apr 30, 2012 at 10:31 PM, Emmanuel Dreyfus wrote:

> Emmanuel Dreyfus wrote:
>
> > I added printf, and inode is NULL in mdc_inode_pre() therefore this is
> > not an uninitialized lock problem.
>
> Indeed, it is the mdc_local_t structure that seems uninitialized:
>
> (gdb) frame 3
> #3  0xbaa0ecb5 in mdc_setattr_cbk (frame=0xbb7e32a0, cookie=0xbb7a4380,
>     this=0xba3e3000, op_ret=0, op_errno=0, prebuf=0xb940a57c,
>     postbuf=0xb940a5e4, xdata=0x0) at md-cache.c:1365
> 1365         mdc_inode_iatt_set (this, local->loc.inode, postbuf);
>
> (gdb) print *(mdc_local_t *)frame->local
> $6 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0,
>     gfid = '\000', pargfid = '\000'}, loc2 = {path = 0x0, name = 0x0,
>     inode = 0x0, parent = 0x0, gfid = '\000', pargfid = '\000'},
>     fd = 0xb8f9d230, linkname = 0x0, xattr = 0x0}
>
> And indeed local->loc is not initialized in mdc_fsetattr(). I suspect
> there is a way of obtaining it from fd, but this is getting beyond my
> knowledge of glusterfs internals.
>
>
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> manu at netbsd.org
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kshlmster at gmail.com Wed May 2 05:35:02 2012
From: kshlmster at gmail.com (Kaushal M)
Date: Wed, 2 May 2012 11:05:02 +0530
Subject: [Gluster-devel] 3.3 and address family
In-Reply-To:
References: <1kj84l9.19kzk6dfdsrtsM%manu@netbsd.org>
Message-ID:

Didn't send the last message to list. Resending.

On Wed, May 2, 2012 at 10:58 AM, Kaushal M wrote:

> Hi Emmanuel,
>
> Took a look at your patch for fixing this problem.
> It solves it for the brick glusterfsd processes. But glusterd also
> spawns and communicates with the nfs server & self-heal daemon
> processes. The proper xlator-option is not set for these. This might be
> the cause. These processes are started in glusterd_nodesvc_start() in
> glusterd-utils, which is where you could look.
>
> Thanks,
> Kaushal
>
> On Fri, Apr 27, 2012 at 10:31 PM, Emmanuel Dreyfus wrote:
>
>> Hi
>>
>> I am still trying on 3.3.0qa39, and now I have an address family issue:
>> glusterfs defaults to inet6 transport while the machine is not configured
>> for IPv6.
>>
>> I added option transport.address-family inet in glusterfs/glusterd.vol
>> and now glusterd starts with an IPv4 address, but unfortunately,
>> communication with the spawned glusterfsd does not stick to the same
>> address family: I can see packets going from ::1.1023 to ::1.24007 and
>> they are rejected since I used transport.address-family inet.
>>
>> I need to tell glusterfs to use the same address family. I already did a
>> patch for exactly the same problem some time ago, this is not very
>> difficult, but it would save me some time if someone could tell me where
>> I should look in the code.
>>
>> --
>> Emmanuel Dreyfus
>> http://hcpnet.free.fr/pubz
>> manu at netbsd.org
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at nongnu.org
>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From manu at netbsd.org Wed May 2 09:30:32 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Wed, 2 May 2012 09:30:32 +0000
Subject: [Gluster-devel] qa39 crash
In-Reply-To:
References: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org>
	<1kjen07.9rmdeg1akdng7M%manu@netbsd.org>
Message-ID: <20120502093032.GI3677@homeworld.netbsd.org>

On Tue, May 01, 2012 at 10:29:22PM -0700, Anand Avati wrote:
> Can you confirm if this fixes (obvious bug) -

I do not crash anymore, but I spotted another bug, I do not know if it
is related: removing owner write access from a non-empty file open with
write access fails with EPERM.

Here is my test case. It works fine with glusterfs-3.2.6, but fchmod()
fails with EPERM on 3.3.0qa39 (header names below were lost in the
archive and have been restored to the ones these calls need):

#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <sysexits.h>
#include <unistd.h>

int main(void) {
        int fd;
        char buf[16];

        if ((fd = open("test.tmp", O_RDWR|O_CREAT, 0644)) == -1)
                err(EX_OSERR, "fopen failed");

        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                err(EX_OSERR, "write failed");

        if (fchmod(fd, 0444) == -1)
                err(EX_OSERR, "fchmod failed");

        if (close(fd) == -1)
                err(EX_OSERR, "close failed");

        return EX_OK;
}

--
Emmanuel Dreyfus
manu at netbsd.org

From xhernandez at datalab.es Wed May 2 10:55:37 2012
From: xhernandez at datalab.es (Xavier Hernandez)
Date: Wed, 02 May 2012 12:55:37 +0200
Subject: [Gluster-devel] Some questions about requisites of translators
Message-ID: <4FA112A9.1080101@datalab.es>

Hello,

I'm wondering if there are any requisites that translators must satisfy
to work correctly inside glusterfs.

In particular I need to know two things:

1. Are translators required to respect the order in which they receive
the requests ?

This is especially important in translators such as performance/io-threads
or caching ones. It seems that these translators can reorder requests. If
this is the case, is there any way to force some order between requests ?
Can inodelk/entrylk be used to force the order ?

2.
Are translators required to propagate callback arguments even if the result of the operation is an error ? and if an internal translator error occurs ? When a translator has multiple subvolumes, I've seen that some arguments, such as xdata, are replaced with NULL. This can be understood, but are regular translators (those that only have one subvolume) allowed to do that or must they preserve the value of xdata, even in the case of an internal error ? If this is not a requisite, xdata loses it's function of delivering back extra information. Thank you very much, Xavi From anand.avati at gmail.com Sat May 5 06:02:30 2012 From: anand.avati at gmail.com (Anand Avati) Date: Fri, 4 May 2012 23:02:30 -0700 Subject: [Gluster-devel] Some questions about requisites of translators In-Reply-To: <4FA112A9.1080101@datalab.es> References: <4FA112A9.1080101@datalab.es> Message-ID: On Wed, May 2, 2012 at 3:55 AM, Xavier Hernandez wrote: > Hello, > > I'm wondering if there are any requisites that translators must satisfy to > work correctly inside glusterfs. > > In particular I need to know two things: > > 1. Are translators required to respect the order in which they receive the > requests ? > > This is specially important in translators such as performance/io-threads > or caching ones. It seems that these translators can reorder requests. If > this is the case, is there any way to force some order between requests ? > can inodelk/entrylk be used to force the order ? > > Translators are not expected to maintain ordering of requests. The only translator which takes care of ordering calls is write-behind. After acknowledging back write requests it has to make sure future requests see the true "effect" as though the previous write actually completed. To that end, it queues future "dependent" requests till the write acknowledgement is received from the server. inodelk/entrylk calls help achieve synchronization among clients (by getting into a critical section) - just like a mutex. It is an arbitrator. It does not help for ordering of two calls. If one call must strictly complete after another call from your translator's point of view (i.e, if it has such a requirement), then the latter call's STACK_WIND must happen in the callback of the former's STACK_UNWIND path. There are no guarantees maintained by the system to ensure that a second STACK_WIND issued right after a first STACK_WIND will complete and callback in the same order. Write-behind does all its ordering gimmicks only because it STACK_UNWINDs a write call prematurely and therefore must maintain the causal effects by means of queueing new requests behind the downcall towards the server. > 2. Are translators required to propagate callback arguments even if the > result of the operation is an error ? and if an internal translator error > occurs ? > > Usually no. If op_ret is -1, only op_errno is expected to be a usable value. Rest of the callback parameters are junk. > When a translator has multiple subvolumes, I've seen that some arguments, > such as xdata, are replaced with NULL. This can be understood, but are > regular translators (those that only have one subvolume) allowed to do that > or must they preserve the value of xdata, even in the case of an internal > error ? > > It is best to preserve the arguments unless you know specifically what you are doing. In case of error, all the non-op_{ret,errno} arguments are typically junk, including xdata. > If this is not a requisite, xdata loses it's function of delivering back > extra information. 
> > Can you explain? Are you seeing a use case for having a valid xdata in the callback even with op_ret == -1? Thanks, Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From tsato at valinux.co.jp Mon May 7 04:17:45 2012 From: tsato at valinux.co.jp (Tomoaki Sato) Date: Mon, 07 May 2012 13:17:45 +0900 Subject: [Gluster-devel] showmount reports many entries (Re: glusterfs-3.3.0qa39 released) In-Reply-To: <4F9A98E8.80400@gluster.com> References: <20120427053612.E08671804F5@build.gluster.com> <4F9A6422.3010000@valinux.co.jp> <4F9A98E8.80400@gluster.com> Message-ID: <4FA74CE9.8010805@valinux.co.jp> (2012/04/27 22:02), Vijay Bellur wrote: > On 04/27/2012 02:47 PM, Tomoaki Sato wrote: >> Vijay, >> >> I have been testing gluster-3.3.0qa39 NFS with 4 CentOS 6.2 NFS clients. >> The test set is like following: >> 1) All 4 clients mount 64 directories. (total 192 directories) >> 2) 192 procs runs on the 4 clients. each proc create a new unique file and write 1GB data to the file. (total 192GB) >> 3) All 4 clients umount 64 directories. >> >> The test finished successfully but showmount command reported many entries in spite of there were no NFS clients remain. >> Then I have restarted gluster related daemons. >> After restarting, showmount command reports no entries. >> Any insight into this is much appreciated. > > > http://review.gluster.com/2973 should fix this. Can you please confirm? > > > Thanks, > Vijay Vijay, I have confirmed that following instructions with c3a16c32. # showmount one Hosts on one: # mkdir /tmp/mnt # mount one:/one /tmp/mnt # showmount one Hosts on one: 172.17.200.108 # umount /tmp/mnt # showmount one Hosts on one: # And the test set has started running. It will take a couple of days to finish. by the way, I did following instructions to build RPM packages on a CentOS 5.6 x86_64 host. # yum install python-ctypes ncureses-devel readline-devel libibverbs-devel # git clone -b c3a16c32 ssh://@git.gluster.com/glusterfs.git glusterfs-3git # tar zcf /usr/src/redhat/SOURCES/glusterfs-3bit.tar.gz glusterfs-3git # rpmbuild -bb /usr/src/redhat/SOURCES/glusterfs-3git.tar.gz Thanks, Tomo Sato From manu at netbsd.org Mon May 7 04:39:22 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 7 May 2012 04:39:22 +0000 Subject: [Gluster-devel] Fixing Address family mess Message-ID: <20120507043922.GA10874@homeworld.netbsd.org> Hi Quick summary of the problem: when using transport-type socket with transport.address-family unspecified, glusterfs binds sockets with AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the kernel prefers. At mine it uses AF_INET6, while the machine is not configured to use IPv6. As a result, glusterfs client cannot connect to glusterfs server. A workaround is to use option transport.address-family inet in glusterfsd/glusterd.vol but that option must also be specified in all volume files for all bricks and FUSE client, which is unfortunate because they are automatically generated. I proposed a patch so that glusterd transport.address-family setting is propagated to various places: http://review.gluster.com/3261 That did not meet consensus. Jeff Darcy notes that we should be able to listen both on AF_INET and AF_INET6 sockets at the same time. I had a look at the code, and indeed it could easily be done. The only trouble is how to specify the listeners. For now option transport defaults to socket,rdma. I suggest we add socket families in that specification. 
We would then have this default: option transport socket/inet,socket/inet6,rdma With the following semantics: socket -> AF_UNSPEC socket (backward comaptibility) socket/inet -> AF_INET socket socket/inet6 -> AF_INET6 socket socket/sdp -> AF_SDP socket rdma -> sameas before Any opinion on that plan? Please comment before I writa code, it will save me some time is the proposal is wrong. -- Emmanuel Dreyfus manu at netbsd.org From xhernandez at datalab.es Mon May 7 08:07:52 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 07 May 2012 10:07:52 +0200 Subject: [Gluster-devel] Some questions about requisites of translators In-Reply-To: References: <4FA112A9.1080101@datalab.es> Message-ID: <4FA782D8.2000100@datalab.es> On 05/05/2012 08:02 AM, Anand Avati wrote: > > > On Wed, May 2, 2012 at 3:55 AM, Xavier Hernandez > > wrote: > > Hello, > > I'm wondering if there are any requisites that translators must > satisfy to work correctly inside glusterfs. > > In particular I need to know two things: > > 1. Are translators required to respect the order in which they > receive the requests ? > > This is specially important in translators such as > performance/io-threads or caching ones. It seems that these > translators can reorder requests. If this is the case, is there > any way to force some order between requests ? can inodelk/entrylk > be used to force the order ? > > > Translators are not expected to maintain ordering of requests. The > only translator which takes care of ordering calls is write-behind. > After acknowledging back write requests it has to make sure future > requests see the true "effect" as though the previous write actually > completed. To that end, it queues future "dependent" requests till the > write acknowledgement is received from the server. > > inodelk/entrylk calls help achieve synchronization among clients (by > getting into a critical section) - just like a mutex. It is an > arbitrator. It does not help for ordering of two calls. If one call > must strictly complete after another call from your translator's point > of view (i.e, if it has such a requirement), then the latter call's > STACK_WIND must happen in the callback of the former's STACK_UNWIND > path. There are no guarantees maintained by the system to ensure that > a second STACK_WIND issued right after a first STACK_WIND will > complete and callback in the same order. Write-behind does all its > ordering gimmicks only because it STACK_UNWINDs a write call > prematurely and therefore must maintain the causal effects by means of > queueing new requests behind the downcall towards the server. Good to know > 2. Are translators required to propagate callback arguments even > if the result of the operation is an error ? and if an internal > translator error occurs ? > > > Usually no. If op_ret is -1, only op_errno is expected to be a usable > value. Rest of the callback parameters are junk. > > When a translator has multiple subvolumes, I've seen that some > arguments, such as xdata, are replaced with NULL. This can be > understood, but are regular translators (those that only have one > subvolume) allowed to do that or must they preserve the value of > xdata, even in the case of an internal error ? > > > It is best to preserve the arguments unless you know specifically what > you are doing. In case of error, all the non-op_{ret,errno} arguments > are typically junk, including xdata. > > If this is not a requisite, xdata loses it's function of > delivering back extra information. > > > Can you explain? 
Are you seeing a use case for having a valid xdata in > the callback even with op_ret == -1? > As a part of a translator that I'm developing that works with multiple subvolumes, I need to implement some healing support to mantain data coherency (similar to AFR). After some thought, I decided that it could be advantageous to use a dedicated healing translator located near the bottom of the translators stack on the servers. This translator won't work by itself, it only adds support to be used by a higher level translator, which have to manage the logic of the healing and decide when a node needs to be healed. To do this, sometimes I need to return an error because an operation cannot be completed due to some condition related with healing itself (not with the underlying storage). However I need to send some specific healing information to let the upper translator know how it has to handle the detected condition. I cannot send a success answer because intermediate translators could take the fake data as valid and they could begin to operate incorrectly or even create inconsistencies. The other alternative is to use op_errno to encode the extra data, but this will also be difficult, even impossible in some cases, due to the amount of data and the complexity to combine it with an error code without mislead intermediate translators with strange or invalid error codes. I talked with John Mark about this translator and he suggested me to discuss it over the list. Therefore I'll initiate another thread to expose in more detail how it works and I would appreciate very much your opinion, and that of the other developers, about it. Especially if it can really be faster/safer that other solutions or not, or if you find any problem or have any suggestion to improve it. I think it could also be used by AFR and any future translator that may need some healing capabilities. Thank you very much, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From vijay at build.gluster.com Mon May 7 08:15:50 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Mon, 7 May 2012 01:15:50 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa40 released Message-ID: <20120507081553.5AA00100C5@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa40/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz This release is made off v3.3.0qa40 From vijay at gluster.com Mon May 7 10:31:09 2012 From: vijay at gluster.com (Vijay Bellur) Date: Mon, 07 May 2012 16:01:09 +0530 Subject: [Gluster-devel] glusterfs-3.3.0qa40 released In-Reply-To: <20120507081553.5AA00100C5@build.gluster.com> References: <20120507081553.5AA00100C5@build.gluster.com> Message-ID: <4FA7A46D.2050506@gluster.com> This release is done by reverting commit 7d0397c2144810c8a396e00187a6617873c94002 as replace-brick and quota were not functioning with that commit. Hence the tag for this qa release would not be available in github. If you are interested in creating an equivalent of this qa release from git, it would be c4dadc74fd1d1188f123eae7f2b6d6f5232e2a0f - commit 7d0397c2144810c8a396e00187a6617873c94002. 
Thanks, Vijay On 05/07/2012 01:45 PM, Vijay Bellur wrote: > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa40/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz > > This release is made off v3.3.0qa40 > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From jdarcy at redhat.com Mon May 7 13:16:38 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 09:16:38 -0400 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: <20120507043922.GA10874@homeworld.netbsd.org> References: <20120507043922.GA10874@homeworld.netbsd.org> Message-ID: <4FA7CB36.6040701@redhat.com> On 05/07/2012 12:39 AM, Emmanuel Dreyfus wrote: > Quick summary of the problem: when using transport-type socket with > transport.address-family unspecified, glusterfs binds sockets with > AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the > kernel prefers. At mine it uses AF_INET6, while the machine is not > configured to use IPv6. As a result, glusterfs client cannot connect > to glusterfs server. > > A workaround is to use option transport.address-family inet in > glusterfsd/glusterd.vol but that option must also be specified in > all volume files for all bricks and FUSE client, which is > unfortunate because they are automatically generated. I proposed a > patch so that glusterd transport.address-family setting is propagated > to various places: http://review.gluster.com/3261 > > That did not meet consensus. Jeff Darcy notes that we should be able > to listen both on AF_INET and AF_INET6 sockets at the same time. I > had a look at the code, and indeed it could easily be done. The only > trouble is how to specify the listeners. For now option transport > defaults to socket,rdma. I suggest we add socket families in that > specification. We would then have this default: > option transport socket/inet,socket/inet6,rdma > > With the following semantics: > socket -> AF_UNSPEC socket (backward comaptibility) > socket/inet -> AF_INET socket > socket/inet6 -> AF_INET6 socket > socket/sdp -> AF_SDP socket > rdma -> sameas before > > Any opinion on that plan? Please comment before I writa code, it will > save me some time is the proposal is wrong. I think it looks like the right solution. I understand that keeping the address-family multiplexing entirely in the socket code would be more complex, since it changes the relationship between transport instances and file descriptors (and threads in the SSL/multi-thread case). That's unfortunate, but far from the most unfortunate thing about our transport code. I do wonder whether we should use '/' as the separator, since it kind of implies the same kind of relationships between names and paths that we use for translator names - e.g. cluster/dht is actually used as part of the actual path for dht.so - and in this case that relationship doesn't actually exist. Another idea, which I don't actually like any better but which I'll suggest for completeness, would be to express the list of address families via an option: option transport.socket.address-family inet6 Now that I think about it, another benefit is that it supports multiple instances of the same address family with different options, e.g. to support segregated networks. Obviously we lack higher-level support for that right now, but if that should ever change then it would be nice to have the right low-level infrastructure in place for it. 
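To make the dual-listener idea concrete, here is a rough standalone sketch
of binding one AF_INET and one AF_INET6 listener on the same port with
plain POSIX sockets. It is only an illustration of the semantics being
discussed, not actual rpc-transport code, and the function names and error
handling are made up for the sketch:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* One listening socket per address family; 24007 is the usual
 * glusterd port, everything else here is illustrative. */
static int
make_listener (int family, int port)
{
        int fd, on = 1;

        fd = socket (family, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        setsockopt (fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof (on));

        if (family == AF_INET6) {
                /* Keep the v6 listener v6-only so the v4 listener can
                 * bind the same port independently. */
                struct sockaddr_in6 a6 = { 0 };

                setsockopt (fd, IPPROTO_IPV6, IPV6_V6ONLY, &on, sizeof (on));
                a6.sin6_family = AF_INET6;
                a6.sin6_addr = in6addr_any;
                a6.sin6_port = htons (port);
                if (bind (fd, (struct sockaddr *) &a6, sizeof (a6)) < 0)
                        goto fail;
        } else {
                struct sockaddr_in a4 = { 0 };

                a4.sin_family = AF_INET;
                a4.sin_addr.s_addr = htonl (INADDR_ANY);
                a4.sin_port = htons (port);
                if (bind (fd, (struct sockaddr *) &a4, sizeof (a4)) < 0)
                        goto fail;
        }

        if (listen (fd, 10) < 0)
                goto fail;
        return fd;

fail:
        close (fd);
        return -1;
}

int
main (void)
{
        int v4 = make_listener (AF_INET, 24007);
        int v6 = make_listener (AF_INET6, 24007);

        /* A real transport would register both fds with the event loop
         * and accept() on whichever becomes readable. */
        printf ("inet fd=%d, inet6 fd=%d\n", v4, v6);
        return 0;
}

Whether the two listeners end up as two transport instances or as one
instance holding two file descriptors is exactly the part that needs the
higher-level plumbing discussed above.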
From jdarcy at redhat.com Mon May 7 14:43:47 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 10:43:47 -0400 Subject: [Gluster-devel] ZkFarmer Message-ID: <4FA7DFA3.1030300@redhat.com> I've long felt that our ways of dealing with cluster membership and staging of config changes is not quite as robust and scalable as we might want. Accordingly, I spent a bit of time a couple of weeks ago looking into the possibility of using ZooKeeper to do some of this stuff. Yeah, it brings in a heavy Java dependency, but when I looked at some lighter-weight alternatives they all seemed to be lacking in more important ways. Basically the idea was to do this: * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or point everyone at an existing ZooKeeper cluster. * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" merely updates ZK, and "peer status" merely reads from it). * Store config information in ZK *once* instead of regenerating volfiles etc. on every node (and dealing with the ugly cases where a node was down when the config change happened). * Set watches on ZK nodes to be notified when config changes happen, and respond appropriately. I eventually ran out of time and moved on to other things, but this or something like it (e.g. using Riak Core) still seems like a better approach than what we have. In that context, it looks like ZkFarmer[1] might be a big help. AFAICT someone else was trying to solve almost exactly the same kind of server/config problem that we have, and wrapped their solution into a library. Is this a direction other devs might be interested in pursuing some day, if/when time allows? [1] https://github.com/rs/zkfarmer From johnmark at redhat.com Mon May 7 19:35:54 2012 From: johnmark at redhat.com (John Mark Walker) Date: Mon, 07 May 2012 15:35:54 -0400 (EDT) Subject: [Gluster-devel] Server outage - review.gluster.com - please stand by In-Reply-To: <5299ff98-4714-4702-8f26-0d6f62441fe3@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: Greetings, Our iWeb server, which hosts review.gluster.com, is currently down. I have filed an urgent request to reboot the server in question. If you notice anything else working poorly, aside from review.gluster.com, please let me know ASAP. I'll send a note when services are back to normal. -JM From ian.latter at midnightcode.org Mon May 7 22:17:41 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 08:17:41 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205072217.q47MHfmr003867@singularity.tronunltd.com> Is there anything written up on why you/all want every node to be completely conscious of every other node? I could see a couple of architectures that might work better (be more scalable) if the config minutiae were either not necessary to be shared or shared in only cases where the config minutiae were a dependency. RE ZK, I have an issue with it not being a binary at the linux distribution level. This is the reason I don't currently have Gluster's geo replication module in place .. ----- Original Message ----- >From: "Jeff Darcy" >To: >Subject: [Gluster-devel] ZkFarmer >Date: Mon, 07 May 2012 10:43:47 -0400 > > I've long felt that our ways of dealing with cluster membership and staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. 
Yeah, it brings in a > heavy Java dependency, but when I looked at some lighter-weight alternatives > they all seemed to be lacking in more important ways. Basically the idea was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles etc. > on every node (and dealing with the ugly cases where a node was down when the > config change happened). > > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. In that context, it looks like ZkFarmer[1] might be a big > help. AFAICT someone else was trying to solve almost exactly the same kind of > server/config problem that we have, and wrapped their solution into a library. > Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From ab at gluster.com Mon May 7 22:55:22 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 15:55:22 -0700 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: <4FA7CB36.6040701@redhat.com> References: <20120507043922.GA10874@homeworld.netbsd.org> <4FA7CB36.6040701@redhat.com> Message-ID: On Mon, May 7, 2012 at 6:16 AM, Jeff Darcy wrote: > On 05/07/2012 12:39 AM, Emmanuel Dreyfus wrote: >> Quick summary of the problem: when using transport-type socket with >> transport.address-family unspecified, glusterfs binds sockets with >> AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the >> kernel prefers. At mine it uses AF_INET6, while the machine is not >> configured to use IPv6. As a result, glusterfs client cannot connect >> to glusterfs server. >> >> A workaround is to use option transport.address-family inet in >> glusterfsd/glusterd.vol but that option must also be specified in >> all volume files for all bricks and FUSE client, which is >> unfortunate because they are automatically generated. I proposed a >> patch so that glusterd transport.address-family setting is propagated >> to various places: http://review.gluster.com/3261 >> >> That did not meet consensus. Jeff Darcy notes that we should be able >> to listen both on AF_INET and AF_INET6 sockets at the same time. I >> had a look at the code, and indeed it could easily be done. The only >> trouble is how to specify the listeners. For now option transport >> defaults to socket,rdma. I suggest we add socket families in that >> specification. We would then have this default: >> ? ?option transport socket/inet,socket/inet6,rdma >> >> With the following semantics: >> ? ?socket -> AF_UNSPEC socket (backward comaptibility) >> ? ?socket/inet -> AF_INET socket >> ? ?socket/inet6 -> AF_INET6 socket >> ? ?socket/sdp -> AF_SDP socket >> ? ?rdma -> sameas before >> >> Any opinion on that plan? Please comment before I writa code, it will >> save me some time is the proposal is wrong. 
> > I think it looks like the right solution. I understand that keeping the > address-family multiplexing entirely in the socket code would be more complex, > since it changes the relationship between transport instances and file > descriptors (and threads in the SSL/multi-thread case). ?That's unfortunate, > but far from the most unfortunate thing about our transport code. > > I do wonder whether we should use '/' as the separator, since it kind of > implies the same kind of relationships between names and paths that we use for > translator names - e.g. cluster/dht is actually used as part of the actual path > for dht.so - and in this case that relationship doesn't actually exist. Another > idea, which I don't actually like any better but which I'll suggest for > completeness, would be to express the list of address families via an option: > > ? ? ? ?option transport.socket.address-family inet6 > > Now that I think about it, another benefit is that it supports multiple > instances of the same address family with different options, e.g. to support > segregated networks. ?Obviously we lack higher-level support for that right > now, but if that should ever change then it would be nice to have the right > low-level infrastructure in place for it. > Yes this should be controlled through volume options. "transport.address-family" is the right place to set it. Possible values are "inet, inet6, unix, inet-sdp". I would have named those user facing options as "ipv4, ipv6, sdp, all". If transport.address-family is not set. then if remote-host is set default to AF_INET (ipv4) if if transport.socket.connect-path is set default to AF_UNIX (unix) AF_UNSPEC is should be be taken as IPv4/IPv6. It is named appropriately. Default should be ipv4. I have not tested the patch. It is simply to explain how the changes should look like. I ignored legacy translators. When we implement concurrent support for multiple address-family (likely via mult-process model) we can worry about combinations. I agree. Combinations should look like "inet | inet6 | .." and not "inet / inet6 /.." -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: glusterfs-af-default-ipv4.diff Type: application/octet-stream Size: 9194 bytes Desc: not available URL: From jdarcy at redhat.com Tue May 8 00:43:31 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 20:43:31 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205072217.q47MHfmr003867@singularity.tronunltd.com> References: <201205072217.q47MHfmr003867@singularity.tronunltd.com> Message-ID: <4FA86C33.6020901@redhat.com> On 05/07/2012 06:17 PM, Ian Latter wrote: > Is there anything written up on why you/all want every > node to be completely conscious of every other node? > > I could see a couple of architectures that might work > better (be more scalable) if the config minutiae were > either not necessary to be shared or shared in only > cases where the config minutiae were a dependency. Well, these aren't exactly minutiae. Everything at file or directory level is fully distributed and will remain so. We're talking only about stuff at the volume or server level, which is very little data but very broad in scope. Trying to segregate that only adds complexity and subtracts convenience, compared to having it equally accessible to (or through) any server. 
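Just to make the earlier ZooKeeper suggestion a bit more concrete,
membership tracking through ephemeral znodes could look roughly like the
sketch below, using the ZooKeeper C client rather than Java. The paths,
hostnames and error handling are made up for illustration, it assumes the
C client's usual zookeeper/zookeeper.h header, and none of this is an
actual glusterd design:

#include <stdio.h>
#include <string.h>
#include <zookeeper/zookeeper.h>

/* Default watcher: a real glusterd would re-read the stored volume
 * configuration when this fires for the config znodes. */
static void
watcher (zhandle_t *zh, int type, int state, const char *path, void *ctx)
{
        printf ("zk event type=%d state=%d path=%s\n",
                type, state, path ? path : "");
}

int
main (void)
{
        char created[256];
        struct String_vector peers;
        /* The servers listed here play the role of the "first N nodes". */
        zhandle_t *zh = zookeeper_init ("zk1:2181,zk2:2181,zk3:2181",
                                        watcher, 30000, NULL, NULL, 0);
        if (!zh)
                return 1;

        /* "peer probe": create an ephemeral, sequential node under
         * /gluster/peers (the parent znodes are assumed to exist).
         * It disappears automatically when this peer's session dies. */
        zoo_create (zh, "/gluster/peers/peer-", "hostname-or-uuid",
                    strlen ("hostname-or-uuid"), &ZOO_OPEN_ACL_UNSAFE,
                    ZOO_EPHEMERAL | ZOO_SEQUENCE, created, sizeof (created));

        /* "peer status": list the children, leaving a watch behind so we
         * are notified when membership changes. */
        if (zoo_get_children (zh, "/gluster/peers", 1, &peers) == ZOK)
                printf ("%d peers in the cluster\n", peers.count);

        zookeeper_close (zh);
        return 0;
}

That would keep the Java dependency on the ZooKeeper servers only, which
is relevant to the packaging concern quoted below.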
> RE ZK, I have an issue with it not being a binary at > the linux distribution level. This is the reason I don't > currently have Gluster's geo replication module in > place .. What exactly is your objection to interpreted or JIT compiled languages? Performance? Security? It's an unusual position, to say the least. From glusterdevel at louiszuckerman.com Tue May 8 03:52:02 2012 From: glusterdevel at louiszuckerman.com (Louis Zuckerman) Date: Mon, 7 May 2012 23:52:02 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <4FA7DFA3.1030300@redhat.com> References: <4FA7DFA3.1030300@redhat.com> Message-ID: Here's another ZooKeeper management framework that may be useful. It's called Curator, developed by Netflix, and recently released as open source. It probably has a bit more inertia than ZkFarmer too. http://techblog.netflix.com/2011/11/introducing-curator-netflix-zookeeper.html https://github.com/Netflix/curator HTH -louis On Mon, May 7, 2012 at 10:43 AM, Jeff Darcy wrote: > I've long felt that our ways of dealing with cluster membership and > staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. Yeah, it brings > in a > heavy Java dependency, but when I looked at some lighter-weight > alternatives > they all seemed to be lacking in more important ways. Basically the idea > was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, > or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles > etc. > on every node (and dealing with the ugly cases where a node was down when > the > config change happened). > > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. In that context, it looks like ZkFarmer[1] might be a > big > help. AFAICT someone else was trying to solve almost exactly the same > kind of > server/config problem that we have, and wrapped their solution into a > library. > Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Tue May 8 04:27:24 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 14:27:24 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205080427.q484RO09004857@singularity.tronunltd.com> > > Is there anything written up on why you/all want every > > node to be completely conscious of every other node? > > > > I could see a couple of architectures that might work > > better (be more scalable) if the config minutiae were > > either not necessary to be shared or shared in only > > cases where the config minutiae were a dependency. > > Well, these aren't exactly minutiae. Everything at file or directory level is > fully distributed and will remain so. 
We're talking only about stuff at the > volume or server level, which is very little data but very broad in scope. > Trying to segregate that only adds complexity and subtracts convenience, > compared to having it equally accessible to (or through) any server. Sorry, I didn't have time this morning to add more detail. Note that my concern isn't bandwidth, its flexibility; the less knowledge needed the more I can do crazy things in user land, like running boxes in different data centres and randomly power things up and down, randomly re- address, randomly replace in-box hardware, load balance, NAT, etc. It makes a dynamic environment difficult to construct, for example, when Gluster rejects the same volume-id being presented to an existing cluster from a new GFID. But there's no need to go even that complicated, let me pull out an example of where shared knowledge may be unnecessary; The work that I was doing in Gluster (pre glusterd) drove out one primary "server" which fronted a Replicate volume of both its own Distribute volume and that of another server or two - themselves serving a single Distribute volume. So the client connected to one server for one volume and the rest was black box / magic (from the client's perspective - big fast storage in many locations); in that case it could be said that servers needed some shared knowledge, while the clients didn't. The equivalent configuration in a glusterd world (from my experiments) pushed all of the distribute knowledge out to the client and I haven't had a response as to how to add a replicate on distributed volumes in this model, so I've lost replicate. But in this world, the client must know about everything and the server is simply a set of served/presented disks (as volumes). In this glusterd world, then, why does any server need to know of any other server, if the clients are doing all of the heavy lifting? The additional consideration is where the server both consumes and presents, but this would be captured in the client side view. i.e. given where glusterd seems to be driving, this knowledge seems to be needed on the client side (within glusterfs, not glusterfsd). To my mind this breaks the gluster architecture that I read about 2009, but I need to stress that I didn't get a reply to the glusterd architecture question that I posted about a month ago; so I don't know if glusterd is currently limiting deployment options because; - there is an intention to drive the heavy lifting to the client (for example for performance reasons in big deployments), or; - there are known limitations in the existing bricks/ modules (for example moving files thru distribute), or; - there is ultimately (long term) more flexibility seen in this model (and we're at a midway point between pre glusterd and post so it doesn't feel that way yet), or; - there is an intent to drive out a particular market outcome or match an existing storage model (the gluster presentation was driving towards cloud, and maybe those vendors don't use server side implementations), etc. As I don't have a clear/big picture in my mind; if I'm not considering all of the impacts, then my apologies. > > RE ZK, I have an issue with it not being a binary at > > the linux distribution level. This is the reason I don't > > currently have Gluster's geo replication module in > > place .. > > What exactly is your objection to interpreted or JIT compiled languages? > Performance? Security? It's an unusual position, to say the least. > Specifically, primarily, space. 
Saturn builds GlusterFS capacity from a 48 Megabyte Linux distribution and adding many Megabytes of Perl and/or Python and/or PHP and/or Java for a single script is impractical. My secondary concern is licensing (specifically in the Java run-time environment case). Hadoop forced my hand; GNU's JRE/compiler wasn't up to the task of running Hadoop when I last looked at it (about 2 or 3 years ago now) - well, it could run a 2007 or so version but not current ones at that time - so now I work with Gluster .. Going back to ZkFarmer; Considering other architectures; it depends on how you slice and dice the problem as to how much external support you need; > I've long felt that our ways of dealing with cluster > membership and staging of config changes is not > quite as robust and scalable as we might want. By way of example; The openMosix kernel extensions maintained their own information exchange between cluster nodes; if a node (ip) was added via the /proc interface, it was "in" the cluster. Therefore cluster membership was the hand-off/interface. It could be as simple as a text list on each node, or it could be left to a user space daemon which could then gate cluster membership - this suited everyone with a small cluster. The native daemon (omdiscd) used multicast packets to find nodes and then stuff those IP's into the /proc interface - this suited everyone with a private/dedicated cluster. A colleague and I wrote a TCP variation to allow multi-site discovery with SSH public key exchanges and IPSEC tunnel establishment as part of the gating process - this suited those with a distributed/ part-time cluster. To ZooKeeper's point (http://zookeeper.apache.org/), the discovery protocol that we created was weak and I've since found a model/algorithm that allows for far more robust discovery. The point being that, depending on the final cluster architecture for gluster (i.e. all are nodes are peers and thus all are cluster members, nodes are client or server and both are cluster members, nodes are client or server and only clients [or servers] are cluster members, etc) there may be simpler cluster management options .. Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From ab at gluster.com Tue May 8 04:33:50 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 21:33:50 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <4FA7DFA3.1030300@redhat.com> References: <4FA7DFA3.1030300@redhat.com> Message-ID: On Mon, May 7, 2012 at 7:43 AM, Jeff Darcy wrote: > I've long felt that our ways of dealing with cluster membership and staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. ?Yeah, it brings in a > heavy Java dependency, but when I looked at some lighter-weight alternatives > they all seemed to be lacking in more important ways. ?Basically the idea was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles etc. > on every node (and dealing with the ugly cases where a node was down when the > config change happened). 
> > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. ?In that context, it looks like ZkFarmer[1] might be a big > help. ?AFAICT someone else was trying to solve almost exactly the same kind of > server/config problem that we have, and wrapped their solution into a library. > ?Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer Real issue is here is: GlusterFS is a fully distributed system. It is OK for config files to be in one place (centralized). It is easier to manage and backup. Avati still claims that making distributed copies are not a problem (volume operations are fast, versioned and checksumed). Also the code base for replicating 3 way or all-node is same. We all need to come to agreement on the demerits of replicating the volume spec on every node. If we are convinced to keep the config info in one place, ZK is certainly one a good idea. I personally hate Java dependency. I still struggle with Java dependencies for browser and clojure. I can digest that if we are going to adopt Java over Python for future external modules. Alternatively we can also look at creating a replicated meta system volume. What ever we adopt, we should keep dependencies and installation steps to the bare minimum and simple. -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein From ab at gluster.com Tue May 8 04:56:10 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 21:56:10 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205080427.q484RO09004857@singularity.tronunltd.com> References: <201205080427.q484RO09004857@singularity.tronunltd.com> Message-ID: On Mon, May 7, 2012 at 9:27 PM, Ian Latter wrote: > >> > Is there anything written up on why you/all want every >> > node to be completely conscious of every other node? >> > >> > I could see a couple of architectures that might work >> > better (be more scalable) if the config minutiae were >> > either not necessary to be shared or shared in only >> > cases where the config minutiae were a dependency. >> >> Well, these aren't exactly minutiae. ?Everything at file > or directory level is >> fully distributed and will remain so. ?We're talking only > about stuff at the >> volume or server level, which is very little data but very > broad in scope. >> Trying to segregate that only adds complexity and > subtracts convenience, >> compared to having it equally accessible to (or through) > any server. > > Sorry, I didn't have time this morning to add more detail. > > Note that my concern isn't bandwidth, its flexibility; the > less knowledge needed the more I can do crazy things > in user land, like running boxes in different data centres > and randomly power things up and down, randomly re- > address, randomly replace in-box hardware, load > balance, NAT, etc. ?It makes a dynamic environment > difficult to construct, for example, when Gluster rejects > the same volume-id being presented to an existing > cluster from a new GFID. 
> > But there's no need to go even that complicated, let > me pull out an example of where shared knowledge > may be unnecessary; > > The work that I was doing in Gluster (pre glusterd) drove > out one primary "server" which fronted a Replicate > volume of both its own Distribute volume and that of > another server or two - themselves serving a single > Distribute volume. ?So the client connected to one > server for one volume and the rest was black box / > magic (from the client's perspective - big fast storage > in many locations); in that case it could be said that > servers needed some shared knowledge, while the > clients didn't. > > The equivalent configuration in a glusterd world (from > my experiments) pushed all of the distribute knowledge > out to the client and I haven't had a response as to how > to add a replicate on distributed volumes in this model, > so I've lost replicate. ?But in this world, the client must > know about everything and the server is simply a set > of served/presented disks (as volumes). ?In this > glusterd world, then, why does any server need to > know of any other server, if the clients are doing all of > the heavy lifting? > > The additional consideration is where the server both > consumes and presents, but this would be captured in > the client side view. ?i.e. given where glusterd seems > to be driving, this knowledge seems to be needed on > the client side (within glusterfs, not glusterfsd). > > To my mind this breaks the gluster architecture that I > read about 2009, but I need to stress that I didn't get > a reply to the glusterd architecture question that I > posted about a month ago; ?so I don't know if glusterd > is currently limiting deployment options because; > ?- there is an intention to drive the heavy lifting to the > ? ?client (for example for performance reasons in big > ? ?deployments), or; > ?- there are known limitations in the existing bricks/ > ? ?modules (for example moving files thru distribute), > ? ?or; > ?- there is ultimately (long term) more flexibility seen > ? ?in this model (and we're at a midway point between > ? ?pre glusterd and post so it doesn't feel that way > ? ?yet), or; > ?- there is an intent to drive out a particular market > ? ?outcome or match an existing storage model (the > ? ?gluster presentation was driving towards cloud, > ? ?and maybe those vendors don't use server side > ? ?implementations), etc. > > As I don't have a clear/big picture in my mind; if I'm > not considering all of the impacts, then my apologies. > > >> > RE ZK, I have an issue with it not being a binary at >> > the linux distribution level. ?This is the reason I don't >> > currently have Gluster's geo replication module in >> > place .. >> >> What exactly is your objection to interpreted or JIT > compiled languages? >> Performance? ?Security? ?It's an unusual position, to say > the least. >> > > Specifically, primarily, space. ?Saturn builds GlusterFS > capacity from a 48 Megabyte Linux distribution and > adding many Megabytes of Perl and/or Python and/or > PHP and/or Java for a single script is impractical. > > My secondary concern is licensing (specifically in the > Java run-time environment case). ?Hadoop forced my > hand; GNU's JRE/compiler wasn't up to the task of > running Hadoop when I last looked at it (about 2 or 3 > years ago now) - well, it could run a 2007 or so > version but not current ones at that time - so now I > work with Gluster .. 
> > > > Going back to ZkFarmer; > > Considering other architectures; it depends on how > you slice and dice the problem as to how much > external support you need; > ?> I've long felt that our ways of dealing with cluster > ?> membership and staging of config changes is not > ?> quite as robust and scalable as we might want. > > By way of example; > ?The openMosix kernel extensions maintained their > own information exchange between cluster nodes; if > a node (ip) was added via the /proc interface, it was > "in" the cluster. ?Therefore cluster membership was > the hand-off/interface. > ?It could be as simple as a text list on each node, or > it could be left to a user space daemon which could > then gate cluster membership - this suited everyone > with a small cluster. > ?The native daemon (omdiscd) used multicast > packets to find nodes and then stuff those IP's into > the /proc interface - this suited everyone with a > private/dedicated cluster. > ?A colleague and I wrote a TCP variation to allow > multi-site discovery with SSH public key exchanges > and IPSEC tunnel establishment as part of the > gating process - this suited those with a distributed/ > part-time cluster. ?To ZooKeeper's point > (http://zookeeper.apache.org/), the discovery > protocol that we created was weak and I've since > found a model/algorithm that allows for far more > robust discovery. > > ?The point being that, depending on the final cluster > architecture for gluster (i.e. all are nodes are peers > and thus all are cluster members, nodes are client > or server and both are cluster members, nodes are > client or server and only clients [or servers] are > cluster members, etc) there may be simpler cluster > management options .. > > > Cheers, > Reason to keep the volume spec files on all servers is simply to be fully distributed. No one node or set of nodes should hold the cluster hostage. Code to keep them in sync over 2 nodes or 20 nodes is essentially the same. We are revisiting this situation now because we want to scale to 1000s of nodes potentially. Gluster CLI operations should not time out or slow down. If ZK requires proprietary JRE for stability, Java will be NO NO!. We may not need ZK at all. If we simply decide to centralize the config, GlusterFS has enough code to handle them. Again Avati will argue that it is exactly the same code as now. My point is to keep things simple as we scale. Even if the code base is same, we should still restrict it to N selected nodes. It is matter of adding config option. -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein From ian.latter at midnightcode.org Tue May 8 05:21:37 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 15:21:37 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205080521.q485Lb9d005117@singularity.tronunltd.com> > No one node or set of nodes should hold the > cluster hostage. Agreed - this is fundamental. > We are revisiting this situation now because we > want to scale to 1000s of nodes potentially. Good, I hate upper bounds on architectures :) Though I haven't tested my own implementation, I understand that one implementation of the discovery protocol that I've used, scaled to 20,000 hosts across three sites in two countries; this is the the type of robust outcome that can be manipulated at the macro scale - i.e. without manipulating per-node details. 
> Gluster CLI operations should not time out or > slow down. This is critical - not just the CLI but also the storage interface (in a redundant environment); infrastructure wears and fails, thus failing infrastructure should be regarded as the norm/ default. > If ZK requires proprietary JRE for stability, > Java will be NO NO!. *Fantastic* > My point is to keep things simple as we scale. I couldn't agree more. In that principle I ask that each dependency on cluster knowledge be considered carefully with a minimalist approach. -- Ian Latter Late night coder .. http://midnightcode.org/ From vbellur at redhat.com Tue May 8 09:15:13 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 08 May 2012 14:45:13 +0530 Subject: [Gluster-devel] Server outage - review.gluster.com - please stand by In-Reply-To: References: Message-ID: <4FA8E421.3090108@redhat.com> On 05/08/2012 01:05 AM, John Mark Walker wrote: > Greetings, > > Our iWeb server, which hosts review.gluster.com, is currently down. I have filed an urgent request to reboot the server in question. > > If you notice anything else working poorly, aside from review.gluster.com, please let me know ASAP. > > I'll send a note when services are back to normal. All services are back to normal. Please let us know if you notice any issue. Thanks, Vijay From xhernandez at datalab.es Tue May 8 09:34:35 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 08 May 2012 11:34:35 +0200 Subject: [Gluster-devel] A healing translator Message-ID: <4FA8E8AB.2040604@datalab.es> Hello developers, I would like to expose some ideas we are working on to create a new kind of translator that should be able to unify and simplify to some extent the healing procedures of complex translators. Currently, the only translator with complex healing capabilities that we are aware of is AFR. We are developing another translator that will also need healing capabilities, so we thought that it would be interesting to create a new translator able to handle the common part of the healing process and hence to simplify and avoid duplicated code in other translators. The basic idea of the new translator is to handle healing tasks nearer the storage translator on the server nodes instead to control everything from a translator on the client nodes. Of course the heal translator is not able to handle healing entirely by itself, it needs a client translator which will coordinate all tasks. The heal translator is intended to be used by translators that work with multiple subvolumes. I will try to explain how it works without entering into too much details. There is an important requisite for all client translators that use healing: they must have exactly the same list of subvolumes and in the same order. Currently, I think this is not a problem. The heal translator treats each file as an independent entity, and each one can be in 3 modes: 1. Normal mode This is the normal mode for a copy or fragment of a file when it is synchronized and consistent with the same file on other nodes (for example with other replicas. It is the client translator who decides if it is synchronized or not). 2. Healing mode This is the mode used when a client detects an inconsistency in the copy or fragment of the file stored on this node and initiates the healing procedures. 3. 
Provider mode (I don't like this name very much, though)

This is the mode used by client translators when an inconsistency is detected in this file, but the copy or fragment stored on this node is considered good and it will be used as a source to repair the contents of this file on other nodes.

Initially, when a file is created, it is set in normal mode.

Client translators that make changes must guarantee that they send the modification requests in the same order to all the servers. This should be done using inodelk/entrylk. When a change is sent to a server, the client must include a bitmap mask of the clients to which the request is being sent. Normally this is a bitmap containing all the clients; however, when a server fails for some reason, some bits will be cleared. The heal translator uses this bitmap to detect failures on other nodes early, from the point of view of each client. When this condition is detected, the request is aborted with an error and the client is notified with the remaining list of valid nodes. If the client considers that the request can be successfully served with the remaining list of nodes, it can resend the request with the updated bitmap.

The heal translator also updates two file attributes for each change request to maintain the "version" of the data and metadata contents of the file. A similar task is currently performed by AFR using xattrop. This would not be needed anymore, speeding up write requests. The version of data and metadata is returned to the client for each read request, allowing it to detect inconsistent data.

When a client detects an inconsistency, it initiates healing. First of all, it must lock the entry and inode (when necessary). Then, from the data collected from each node, it must decide which nodes have good data and which ones have bad data and hence need to be healed. There are two possible cases:

1. File is not a regular file

In this case the reconstruction is very fast and requires few requests, so it is done while the file is locked. In this case, the heal translator does nothing relevant.

2. File is a regular file

For regular files, the first step is to synchronize the metadata to the bad nodes, including the version information. Once this is done, the file is set in healing mode on the bad nodes, and provider mode on the good nodes. Then the entry and inode are unlocked.

When a file is in provider mode, it works as in normal mode, but refuses to start another healing. Only one client can be healing a file.

When a file is in healing mode, each normal write request from any client is handled as if the file were in normal mode, updating the version information and detecting possible inconsistencies with the bitmap. Additionally, the healing translator marks the written region of the file as "good". Each write request from the healing client intended to repair the file must be marked with a special flag. In this case, the area to be written is filtered by the list of "good" ranges (any intersection with a good range is removed from the request). The resulting set of ranges is propagated to the lower translator and added to the list of "good" ranges, but the version information is not updated. Read requests are only served if the requested range is entirely contained in the list of "good" regions.

There are some additional details, but I think this is enough to have a general idea of its purpose and how it works. The main advantages of this translator are:

1. Avoid duplicated code in client translators
2.
Simplify and unify healing methods in client translators 3. xattrop is not needed anymore in client translators to keep track of changes 4. Full file contents are repaired without locking the file 5. Better detection and prevention of some split brain situations as soon as possible I think it would be very useful. It seems to me that it works correctly in all situations, however I don't have all the experience that other developers have with the healing functions of AFR, so I will be happy to answer any question or suggestion to solve problems it may have or to improve it. What do you think about it ? Thank you, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From jdarcy at redhat.com Tue May 8 12:57:31 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Tue, 08 May 2012 08:57:31 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: <4FA9183B.5080708@redhat.com> On 05/08/2012 12:33 AM, Anand Babu Periasamy wrote: > Real issue is here is: GlusterFS is a fully distributed system. It is > OK for config files to be in one place (centralized). It is easier to > manage and backup. Avati still claims that making distributed copies > are not a problem (volume operations are fast, versioned and > checksumed). It's also grossly inefficient at 100-node scale. I'll also need some convincing before I believe that nodes which are down during a config change will catch up automatically and reliably in all cases. I think this is even more of an issue with membership than with config data. All-to-all pings are just not acceptable at 100-node or greater scale. We need something better, and more importantly designing cluster membership protocols is just not a business we should even be in. We shouldn't be devoting our own time to that when we can just use something designed by people who have that as their focus. > Also the code base for replicating 3 way or all-node is > same. We all need to come to agreement on the demerits of replicating > the volume spec on every node. It's somewhat similar to how we replicate data - we need enough copies to survive a certain number of anticipated failures. > If we are convinced to keep the config info in one place, ZK is > certainly one a good idea. I personally hate Java dependency. I still > struggle with Java dependencies for browser and clojure. I can digest > that if we are going to adopt Java over Python for future external > modules. Alternatively we can also look at creating a replicated meta > system volume. What ever we adopt, we should keep dependencies and > installation steps to the bare minimum and simple. I personally hate the Java dependency too. I'd much rather have something in C/Go/Python/Erlang but couldn't find anything that had the same (useful) feature set. I also considered the idea of storing config in a hand-crafted GlusterFS volume, using our own mechanisms for distributing/finding and replicating data. That's at least an area where we can claim some expertise. Such layering does create a few interesting issues, but nothing intractable. The big drawback is that it only solves the config-data problem; a solution which combines that with cluster membership is IMO preferable. The development drag of having to maintain that functionality ourselves, and hook every new feature into the not-very-convenient APIs that have predictably resulted, is considerable. 
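For concreteness, the approach discussed in this thread (ephemeral znodes for membership, a watched znode for volume config) could look roughly like the sketch below using the ZooKeeper C client. The znode paths, UUID and server list are invented, parent znodes are assumed to already exist, and error handling is omitted; this is an illustration of the mechanism, not proposed GlusterFS code.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <zookeeper/zookeeper.h>

static void
config_watcher (zhandle_t *zh, int type, int state,
                const char *path, void *ctx)
{
        /* fired when the stored volfile changes */
        if (type == ZOO_CHANGED_EVENT)
                printf ("config changed: %s - regenerate local state\n",
                        path);
}

/* "peer probe" becomes: create an ephemeral znode.  It disappears by
 * itself if this node loses its session, so "peer status" is just a
 * listing of /gluster/peers. */
static int
join_cluster (zhandle_t *zh, const char *peer_uuid)
{
        char path[256];
        char created[256];

        snprintf (path, sizeof (path), "/gluster/peers/%s", peer_uuid);
        return zoo_create (zh, path, "", 0, &ZOO_OPEN_ACL_UNSAFE,
                           ZOO_EPHEMERAL, created, sizeof (created));
}

/* The volfile is stored once; every node sets a watch and is notified
 * on change instead of having a copy pushed to it. */
static int
watch_volume_config (zhandle_t *zh, const char *volname)
{
        char        path[256];
        static char buf[65536];
        int         len = sizeof (buf);
        struct Stat st;

        snprintf (path, sizeof (path), "/gluster/volfiles/%s", volname);
        return zoo_wget (zh, path, config_watcher, NULL, buf, &len, &st);
}

int
main (void)
{
        zhandle_t *zh;

        zh = zookeeper_init ("zk1:2181,zk2:2181,zk3:2181", NULL,
                             30000, NULL, NULL, 0);
        if (!zh)
                return 1;

        join_cluster (zh, "00000000-feed-face-0000-000000000000");
        watch_volume_config (zh, "vol0");

        sleep (60);          /* a real daemon would run an event loop */
        zookeeper_close (zh);
        return 0;
}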
From jdarcy at redhat.com Tue May 8 12:42:19 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Tue, 08 May 2012 08:42:19 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205080427.q484RO09004857@singularity.tronunltd.com> References: <201205080427.q484RO09004857@singularity.tronunltd.com> Message-ID: <4FA914AB.8030209@redhat.com> On 05/08/2012 12:27 AM, Ian Latter wrote: > The equivalent configuration in a glusterd world (from > my experiments) pushed all of the distribute knowledge > out to the client and I haven't had a response as to how > to add a replicate on distributed volumes in this model, > so I've lost replicate. This doesn't seem to be a problem with replicate-first vs. distribute-first, but with client-side vs. server-side deployment of those translators. You *can* construct your own volfiles that do these things on the servers. It will work, but you won't get a lot of support for it. The issue here is that we have only a finite number of developers, and a near-infinite number of configurations. We can't properly qualify everything. One way we've tried to limit that space is by preferring distribute over replicate, because replicate does a better job of shielding distribute from brick failures than vice versa. Another is to deploy both on the clients, following the scalability rule of pushing effort to the most numerous components. The code can support other arrangements, but the people might not. BTW, a similar concern exists with respect to replication (i.e. AFR) across data centers. Performance is going to be bad, and there's not going to be much we can do about it. > But in this world, the client must > know about everything and the server is simply a set > of served/presented disks (as volumes). In this > glusterd world, then, why does any server need to > know of any other server, if the clients are doing all of > the heavy lifting? First, because config changes have to apply across servers. Second, because server machines often spin up client processes for things like repair or rebalance. From ian.latter at midnightcode.org Tue May 8 23:08:32 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 09:08:32 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205082308.q48N8WQg008425@singularity.tronunltd.com> > On 05/08/2012 12:27 AM, Ian Latter wrote: > > The equivalent configuration in a glusterd world (from > > my experiments) pushed all of the distribute knowledge > > out to the client and I haven't had a response as to how > > to add a replicate on distributed volumes in this model, > > so I've lost replicate. > > This doesn't seem to be a problem with replicate-first vs. distribute-first, > but with client-side vs. server-side deployment of those translators. You > *can* construct your own volfiles that do these things on the servers. It will > work, but you won't get a lot of support for it. The issue here is that we > have only a finite number of developers, and a near-infinite number of > configurations. We can't properly qualify everything. One way we've tried to > limit that space is by preferring distribute over replicate, because replicate > does a better job of shielding distribute from brick failures than vice versa. > Another is to deploy both on the clients, following the scalability rule of > pushing effort to the most numerous components. The code can support other > arrangements, but the people might not. 
Sure, I have my own vol files that do (did) what I wanted and I was supporting myself (and users); the question (and the point) is what is the GlusterFS *intent*? I'll write an rsyncd wrapper myself, to run on top of Gluster, if the intent is not allow the configuration I'm after (arbitrary number of disks in one multi-host environment replicated to an arbitrary number of disks in another multi-host environment, where ideally each environment need not sum to the same data capacity, presented in a single contiguous consumable storage layer to an arbitrary number of unintelligent clients, that is as fault tolerant as I choose it to be including the ability to add and offline/online and remove storage as I so choose) .. or switch out the whole solution if Gluster is heading away from my needs. I just need to know what the direction is .. I may even be able to help get you there if you tell me :) > BTW, a similar concern exists with respect to replication (i.e. AFR) across > data centers. Performance is going to be bad, and there's not going to be much > we can do about it. Hmm .. that depends .. these sorts of statements need context/qualification (in bandwidth and latency terms). For example the last multi-site environment that I did architecture for was two DCs set 32kms apart with a redundant 20Gbps layer-2 (ethernet) stretch between them - latency was 1ms average, 2ms max (the fiber actually took a 70km path). Didn't run Gluster on it, but we did stretch a number things that "couldn't" be stretched. > > But in this world, the client must > > know about everything and the server is simply a set > > of served/presented disks (as volumes). In this > > glusterd world, then, why does any server need to > > know of any other server, if the clients are doing all of > > the heavy lifting? > > First, because config changes have to apply across servers. Second, because > server machines often spin up client processes for things like repair or > rebalance. Yep, but my reading is that the config's that the servers need are local - to make a disk a share (volume), and that as you've described the rest are "client processes" (even when on something built as a "server"), so if you catered for all clients then you'd be set? I.e. AFR now runs in the client? And I am sick of the word-wrap on this client .. I think you've finally convinced me to fix it ... what's normal these days - still 80 chars? -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Wed May 9 00:57:49 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 17:57:49 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205082308.q48N8WQg008425@singularity.tronunltd.com> References: <201205082308.q48N8WQg008425@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 4:08 PM, Ian Latter wrote: > > On 05/08/2012 12:27 AM, Ian Latter wrote: > > > The equivalent configuration in a glusterd world (from > > > my experiments) pushed all of the distribute knowledge > > > out to the client and I haven't had a response as to how > > > to add a replicate on distributed volumes in this model, > > > so I've lost replicate. > > > > This doesn't seem to be a problem with replicate-first vs. > distribute-first, > > but with client-side vs. server-side deployment of those > translators. You > > *can* construct your own volfiles that do these things on > the servers. It will > > work, but you won't get a lot of support for it. 
The > issue here is that we > > have only a finite number of developers, and a > near-infinite number of > > configurations. We can't properly qualify everything. > One way we've tried to > > limit that space is by preferring distribute over > replicate, because replicate > > does a better job of shielding distribute from brick > failures than vice versa. > > Another is to deploy both on the clients, following the > scalability rule of > > pushing effort to the most numerous components. The code > can support other > > arrangements, but the people might not. > > Sure, I have my own vol files that do (did) what I wanted > and I was supporting myself (and users); the question > (and the point) is what is the GlusterFS *intent*? The "intent" (more or less - I hate to use the word as it can imply a commitment to what I am about to say, but there isn't one) is to keep the bricks (server process) dumb and have the intelligence on the client side. This is a "rough goal". There are cases where replication on the server side is inevitable (in the case of NFS access) but we keep the software architecture undisturbed by running a client process on the server machine to achieve it. We do plan to support "replication on the server" in the future while still retaining the existing software architecture as much as possible. This is particularly useful in Hadoop environment where the jobs expect write performance of a single copy and expect copy to happen in the background. We have the proactive self-heal daemon running on the server machines now (which again is a client process which happens to be physically placed on the server) which gives us many interesting possibilities - i.e, with simple changes where we fool the client side replicate translator at the time of transaction initiation that only the closest server is up at that point of time and write to it alone, and have the proactive self-heal daemon perform the extra copies in the background. This would be consistent with other readers as they get directed to the "right" version of the file by inspecting the changelogs while the background replication is in progress. The intention of the above example is to give a general sense of how we want to evolve the architecture (i.e, the "intention" you were referring to) - keep the clients intelligent and servers dumb. If some intelligence needs to be built on the physical server, tackle it by loading a client process there (there are also "pathinfo xattr" kind of internal techniques to figure out locality of the clients in a generic way without bringing "server sidedness" into them in a harsh way) I'll > write an rsyncd wrapper myself, to run on top of Gluster, > if the intent is not allow the configuration I'm after > (arbitrary number of disks in one multi-host environment > replicated to an arbitrary number of disks in another > multi-host environment, where ideally each environment > need not sum to the same data capacity, presented in a > single contiguous consumable storage layer to an > arbitrary number of unintelligent clients, that is as fault > tolerant as I choose it to be including the ability to add > and offline/online and remove storage as I so choose) .. > or switch out the whole solution if Gluster is heading > away from my needs. I just need to know what the > direction is .. I may even be able to help get you there if > you tell me :) > > There are good and bad in both styles (distribute on top v/s replicate on top). Replicate on top gives you much better flexibility of configuration. 
Distribute on top is easier for us developers. As a user I would like replicate on top as well. But the problem today is that replicate (and self-heal) does not understand "partial failure" of its subvolumes. If one of the subvolume of replicate is a distribute, then today's replicate only understands complete failure of the distribute set or it assumes everything is completely fine. An example is self-healing of directory entries. If a file is "missing" in one subvolume because a distribute node is temporarily down, replicate has no clue why it is missing (or that it should keep away from attempting to self-heal). Along the same lines, it does not know that once a server is taken off from its distribute subvolume for good that it needs to start recreating missing files. The effort to fix this seems to be big enough to disturb the inertia of status quo. If this is fixed, we can definitely adopt a replicate-on-top mode in glusterd. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From abperiasamy at gmail.com Wed May 9 01:05:37 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Tue, 8 May 2012 18:05:37 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205082308.q48N8WQg008425@singularity.tronunltd.com> References: <201205082308.q48N8WQg008425@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 4:08 PM, Ian Latter wrote: >> On 05/08/2012 12:27 AM, Ian Latter wrote: >> > The equivalent configuration in a glusterd world (from >> > my experiments) pushed all of the distribute knowledge >> > out to the client and I haven't had a response as to how >> > to add a replicate on distributed volumes in this model, >> > so I've lost replicate. >> >> This doesn't seem to be a problem with replicate-first vs. > distribute-first, >> but with client-side vs. server-side deployment of those > translators. ?You >> *can* construct your own volfiles that do these things on > the servers. ?It will >> work, but you won't get a lot of support for it. ?The > issue here is that we >> have only a finite number of developers, and a > near-infinite number of >> configurations. ?We can't properly qualify everything. > One way we've tried to >> limit that space is by preferring distribute over > replicate, because replicate >> does a better job of shielding distribute from brick > failures than vice versa. >> Another is to deploy both on the clients, following the > scalability rule of >> pushing effort to the most numerous components. ?The code > can support other >> arrangements, but the people might not. > > Sure, I have my own vol files that do (did) what I wanted > and I was supporting myself (and users); the question > (and the point) is what is the GlusterFS *intent*? ?I'll > write an rsyncd wrapper myself, to run on top of Gluster, > if the intent is not allow the configuration I'm after > (arbitrary number of disks in one multi-host environment > replicated to an arbitrary number of disks in another > multi-host environment, where ideally each environment > need not sum to the same data capacity, presented in a > single contiguous consumable storage layer to an > arbitrary number of unintelligent clients, that is as fault > tolerant as I choose it to be including the ability to add > and offline/online and remove storage as I so choose) .. > or switch out the whole solution if Gluster is heading > away from my ?needs. ?I just need to know what the > direction is .. 
I may even be able to help get you there if > you tell me :) Rsync'ing the vol spec files is the simplest and elegant approach. It is how glusterfs originally handled config files. How ever elastic volume management (online volume management operations) requires synchronized online changes to volume spec files. This requires GlusterFS to manage volume specification files internally. That is why we brought glusterd in 3.1. Real question is: do we want to keep the volume spec files on all nodes (fully distributed) or few selected nodes. > >> BTW, a similar concern exists with respect to replication > (i.e. AFR) across >> data centers. ?Performance is going to be bad, and there's > not going to be much >> we can do about it. > > Hmm .. that depends .. these sorts of statements need > context/qualification (in bandwidth and latency terms). ?For > example the last multi-site environment that I did > architecture for was two DCs set 32kms apart with a > redundant 20Gbps layer-2 (ethernet) stretch between > them - latency was 1ms average, 2ms max (the fiber > actually took a 70km path). ?Didn't run Gluster on it, but > we did stretch a number things that "couldn't" be stretched. > > >> > But in this world, the client must >> > know about everything and the server is simply a set >> > of served/presented disks (as volumes). ?In this >> > glusterd world, then, why does any server need to >> > know of any other server, if the clients are doing all of >> > the heavy lifting? >> >> First, because config changes have to apply across > servers. ?Second, because >> server machines often spin up client processes for things > like repair or >> rebalance. > > Yep, but my reading is that the config's that the servers > need are local - to make a disk a share (volume), and > that as you've described the rest are "client processes" > (even when on something built as a "server"), so if you > catered for all clients then you'd be set? ?I.e. AFR now > runs in the client? > > > And I am sick of the word-wrap on this client .. I think > you've finally convinced me to fix it ... what's normal > these days - still 80 chars? I used to line-wrap (gnus and cool emacs extensions). It doesn't make sense to line wrap any more. Let the email client handle it depending on the screen size of the device (mobile / tablet / desktop). -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Wed May 9 01:33:50 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 18:33:50 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: On Mon, May 7, 2012 at 9:33 PM, Anand Babu Periasamy wrote: > On Mon, May 7, 2012 at 7:43 AM, Jeff Darcy wrote: > > I've long felt that our ways of dealing with cluster membership and > staging of > > config changes is not quite as robust and scalable as we might want. > > Accordingly, I spent a bit of time a couple of weeks ago looking into the > > possibility of using ZooKeeper to do some of this stuff. Yeah, it > brings in a > > heavy Java dependency, but when I looked at some lighter-weight > alternatives > > they all seemed to be lacking in more important ways. Basically the > idea was > > to do this: > > > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper > servers, or > > point everyone at an existing ZooKeeper cluster. 
> > > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer > probe" > > merely updates ZK, and "peer status" merely reads from it). > > > > * Store config information in ZK *once* instead of regenerating volfiles > etc. > > on every node (and dealing with the ugly cases where a node was down > when the > > config change happened). > > > > * Set watches on ZK nodes to be notified when config changes happen, and > > respond appropriately. > > > > I eventually ran out of time and moved on to other things, but this or > > something like it (e.g. using Riak Core) still seems like a better > approach > > than what we have. In that context, it looks like ZkFarmer[1] might be > a big > > help. AFAICT someone else was trying to solve almost exactly the same > kind of > > server/config problem that we have, and wrapped their solution into a > library. > > Is this a direction other devs might be interested in pursuing some day, > > if/when time allows? > > > > > > [1] https://github.com/rs/zkfarmer > > Real issue is here is: GlusterFS is a fully distributed system. It is > OK for config files to be in one place (centralized). It is easier to > manage and backup. Avati still claims that making distributed copies > are not a problem (volume operations are fast, versioned and > checksumed). Also the code base for replicating 3 way or all-node is > same. We all need to come to agreement on the demerits of replicating > the volume spec on every node. > My claim is somewhat similar to what you said literally, but slightly different in meaning. What I mean is, while it is true keeping multiple copies of the volfile is more expensive/resource consuming in theory, what is the breaking point in terms of number of servers where it begins to matter? There are trivial (low lying) enhancements which are possible (for e.g, store volfiles of a volume only on participating servers instead of all servers) which could address a class of concerns. There are clear advantages in having volfiles in all the participating nodes at least - it takes away dependency on order of booting of servers in your data centre. If volfiles are available locally you dont have to wait/retry for the "central servers" to come up first. Whether this is volfiles managed by glusterd, or "storage servers" of ZK, it is a big advantage to have the startup of a given server decoupled from the others (of course the coupling comes in at an operational level at the time of volume modifications, but that is much more acceptable). If the storage of volfiles on all servers really seems unnecessary, we should first come up with real hard numbers - number of servers v/s latency of volume operations and then figure out at what point it starts becoming unacceptably slow. Maybe a good solution is to just propagate the volfiles in the background while still retaining version info than introducing a more intrusive change? But we really need the numbers first. > > If we are convinced to keep the config info in one place, ZK is > certainly one a good idea. I personally hate Java dependency. I still > struggle with Java dependencies for browser and clojure. I can digest > that if we are going to adopt Java over Python for future external > modules. Alternatively we can also look at creating a replicated meta > system volume. What ever we adopt, we should keep dependencies and > installation steps to the bare minimum and simple. 
> > It is true other projects have figured out the problem of membership and configuration management and specialize at doing that. That is very good for the entire computing community as a whole. If there are components we can incorporate and build upon their work, that is very desirable. At the same time we also need to check what other baggage we inherit along with the specialized expertise we take on. One of the biggest strengths of Gluster has been its "lightweight"edness and lack of dependencies - which in turn has driven our adoption significantly which in turn results in higher feedback and bug reports etc. (i.e, it is not an isolated strength in itself). Enforcing a Java dependency down the throat of users who want a simple distributed filesystem (yes, the moment we stop thinking of gluster as a "simple" distributed filesystem - even though it may be an oxymoron technically, but I guess you know what I mean :) it's a slippery slope towards it becoming "yet another" distributed filesystem.) The simplicity is what "makes" gluster to a large extent what it is. This makes the developer's life miserable to a fair degree, but it anyways always is, one way or another ;) I am not against adopting external projects. There are good reasons many times to do so. If there are external projects which are "compatible in personality" with gluster and helps us avoid reinventing the wheel, we must definitely do so. If they are not compatible, I'm sure there are lessons and ideas we can adopt, if not code. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Wed May 9 04:18:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 9 May 2012 04:18:46 +0000 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: <20120509041846.GB18684@homeworld.netbsd.org> On Mon, May 07, 2012 at 09:33:50PM -0700, Anand Babu Periasamy wrote: > I personally hate Java dependency. Me too. I know Java programs are supposed to have decent performances, but my experiences had always been terrible. Please do not add a dependency on Java. -- Emmanuel Dreyfus manu at netbsd.org From manu at netbsd.org Wed May 9 04:41:47 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 9 May 2012 04:41:47 +0000 Subject: [Gluster-devel] glusterfs-3.3.0qa40 released In-Reply-To: <20120507081553.5AA00100C5@build.gluster.com> References: <20120507081553.5AA00100C5@build.gluster.com> Message-ID: <20120509044147.GC18684@homeworld.netbsd.org> On Mon, May 07, 2012 at 01:15:50AM -0700, Vijay Bellur wrote: > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz Hi There is a small issue with python: the machine that runs autoconf only has python 2.5 installed, and as a result, the generated configure script fails to detect an installed python 2.6 or higher. Here is an example at mine, where python 2.7 is installed: checking for a Python interpreter with version >= 2.4... none configure: error: no suitable Python interpreter found That can be fixed by patching configure, but it would be nice if gluster builds could contain the check with latest python. -- Emmanuel Dreyfus manu at netbsd.org From renqiang at 360buy.com Wed May 9 04:46:08 2012 From: renqiang at 360buy.com (=?gb2312?B?yM7Hvw==?=) Date: Wed, 9 May 2012 12:46:08 +0800 Subject: [Gluster-devel] How to repair a 1TB disk in 30 mins Message-ID: <000301cd2d9e$a6b07fc0$f4117f40$@com> Dear All: I have a question. 
When I have a large cluster, maybe more than 10PB data, if a file have 3 copies and each disk have 1TB capacity, So we need about 30,000 disks. All disks are very cheap and are easily damaged. We must repair a 1TB disk in 30 mins?As far as I know?in gluster architecture?all data in the damaged disk will be repaired to the new disk which is used to replace the damaged disk. As a result of the writing speed of disk, when we repair 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 mins? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Wed May 9 05:35:40 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 15:35:40 +1000 Subject: [Gluster-devel] automake Message-ID: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Hello, I have built a new module and I can't seem to get the changed makefiles to be built. I have not used "configure" in any of my projects and I'm not seeing an answer from my google searches. The error that I get is during the "make" where glusterfs-3.2.6/missing errors at line 52 "automake-1.9: command not found". This is a newer RedHat environment and it has automake 1.11 .. if I cp 1.11 to 1.9 I get other errors ... libtool is reporting that the automake version is 1.11.1. I believe that it is getting the 1.9 version from Gluster ... How do I get a new Makefile.am and Makefile.in to work in this structure? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From harsha at gluster.com Wed May 9 06:03:00 2012 From: harsha at gluster.com (Harshavardhana) Date: Tue, 8 May 2012 23:03:00 -0700 Subject: [Gluster-devel] automake In-Reply-To: <201205090535.q495Ze5E009996@singularity.tronunltd.com> References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: Ian, Please re-run the ./autogen.sh and use again. Make sure you have added entries in 'configure.ac' and 'Makefile.am' for the respective module name and directory. -Harsha On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > Hello, > > > ?I have built a new module and I can't seem to > get the changed makefiles to be built. ?I have not > used "configure" in any of my projects and I'm > not seeing an answer from my google searches. > > ?The error that I get is during the "make" where > glusterfs-3.2.6/missing errors at line 52 > "automake-1.9: command not found". > > ?This is a newer RedHat environment and it has > automake 1.11 .. if I cp 1.11 to 1.9 I get other > errors ... libtool is reporting that the automake > version is 1.11.1. ?I believe that it is getting the > 1.9 version from Gluster ... > > ?How do I get a new Makefile.am and Makefile.in > to work in this structure? > > > > Cheers, > > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Wed May 9 06:05:54 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 16:05:54 +1000 Subject: [Gluster-devel] automake Message-ID: <201205090605.q4965sPn010223@singularity.tronunltd.com> You're a champion. Thanks Harsha. ----- Original Message ----- >From: "Harshavardhana" >To: "Ian Latter" >Subject: Re: [Gluster-devel] automake >Date: Tue, 08 May 2012 23:03:00 -0700 > > Ian, > > Please re-run the ./autogen.sh and use again. 
> > Make sure you have added entries in 'configure.ac' and 'Makefile.am' > for the respective module name and directory. > > -Harsha > > On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > > Hello, > > > > > > ?I have built a new module and I can't seem to > > get the changed makefiles to be built. ?I have not > > used "configure" in any of my projects and I'm > > not seeing an answer from my google searches. > > > > ?The error that I get is during the "make" where > > glusterfs-3.2.6/missing errors at line 52 > > "automake-1.9: command not found". > > > > ?This is a newer RedHat environment and it has > > automake 1.11 .. if I cp 1.11 to 1.9 I get other > > errors ... libtool is reporting that the automake > > version is 1.11.1. ?I believe that it is getting the > > 1.9 version from Gluster ... > > > > ?How do I get a new Makefile.am and Makefile.in > > to work in this structure? > > > > > > > > Cheers, > > > > > > > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Wed May 9 06:08:41 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 23:08:41 -0700 Subject: [Gluster-devel] automake In-Reply-To: <201205090535.q495Ze5E009996@singularity.tronunltd.com> References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: You might want to read autobook for the general theory behind autotools. Here's a quick summary - aclocal prepares the running of autotools. autoheader prepares autotools to generate a config.h to be consumed by C code configure.ac is the "source" to discover the build system and accept user parameters autoconf converts configure.ac to configure Makefile.am is the "source" to define what is to be built and how. automake converts Makefile.am to Makefile.in till here everything is scripted in ./autogen.sh running configure creates Makefile out of Makefile.in now run make :) Avati On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > Hello, > > > I have built a new module and I can't seem to > get the changed makefiles to be built. I have not > used "configure" in any of my projects and I'm > not seeing an answer from my google searches. > > The error that I get is during the "make" where > glusterfs-3.2.6/missing errors at line 52 > "automake-1.9: command not found". > > This is a newer RedHat environment and it has > automake 1.11 .. if I cp 1.11 to 1.9 I get other > errors ... libtool is reporting that the automake > version is 1.11.1. I believe that it is getting the > 1.9 version from Gluster ... > > How do I get a new Makefile.am and Makefile.in > to work in this structure? > > > > Cheers, > > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From abperiasamy at gmail.com Wed May 9 07:21:35 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Wed, 9 May 2012 00:21:35 -0700 Subject: [Gluster-devel] automake In-Reply-To: References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 11:08 PM, Anand Avati wrote: > You might want to read autobook for the general theory behind autotools. > Here's a quick summary - > > aclocal prepares the running of autotools. > autoheader prepares autotools to generate a config.h to be consumed by C > code > configure.ac is the "source" to discover the build system and accept user > parameters > autoconf converts configure.ac to configure > Makefile.am is the "source" to define what is to be built and how. > automake converts Makefile.am to Makefile.in > > till here everything is scripted in ./autogen.sh > > running configure creates Makefile out of Makefile.in > > now run make :) > Best way to learn autotools is copy-paste-customize. In general, if you are starting a new project, Debian has a nice little tool called "autoproject". It will auto generate autoconf and automake files. Then you start customizing it. GNU project should really merge all these tools in to one simple coherent system. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From abperiasamy at gmail.com Wed May 9 07:54:43 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Wed, 9 May 2012 00:54:43 -0700 Subject: [Gluster-devel] How to repair a 1TB disk in 30 mins In-Reply-To: <000301cd2d9e$a6b07fc0$f4117f40$@com> References: <000301cd2d9e$a6b07fc0$f4117f40$@com> Message-ID: On Tue, May 8, 2012 at 9:46 PM, ?? wrote: > Dear All: > > ? I have a question. When I have a large cluster, maybe more than 10PB data, > if a file have 3 copies and each disk have 1TB capacity, So we need about > 30,000 disks. All disks are very cheap and are easily damaged. We must > ?repair a 1TB disk in 30 mins?As far as I know?in gluster architecture?all > data in the damaged disk will be repaired to the new disk which is used to > replace the damaged disk. As a result of the writing speed of disk, when we > repair 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 > mins? 5 hours is based on SATA 1TB disk copying at ~50MB/s across small and large files + folders. This means, you literally attached the disk to the system and manually transferring the data. I can't think of any other faster way to transfer data on 1TB 7200RPM SATA/SAS disks without bending space-time ;). Larger disks and RAID arrays only makes this worse. This is exactly why we implemented passive self-heal in the first place. GlusterFS heals files on demand (as they are accessed), so applications have least down time or disruption. There is plenty of time to heal the cold data in background. All we should care is minimal down time. Self-heal in 3.3 has some major improvements. It got significantly faster, because healing is performed on the server side entirely (server to server). It can perform granular healing on large files (previously checksum operations used to pause or timeout the VMs). Active-healing (Replicate now remembers pending files and heals them when the failed node comes back. Previously you have to perform name-space wide recursive directory listing). Most importantly self-healing is no longer a blackbox. heal-info can show pending and currently-healing files. 
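For reference, the five-hour figure follows directly from the numbers above: 1 TB at roughly 50 MB/s is about 1,000,000 MB / 50 MB/s = 20,000 s, i.e. around five and a half hours, before any seek overhead on small files. The pending/currently-healing visibility mentioned is exposed through the 3.3 CLI roughly as follows; treat the exact sub-command spellings as indicative rather than quoted from the documentation.

# heal only the entries known to be pending
gluster volume heal VOLNAME

# force a full crawl of the volume (e.g. after replacing a brick)
gluster volume heal VOLNAME full

# show files that are pending heal or currently being healed
gluster volume heal VOLNAME info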
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From renqiang at 360buy.com Wed May 9 09:29:34 2012 From: renqiang at 360buy.com (=?utf-8?B?5Lu75by6?=) Date: Wed, 9 May 2012 17:29:34 +0800 Subject: [Gluster-devel] =?utf-8?b?562U5aSNOiAgSG93IHRvIHJlcGFpciBhIDFU?= =?utf-8?q?B_disk_in_30_mins?= In-Reply-To: References: <000301cd2d9e$a6b07fc0$f4117f40$@com> Message-ID: <002601cd2dc6$3f68f4f0$be3aded0$@com> Thank you very much? And I have some questions? 1?What's the capacity of the largest cluster online ?And how many nodes in it? And What is it used for? 2?When we excute 'ls' in a directory,it's very slow,if the cluster has too many bricks and too many nodes.Can we do it well? -----????----- ???: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] ????: 2012?5?9? 15:55 ???: renqiang ??: gluster-devel at nongnu.org ??: Re: [Gluster-devel] How to repair a 1TB disk in 30 mins On Tue, May 8, 2012 at 9:46 PM, ?? wrote: > Dear All: > > I have a question. When I have a large cluster, maybe more than 10PB data, > if a file have 3 copies and each disk have 1TB capacity, So we need about > 30,000 disks. All disks are very cheap and are easily damaged. We must > repair a 1TB disk in 30 mins?As far as I know?in gluster architecture?all > data in the damaged disk will be repaired to the new disk which is used to > replace the damaged disk. As a result of the writing speed of disk, when we > repair 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 > mins? 5 hours is based on SATA 1TB disk copying at ~50MB/s across small and large files + folders. This means, you literally attached the disk to the system and manually transferring the data. I can't think of any other faster way to transfer data on 1TB 7200RPM SATA/SAS disks without bending space-time ;). Larger disks and RAID arrays only makes this worse. This is exactly why we implemented passive self-heal in the first place. GlusterFS heals files on demand (as they are accessed), so applications have least down time or disruption. There is plenty of time to heal the cold data in background. All we should care is minimal down time. Self-heal in 3.3 has some major improvements. It got significantly faster, because healing is performed on the server side entirely (server to server). It can perform granular healing on large files (previously checksum operations used to pause or timeout the VMs). Active-healing (Replicate now remembers pending files and heals them when the failed node comes back. Previously you have to perform name-space wide recursive directory listing). Most importantly self-healing is no longer a blackbox. heal-info can show pending and currently-healing files. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From ian.latter at midnightcode.org Thu May 10 05:47:06 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Thu, 10 May 2012 15:47:06 +1000 Subject: [Gluster-devel] Hide Feature Message-ID: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> Hello, I have published an untested "hide" module (compiled against glusterfs-3.2.6); A simple method for hiding an underlying directory structure from parent/up-stream bricks within GlusterFS. In 2012 this code was spawned from my incomplete 2009 dedupe brick code which used this method to protect its internal hash database from the user, above. 
http://midnightcode.org/projects/saturn/code/hide-0.5.tgz I am serious when I mean untested - I've not even loaded the module under Gluster, it simply compiles. Let me know if there are tweaks that should be made or considered. Enjoy. -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 05:55:55 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Thu, 10 May 2012 15:55:55 +1000 Subject: [Gluster-devel] Fuse operations Message-ID: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> Hello, I published the Hide module in order to open a discussion around Fuse operations; http://fuse.sourceforge.net/doxygen/structfuse__operations.html In the dedupe module I want to secure the hash database from direct parent/use manipulation. The approach that I took was to find every GlusterFS file operation (fop) that took a loc_t parameter (as discovered via every xlator that is included in the tarball), in order to do path matching and then pass-through the call or return an error. The problem is that I can't find GlusterFS examples for all of the Fuse operators and, when I stray from the examples (like getattr and utiments), gluster tells me that there are no such xlator fops (at compile time - from the wind and unwind macros). So, I guess; 1) Are all Fuse/FS ops handled by Gluster? 2) Where can I find a complete list of the Gluster fops, and not just those that have been used in existing modules? 3) Is it safe to path match on loc_t? (i.e. is it fully resolved such that I won't find /etc/././././passwd)? This I could test .. Thanks, -- Ian Latter Late night coder .. http://midnightcode.org/ From jdarcy at redhat.com Thu May 10 13:39:21 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 10 May 2012 09:39:21 -0400 Subject: [Gluster-devel] Hide Feature In-Reply-To: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> References: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> Message-ID: <20120510093921.4a9f581a@jdarcy-dt.usersys.redhat.com> On Thu, 10 May 2012 15:47:06 +1000 "Ian Latter" wrote: > I have published an untested "hide" module (compiled > against glusterfs-3.2.6); > > A simple method for hiding an underlying directory > structure from parent/up-stream bricks within > GlusterFS. In 2012 this code was spawned from > my incomplete 2009 dedupe brick code which used > this method to protect its internal hash database > from the user, above. > > http://midnightcode.org/projects/saturn/code/hide-0.5.tgz > > > I am serious when I mean untested - I've not even > loaded the module under Gluster, it simply compiles. > > > Let me know if there are tweaks that should be made > or considered. A couple of comments: * It should be sufficient to fail lookup for paths that match your pattern. If that fails, the caller will never get to any others. You can use the quota translator as an example for something like this. * If you want to continue supporting this yourself, then you can just leave the code as it is, though in that case you'll want to consider building it "out of tree" as I describe in my "Translator 101" post[1] or do for some of my own translators[2]. Otherwise you'll need to submit it as a patch through Gerrit according to our standard workflow[3]. You'll also need to fix some of the idiosyncratic indentation. I don't remember the current policy wrt copyright assignment, but that might be required too. 
[1] http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/ [2] https://github.com/jdarcy/negative-lookup [3] http://www.gluster.org/community/documentation/index.php/Development_Work_Flow From jdarcy at redhat.com Thu May 10 13:58:51 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 10 May 2012 09:58:51 -0400 Subject: [Gluster-devel] Fuse operations In-Reply-To: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> References: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> Message-ID: <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> On Thu, 10 May 2012 15:55:55 +1000 "Ian Latter" wrote: > So, I guess; > 1) Are all Fuse/FS ops handled by Gluster? > 2) Where can I find a complete list of the > Gluster fops, and not just those that have > been used in existing modules? GlusterFS operations for a translator are all defined in an xlator_fops structure. When building translators, it can also be convenient to look at the default_xxx and default_xxx_cbk functions for each fop you implement. Also, I forgot to mention in my comments on your "hide" translator that you can often use the default_xxx_cbk callback when you call STACK_WIND, instead of having to define your own trivial one. FUSE operations are listed by the fuse_opcode enum. You can check for yourself how closely this matches our list. They do have a few ops of their own, we have a few of their own, and a few of theirs actually map to our xlator_cbks instead of xlator_fops. The points of non-correspondence seem to be interrupt, bmap, poll and ioctl. Maybe Csaba can elaborate on what we do (or plan to do) about these. > 3) Is it safe to path match on loc_t? (i.e. is > it fully resolved such that I won't find > /etc/././././passwd)? This I could test .. Name/path resolution is an area that has changed pretty recently, so I'll let Avati or Amar field that one. From anand.avati at gmail.com Thu May 10 19:36:26 2012 From: anand.avati at gmail.com (Anand Avati) Date: Thu, 10 May 2012 12:36:26 -0700 Subject: [Gluster-devel] Fuse operations In-Reply-To: <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> References: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> Message-ID: On Thu, May 10, 2012 at 6:58 AM, Jeff Darcy wrote: > On Thu, 10 May 2012 15:55:55 +1000 > "Ian Latter" wrote: > > > So, I guess; > > 1) Are all Fuse/FS ops handled by Gluster? > > 2) Where can I find a complete list of the > > Gluster fops, and not just those that have > > been used in existing modules? > > GlusterFS operations for a translator are all defined in an xlator_fops > structure. When building translators, it can also be convenient to > look at the default_xxx and default_xxx_cbk functions for each fop you > implement. Also, I forgot to mention in my comments on your "hide" > translator that you can often use the default_xxx_cbk callback when you > call STACK_WIND, instead of having to define your own trivial one. > > FUSE operations are listed by the fuse_opcode enum. You can check for > yourself how closely this matches our list. They do have a few ops of > their own, we have a few of their own, and a few of theirs actually map > to our xlator_cbks instead of xlator_fops. The points of > non-correspondence seem to be interrupt, bmap, poll and ioctl. Maybe > Csaba can elaborate on what we do (or plan to do) about these. > > We might support interrupt sometime. Bmap - probably never. Poll, maybe. Ioctl - depeneds on what type of ioctl and requirement. 
> > 3) Is it safe to path match on loc_t? (i.e. is > > it fully resolved such that I won't find > > /etc/././././passwd)? This I could test .. > > Name/path resolution is an area that has changed pretty recently, so > I'll let Avati or Amar field that one. > The ".." interpretation is done by the client side VFS. Internal path construction does not use ".." and are always normalized. There are new situations where we now support non-absolute paths, but those are for GFID based addressing and ".." does not come into picture there. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From johnmark at redhat.com Thu May 10 21:41:08 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 10 May 2012 17:41:08 -0400 (EDT) Subject: [Gluster-devel] Bugzilla upgrade & planned outage - May 22 In-Reply-To: Message-ID: Pasting an email from bugzilla-announce: Red Hat Bugzilla (bugzilla.redhat.com) will be unavailable on May 22nd starting at 6 p.m. EDT [2200 UTC] to perform an upgrade from Bugzilla 3.6 to Bugzilla 4.2. We are hoping to be complete in no more than 3 hours barring any problems. Any services relying on bugzilla.redhat.com may not work properly during this time. Please be aware in case you need use of those services during the outage. Also *PLEASE* make sure any scripts or other external applications that rely on bugzilla.redhat.com are tested against our test server before the upgrade if you have not done so already. Let the Bugzilla Team know immediately of any issues found by reporting the bug in bugzilla.redhat.com against the Bugzilla product, version 4.2. A summary of the RPC changes is also included below. RPC changes from upstream Bugzilla 4.2: - Bug.* returns arrays for components, versions and aliases - Bug.* returns target_release array - Bug.* returns flag information (from Bugzilla 4.4) - Bug.search supports searching on keywords, dependancies, blocks - Bug.search supports quick searches, saved searches and advanced searches - Group.get has been added - Component.* and Flag.* have been added - Product.get has a component_names option to return just the component names. RPC changes from Red Hat Bugzilla 3.6: - This list may be incomplete. - This list excludes upstream changes from 3.6 that we inherited - Bug.update calls may use different column names. For example, in 3.6 you updated the 'short_desc' key if you wanted to change the summary. Now you must use the 'summary' key. This may be an inconeniance, but will make it much more maintainable in the long run. - Bug.search_new new becomes Bug.search. The 3.6 version of Bug.search is no longer available. - Product.* has been changed to match upstream code - Group.create has been added - RedHat.* and bugzilla.* calls that mirror official RPC calls are officially depreciated, and will be removed approximately two months after Red Hat Bugzilla 4.2 is released. To test against the new beta Bugzilla server, go to https://partner-bugzilla.redhat.com/ Thanks, JM From ian.latter at midnightcode.org Thu May 10 22:25:02 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:25:02 +1000 Subject: [Gluster-devel] automake Message-ID: <201205102225.q4AMP2X2018428@singularity.tronunltd.com> Thanks Avati, Yes, when I said that I hadn't use "configure" I meant "autotools" (though I didn't know it :) I think almost every project I download and build from scratch uses configure .. the last time I looked at the autotools was a few years ago now, maybe its time for a re-look .. 
my libraries are getting big enough to warrant it I suppose. Hadn't seen autogen before .. thanks for your help. Cheers, ----- Original Message ----- >From: "Anand Avati" >To: "Ian Latter" >Subject: Re: [Gluster-devel] automake >Date: Tue, 08 May 2012 23:08:41 -0700 > > You might want to read autobook for the general theory behind autotools. > Here's a quick summary - > > aclocal prepares the running of autotools. > autoheader prepares autotools to generate a config.h to be consumed by C > code > configure.ac is the "source" to discover the build system and accept user > parameters > autoconf converts configure.ac to configure > Makefile.am is the "source" to define what is to be built and how. > automake converts Makefile.am to Makefile.in > > till here everything is scripted in ./autogen.sh > > running configure creates Makefile out of Makefile.in > > now run make :) > > Avati > > On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > > > Hello, > > > > > > I have built a new module and I can't seem to > > get the changed makefiles to be built. I have not > > used "configure" in any of my projects and I'm > > not seeing an answer from my google searches. > > > > The error that I get is during the "make" where > > glusterfs-3.2.6/missing errors at line 52 > > "automake-1.9: command not found". > > > > This is a newer RedHat environment and it has > > automake 1.11 .. if I cp 1.11 to 1.9 I get other > > errors ... libtool is reporting that the automake > > version is 1.11.1. I believe that it is getting the > > 1.9 version from Gluster ... > > > > How do I get a new Makefile.am and Makefile.in > > to work in this structure? > > > > > > > > Cheers, > > > > > > > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > > -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 22:26:22 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:26:22 +1000 Subject: [Gluster-devel] automake Message-ID: <201205102226.q4AMQMEC018461@singularity.tronunltd.com> > > You might want to read autobook for the general theory behind autotools. > > Here's a quick summary - > > > > aclocal prepares the running of autotools. > > autoheader prepares autotools to generate a config.h to be consumed by C > > code > > configure.ac is the "source" to discover the build system and accept user > > parameters > > autoconf converts configure.ac to configure > > Makefile.am is the "source" to define what is to be built and how. > > automake converts Makefile.am to Makefile.in > > > > till here everything is scripted in ./autogen.sh > > > > running configure creates Makefile out of Makefile.in > > > > now run make :) > > > > Best way to learn autotools is copy-paste-customize. In general, if > you are starting a new project, Debian has a nice little tool called > "autoproject". It will auto generate autoconf and automake files. Then > you start customizing it. > > GNU project should really merge all these tools in to one simple > coherent system. My build environment is Fedora but I'm assuming its there too .. if I get some time I'll have a poke around .. Thanks for the info, appreciate it. -- Ian Latter Late night coder .. 
http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 22:44:32 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:44:32 +1000 Subject: [Gluster-devel] Hide Feature Message-ID: <201205102244.q4AMiW2Z018543@singularity.tronunltd.com> Sorry for the re-send Jeff, I managed to screw up the CC so the list didn't get it; > > Let me know if there are tweaks that should be made > > or considered. > > A couple of comments: > > * It should be sufficient to fail lookup for paths that > match your pattern. If that fails, the caller will > never get to any others. You can use the quota > translator as an example for something like this. Ok, this is interesting. So if someone calls another fop .. say "open" ... against my brick/module, something (Fuse?) will make another, dependent, call to lookup first? If that's true then I can cut this all down to size. > * If you want to continue supporting this yourself, > then you can just leave the code as it is, though in > that case you'll want to consider building it "out of > tree" as I describe in my "Translator 101" post[1] > or do for some of my own translators[2]. > Otherwise you'll need to submit it as a patch > through Gerrit according to our standard workflow[3]. Thanks for the Translator articles/posts, I hadn't seen those. Per my previous patches, I'll publish code on my site under the GPL and you guys (Gluster/RedHat) can run them through whatever processes you choose. If it gets included in the GlusterFS package, then that's fine. If it gets ignored by the GlusterFS package, then that's fine also. > You'll also need to fix some of the idiosyncratic > indentation. I don't remember the current policy wrt > copyright assignment, but that might be required too. The weird indentation style used is not mine .. its what I gathered from the Gluster code that I read through. > [1] > http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/ > > [2] https://github.com/jdarcy/negative-lookup > > [3] > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 23:39:58 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 09:39:58 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205102339.q4ANdwg8018739@singularity.tronunltd.com> > > Sure, I have my own vol files that do (did) what I wanted > > and I was supporting myself (and users); the question > > (and the point) is what is the GlusterFS *intent*? > > > The "intent" (more or less - I hate to use the word as it can imply a > commitment to what I am about to say, but there isn't one) is to keep the > bricks (server process) dumb and have the intelligence on the client side. > This is a "rough goal". There are cases where replication on the server > side is inevitable (in the case of NFS access) but we keep the software > architecture undisturbed by running a client process on the server machine > to achieve it. [There's a difference between intent and plan/roadmap] Okay. Unfortunately I am unable to leverage this - I tried to serve a Fuse->GlusterFS client mount point (of a Distribute volume) as a GlusterFS posix brick (for a Replicate volume) and it wouldn't play ball .. > We do plan to support "replication on the server" in the future while still > retaining the existing software architecture as much as possible. 
This is > particularly useful in Hadoop environment where the jobs expect write > performance of a single copy and expect copy to happen in the background. > We have the proactive self-heal daemon running on the server machines now > (which again is a client process which happens to be physically placed on > the server) which gives us many interesting possibilities - i.e, with > simple changes where we fool the client side replicate translator at the > time of transaction initiation that only the closest server is up at that > point of time and write to it alone, and have the proactive self-heal > daemon perform the extra copies in the background. This would be consistent > with other readers as they get directed to the "right" version of the file > by inspecting the changelogs while the background replication is in > progress. > > The intention of the above example is to give a general sense of how we > want to evolve the architecture (i.e, the "intention" you were referring > to) - keep the clients intelligent and servers dumb. If some intelligence > needs to be built on the physical server, tackle it by loading a client > process there (there are also "pathinfo xattr" kind of internal techniques > to figure out locality of the clients in a generic way without bringing > "server sidedness" into them in a harsh way) Okay .. But what happened to the "brick" architecture of stacking anything on anything? I think you point that out here ... > I'll > > write an rsyncd wrapper myself, to run on top of Gluster, > > if the intent is not allow the configuration I'm after > > (arbitrary number of disks in one multi-host environment > > replicated to an arbitrary number of disks in another > > multi-host environment, where ideally each environment > > need not sum to the same data capacity, presented in a > > single contiguous consumable storage layer to an > > arbitrary number of unintelligent clients, that is as fault > > tolerant as I choose it to be including the ability to add > > and offline/online and remove storage as I so choose) .. > > or switch out the whole solution if Gluster is heading > > away from my needs. I just need to know what the > > direction is .. I may even be able to help get you there if > > you tell me :) > > > > > There are good and bad in both styles (distribute on top v/s replicate on > top). Replicate on top gives you much better flexibility of configuration. > Distribute on top is easier for us developers. As a user I would like > replicate on top as well. But the problem today is that replicate (and > self-heal) does not understand "partial failure" of its subvolumes. If one > of the subvolume of replicate is a distribute, then today's replicate only > understands complete failure of the distribute set or it assumes everything > is completely fine. An example is self-healing of directory entries. If a > file is "missing" in one subvolume because a distribute node is temporarily > down, replicate has no clue why it is missing (or that it should keep away > from attempting to self-heal). Along the same lines, it does not know that > once a server is taken off from its distribute subvolume for good that it > needs to start recreating missing files. Hmm. I loved the brick idea. I don't like perverting it by trying to "see through" layers. 
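For readers who have never hand-written volume files, the stacking being debated here can be pictured with a short volfile sketch: a replicate volume whose subvolumes are two distribute sets, each built over plain posix bricks. All names and directories below are invented for the illustration (a real deployment would normally put protocol/client volumes under the cluster translators), so this only shows the shape of the stack, not a recommended configuration:

volume posix-a1
    type storage/posix
    option directory /export/a1
end-volume

volume posix-a2
    type storage/posix
    option directory /export/a2
end-volume

volume posix-b1
    type storage/posix
    option directory /export/b1
end-volume

volume posix-b2
    type storage/posix
    option directory /export/b2
end-volume

volume dist-a
    type cluster/distribute
    subvolumes posix-a1 posix-a2
end-volume

volume dist-b
    type cluster/distribute
    subvolumes posix-b1 posix-b2
end-volume

volume mirror
    type cluster/replicate
    subvolumes dist-a dist-b
end-volume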
In that context I can see two or three expected outcomes from someone building this type of stack (heh: a quick trick brick stack) - when a distribute child disappears; At the Distribute layer; 1) The distribute name space / stat space remains in tact, though the content is obviously not avail. 2) The distribute presentation is pure and true of its constituents, showing only the names / stats that are online/avail. In its standalone case, 2 is probably preferable as it allows clean add/start/stop/ remove capacity. At the Replicate layer; 3) replication occurs only where the name / stat space shows a gap 4) the replication occurs at any delta I don't think there's a real choice here, even if 3 were sensible, what would replicate do if there was a local name and even just a remote file size change, when there's no local content to update; it must be 4. In which case, I would expect that a replicate on top of a distribute with a missing child would suddenly see a delta that it would immediately set about repairing. > The effort to fix this seems to be big enough to disturb the inertia of > status quo. If this is fixed, we can definitely adopt a replicate-on-top > mode in glusterd. I'm not sure why there needs to be a "fix" .. wasn't the previous behaviour sensible? Or, if there is something to "change", then bolstering the distribute module might be enough - a combination of 1 and 2 above. Try this out: what if the Distribute layer maintained a full name space on each child, and didn't allow "recreation"? Say 3 children, one is broken/offline, so that /path/to/child/3/file is missing but is known to be missing (internally to Distribute). Then the Distribute brick can both not show the name space to the parent layers, but can also actively prevent manipulation of those files (the parent can neither stat /path/to/child/3/file nor unlink, nor create/write to it). If this change is meant to be permanent, then the administrative act of removing the child from distribute will then truncate the locked name space, allowing parents (be they users or other bricks, like Replicate) to act as they please (such as recreating the missing files). If you adhere to the principles that I thought I understood from 2009 or so then you should be able to let the users create unforeseen Gluster architectures without fear or impact. I.e. i) each brick is fully self contained * ii) physical bricks are the bread of a brick stack sandwich ** iii) any logical brick can appear above/below any other logical brick in a brick stack * Not mandating a 1:1 file mapping from layer to layer ** Eg: the Posix (bottom), Client (bottom), Server (top) and NFS (top) are all regarded as physical bricks. Thus it was my expectation that a dedupe brick (being logical) could either go above or below a distribute brick (also logical), for example. Or that an encryption brick could go on top of replicate which was on top of encryption which was on top of distribute which was on top of encryption on top of posix, for example. Or .. am I over simplifying the problem space? -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 23:52:43 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 09:52:43 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205102352.q4ANqhc6018790@singularity.tronunltd.com> Actually, I want to clarify this point; > But the problem today is that replicate (and > self-heal) does not understand "partial failure" > of its subvolumes. 
If one of the subvolume of > replicate is a distribute, then today's replicate > only understands complete failure of the > distribute set or it assumes everything is > completely fine. I haven't seen this in practice .. I have seen replicate attempt to repair anything that was "missing" and that both the replicate and the underlying bricks were still viable storage layers in that process ... ----- Original Message ----- >From: "Ian Latter" >To: "Anand Avati" >Subject: Re: [Gluster-devel] ZkFarmer >Date: Fri, 11 May 2012 09:39:58 +1000 > > > > Sure, I have my own vol files that do (did) what I wanted > > > and I was supporting myself (and users); the question > > > (and the point) is what is the GlusterFS *intent*? > > > > > > The "intent" (more or less - I hate to use the word as it > can imply a > > commitment to what I am about to say, but there isn't one) > is to keep the > > bricks (server process) dumb and have the intelligence on > the client side. > > This is a "rough goal". There are cases where replication > on the server > > side is inevitable (in the case of NFS access) but we keep > the software > > architecture undisturbed by running a client process on > the server machine > > to achieve it. > > [There's a difference between intent and plan/roadmap] > > Okay. Unfortunately I am unable to leverage this - I tried > to serve a Fuse->GlusterFS client mount point (of a > Distribute volume) as a GlusterFS posix brick (for a > Replicate volume) and it wouldn't play ball .. > > > We do plan to support "replication on the server" in the > future while still > > retaining the existing software architecture as much as > possible. This is > > particularly useful in Hadoop environment where the jobs > expect write > > performance of a single copy and expect copy to happen in > the background. > > We have the proactive self-heal daemon running on the > server machines now > > (which again is a client process which happens to be > physically placed on > > the server) which gives us many interesting possibilities > - i.e, with > > simple changes where we fool the client side replicate > translator at the > > time of transaction initiation that only the closest > server is up at that > > point of time and write to it alone, and have the > proactive self-heal > > daemon perform the extra copies in the background. This > would be consistent > > with other readers as they get directed to the "right" > version of the file > > by inspecting the changelogs while the background > replication is in > > progress. > > > > The intention of the above example is to give a general > sense of how we > > want to evolve the architecture (i.e, the "intention" you > were referring > > to) - keep the clients intelligent and servers dumb. If > some intelligence > > needs to be built on the physical server, tackle it by > loading a client > > process there (there are also "pathinfo xattr" kind of > internal techniques > > to figure out locality of the clients in a generic way > without bringing > > "server sidedness" into them in a harsh way) > > Okay .. But what happened to the "brick" architecture > of stacking anything on anything? I think you point > that out here ... 
> > > > I'll > > > write an rsyncd wrapper myself, to run on top of Gluster, > > > if the intent is not allow the configuration I'm after > > > (arbitrary number of disks in one multi-host environment > > > replicated to an arbitrary number of disks in another > > > multi-host environment, where ideally each environment > > > need not sum to the same data capacity, presented in a > > > single contiguous consumable storage layer to an > > > arbitrary number of unintelligent clients, that is as fault > > > tolerant as I choose it to be including the ability to add > > > and offline/online and remove storage as I so choose) .. > > > or switch out the whole solution if Gluster is heading > > > away from my needs. I just need to know what the > > > direction is .. I may even be able to help get you there if > > > you tell me :) > > > > > > > > There are good and bad in both styles (distribute on top > v/s replicate on > > top). Replicate on top gives you much better flexibility > of configuration. > > Distribute on top is easier for us developers. As a user I > would like > > replicate on top as well. But the problem today is that > replicate (and > > self-heal) does not understand "partial failure" of its > subvolumes. If one > > of the subvolume of replicate is a distribute, then > today's replicate only > > understands complete failure of the distribute set or it > assumes everything > > is completely fine. An example is self-healing of > directory entries. If a > > file is "missing" in one subvolume because a distribute > node is temporarily > > down, replicate has no clue why it is missing (or that it > should keep away > > from attempting to self-heal). Along the same lines, it > does not know that > > once a server is taken off from its distribute subvolume > for good that it > > needs to start recreating missing files. > > Hmm. I loved the brick idea. I don't like perverting it by > trying to "see through" layers. In that context I can see > two or three expected outcomes from someone building > this type of stack (heh: a quick trick brick stack) - when > a distribute child disappears; > > At the Distribute layer; > 1) The distribute name space / stat space > remains in tact, though the content is > obviously not avail. > 2) The distribute presentation is pure and true > of its constituents, showing only the names > / stats that are online/avail. > > In its standalone case, 2 is probably > preferable as it allows clean add/start/stop/ > remove capacity. > > At the Replicate layer; > 3) replication occurs only where the name / > stat space shows a gap > 4) the replication occurs at any delta > > I don't think there's a real choice here, even > if 3 were sensible, what would replicate do if > there was a local name and even just a remote > file size change, when there's no local content > to update; it must be 4. > > In which case, I would expect that a replicate > on top of a distribute with a missing child would > suddenly see a delta that it would immediately > set about repairing. > > > > The effort to fix this seems to be big enough to disturb > the inertia of > > status quo. If this is fixed, we can definitely adopt a > replicate-on-top > > mode in glusterd. > > I'm not sure why there needs to be a "fix" .. wasn't > the previous behaviour sensible? > > Or, if there is something to "change", then > bolstering the distribute module might be enough - > a combination of 1 and 2 above. 
> > Try this out: what if the Distribute layer maintained > a full name space on each child, and didn't allow > "recreation"? Say 3 children, one is broken/offline, > so that /path/to/child/3/file is missing but is known > to be missing (internally to Distribute). Then the > Distribute brick can both not show the name > space to the parent layers, but can also actively > prevent manipulation of those files (the parent > can neither stat /path/to/child/3/file nor unlink, nor > create/write to it). If this change is meant to be > permanent, then the administrative act of > removing the child from distribute will then > truncate the locked name space, allowing parents > (be they users or other bricks, like Replicate) to > act as they please (such as recreating the > missing files). > > If you adhere to the principles that I thought I > understood from 2009 or so then you should be > able to let the users create unforeseen Gluster > architectures without fear or impact. I.e. > > i) each brick is fully self contained * > ii) physical bricks are the bread of a brick > stack sandwich ** > iii) any logical brick can appear above/below > any other logical brick in a brick stack > > * Not mandating a 1:1 file mapping from layer > to layer > > ** Eg: the Posix (bottom), Client (bottom), > Server (top) and NFS (top) are all > regarded as physical bricks. > > Thus it was my expectation that a dedupe brick > (being logical) could either go above or below > a distribute brick (also logical), for example. > > Or that an encryption brick could go on top > of replicate which was on top of encryption > which was on top of distribute which was on > top of encryption on top of posix, for example. > > > Or .. am I over simplifying the problem space? > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From vbellur at redhat.com Fri May 11 07:06:38 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Fri, 11 May 2012 12:36:38 +0530 Subject: [Gluster-devel] release-3.3 branched out Message-ID: <4FACBA7E.6090801@redhat.com> A new branch release-3.3 has been created. You can checkout the branch via: $git checkout -b release-3.3 origin/release-3.3 rfc.sh has been updated to send patches to the appropriate branch. The plan is to have all 3.3.x releases happen off this branch. If you need any fix to be part of a 3.3.x release, please send out a backport of the same from master to release-3.3 after it has been accepted in master. Thanks, Vijay From manu at netbsd.org Fri May 11 07:29:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 11 May 2012 07:29:20 +0000 Subject: [Gluster-devel] release-3.3 branched out In-Reply-To: <4FACBA7E.6090801@redhat.com> References: <4FACBA7E.6090801@redhat.com> Message-ID: <20120511072920.GG18684@homeworld.netbsd.org> On Fri, May 11, 2012 at 12:36:38PM +0530, Vijay Bellur wrote: > A new branch release-3.3 has been created. You can checkout the branch via: Any chance someone merge my build fixes so that I can pullup to the new branch? 
http://review.gluster.com/3238 -- Emmanuel Dreyfus manu at netbsd.org From vbellur at redhat.com Fri May 11 07:43:13 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Fri, 11 May 2012 13:13:13 +0530 Subject: [Gluster-devel] release-3.3 branched out In-Reply-To: <20120511072920.GG18684@homeworld.netbsd.org> References: <4FACBA7E.6090801@redhat.com> <20120511072920.GG18684@homeworld.netbsd.org> Message-ID: <4FACC311.5020708@redhat.com> On 05/11/2012 12:59 PM, Emmanuel Dreyfus wrote: > On Fri, May 11, 2012 at 12:36:38PM +0530, Vijay Bellur wrote: >> A new branch release-3.3 has been created. You can checkout the branch via: > Any chance someone merge my build fixes so that I can pullup to the > new branch? > http://review.gluster.com/3238 Merged to master. Vijay From vijay at build.gluster.com Fri May 11 10:35:24 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Fri, 11 May 2012 03:35:24 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa41 released Message-ID: <20120511103527.5809B18009D@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa41/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa41.tar.gz This release is made off v3.3.0qa41 From 7220022 at gmail.com Sat May 12 15:22:57 2012 From: 7220022 at gmail.com (7220022) Date: Sat, 12 May 2012 19:22:57 +0400 Subject: [Gluster-devel] Gluster VSA for VMware ESX Message-ID: <012701cd3053$1d2e6110$578b2330$@gmail.com> Would love to test performance of Gluster Virtual Storage Appliance for VMware, but cannot get the demo. Emails and calls to Red Hat went unanswered. We've built a nice test system for the cluster at our lab, 8 modern servers running ESX4.1 and connected via 40gb InfiniBand fabric. Each server has 24 2.5" drives, SLC SSD and 10K SAS HDD-s connected to 6 LSI controllers with CacheCade (Pro 2.0 with write cache enabled,) 4 drives per controller. The plan is to test performance using bricks made of HDD-s cached with SSD-s, as well as HDD-s and SSD-s separately. Can anyone help getting the demo version of VSA? It's fine if it's a beta version, we just wanted to check the performance and scalability. -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Sun May 13 08:27:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 13 May 2012 10:27:20 +0200 Subject: [Gluster-devel] buffer corruption in io-stats Message-ID: <1kk12tm.1awqq7kf1joseM%manu@netbsd.org> I get a reproductible SIGSEGV with sources from latest git. iosfd is overwritten by the file path, it seems there is a confusion somewhere between iosfd->filename pointer value and pointed buffer (gdb) bt #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb37a7 in __gf_free (free_ptr=0x74656e2f) at mem-pool.c:258 #3 0xb9a85378 in io_stats_release (this=0xba3e3000, fd=0xb8f9d098) at io-stats.c:2420 #4 0xbbbafcc0 in fd_destroy (fd=0xb8f9d098) at fd.c:507 #5 0xbbbafdf8 in fd_unref (fd=0xb8f9d098) at fd.c:543 #6 0xbbbaf7cf in gf_fdptr_put (fdtable=0xbb77d070, fd=0xb8f9d098) at fd.c:393 #7 0xbb821147 in fuse_release () from /usr/local/lib/glusterfs/3git/xlator/mount/fuse.so #8 0xbb82a2e1 in fuse_thread_proc () from /usr/local/lib/glusterfs/3git/xlator/mount/fuse.so (gdb) frame 3 #3 0xb9a85378 in io_stats_release (this=0xba3e3000, fd=0xb8f9d098) at io-stats.c:2420 2420 GF_FREE (iosfd->filename); (gdb) print *iosfd $2 = {filename = 0x74656e2f
, data_written = 3418922014271107938, data_read = 7813586423313035891, block_count_write = {4788563690262784356, 3330756270057407571, 7074933154630937908, 28265, 0 }, block_count_read = { 0 }, opened_at = {tv_sec = 1336897011, tv_usec = 145734}} (gdb) x/10s iosfd 0xbb70f800: "/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin" -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 13 14:42:45 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 13 May 2012 16:42:45 +0200 Subject: [Gluster-devel] python version Message-ID: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> Hi There is a problem with python version detection in the configure script. The machine on which autotools is ran prior releasing glusterfs expands AM_PATH_PYTHON into a script that fails to accept python > 2.4. As I understand, a solution is to concatenate latest automake-1.12/m4/python.m4 into glusterfs' aclocal.m4. That way python up to 3.1 shoul be accepted. Opinions? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From renqiang at 360buy.com Mon May 14 01:20:32 2012 From: renqiang at 360buy.com (=?gb2312?B?yM7Hvw==?=) Date: Mon, 14 May 2012 09:20:32 +0800 Subject: [Gluster-devel] balance stoped Message-ID: <018001cd316f$c25a6f90$470f4eb0$@com> Hi,All! May I ask you a question? When we do balance on a volume, it stopped when moving the 505th?s file 0f 1006 files. Now we cannot restart it and also cannot cancel it. How can I do, please? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Mon May 14 01:22:43 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 11:22:43 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> Hello, I'm looking for a seek (lseek) implementation in one of the modules and I can't see one. Do I need to care about seeking if my module changes the file size (i.e. compresses) in Gluster? I would have thought that I did except that I believe that what I'm reading is that Gluster returns a NONSEEKABLE flag on file open (fuse_kernel.h at line 149). Does this mitigate the need to correct the user seeks? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 07:48:17 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 09:48:17 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> References: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> Message-ID: <4FB0B8C1.4020908@datalab.es> Hello Ian, there is no such thing as an explicit seek in glusterfs. Each readv, writev, (f)truncate and rchecksum have an offset parameter that tells you the position where the operation must be performed. If you make something that changes the size of the file you must make it in a way that it is transparent to upper translators. This means that all offsets you will receive are "real" (in your case, offsets in the uncompressed version of the file). You should calculate in some way the equivalent offset in the compressed version of the file and send it to the correspoding fop of the lower translators. In the same way, you must return in all iatt structures the real size of the file (not the compressed size). I'm not sure what is the intended use of NONSEEKABLE, but I think it is for special file types, like devices or similar that are sequential in nature. 
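To make the size-transparency rule concrete, here is a rough sketch of how a compression translator might patch the iatt it hands back, so that upper layers only ever see the uncompressed length. The cmp_* names are invented, the helper that recovers the logical size is a placeholder (a real translator would keep that length in an xattr or a header written with the data), and exact fop signatures vary between releases:

#include "xlator.h"
#include "defaults.h"

static uint64_t
cmp_uncompressed_size (xlator_t *this, struct iatt *buf)
{
        /* Placeholder: a real compression translator would recover the
         * logical (uncompressed) length from its own metadata rather
         * than returning the on-disk size unchanged. */
        return buf->ia_size;
}

int32_t
cmp_stat_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
              int32_t op_ret, int32_t op_errno, struct iatt *buf,
              dict_t *xdata)
{
        /* Rewrite the size before unwinding so callers never see the
         * compressed length. */
        if (op_ret == 0 && buf && IA_ISREG (buf->ia_type))
                buf->ia_size = cmp_uncompressed_size (this, buf);

        STACK_UNWIND_STRICT (stat, frame, op_ret, op_errno, buf, xdata);
        return 0;
}

int32_t
cmp_stat (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
{
        STACK_WIND (frame, cmp_stat_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->stat,
                    loc, xdata);
        return 0;
}

struct xlator_fops fops = {
        .stat = cmp_stat,
};

The same fix-up is needed in every callback that carries a struct iatt (lookup, readv, writev, setattr and so on), which is exactly the ia_size point made a little further down this thread. None of this depends on the NONSEEKABLE flag discussed just above.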
Anyway, this is a fuse flag that you can't return from a regular translator open fop. Xavi On 05/14/2012 03:22 AM, Ian Latter wrote: > Hello, > > > I'm looking for a seek (lseek) implementation in > one of the modules and I can't see one. > > Do I need to care about seeking if my module > changes the file size (i.e. compresses) in Gluster? > I would have thought that I did except that I believe > that what I'm reading is that Gluster returns a > NONSEEKABLE flag on file open (fuse_kernel.h at > line 149). Does this mitigate the need to correct > the user seeks? > > > Cheers, > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Mon May 14 09:51:59 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 19:51:59 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205140951.q4E9px5H001754@singularity.tronunltd.com> Hello Xavi, Ok - thanks. I was hoping that this was how read and write were working (i.e. with absolute offsets and not just getting relative offsets from the current seek point), however what of the raw seek command? len = lseek(fd, 0, SEEK_END); Upon successful completion, lseek() returns the resulting offset location as measured in bytes from the beginning of the file. Any idea on where the return value comes from? I will need to fake up a file size for this command .. ----- Original Message ----- >From: "Xavier Hernandez" >To: >Subject: Re: [Gluster-devel] lseek >Date: Mon, 14 May 2012 09:48:17 +0200 > > Hello Ian, > > there is no such thing as an explicit seek in glusterfs. Each readv, > writev, (f)truncate and rchecksum have an offset parameter that tells > you the position where the operation must be performed. > > If you make something that changes the size of the file you must make it > in a way that it is transparent to upper translators. This means that > all offsets you will receive are "real" (in your case, offsets in the > uncompressed version of the file). You should calculate in some way the > equivalent offset in the compressed version of the file and send it to > the correspoding fop of the lower translators. > > In the same way, you must return in all iatt structures the real size of > the file (not the compressed size). > > I'm not sure what is the intended use of NONSEEKABLE, but I think it is > for special file types, like devices or similar that are sequential in > nature. Anyway, this is a fuse flag that you can't return from a regular > translator open fop. > > Xavi > > On 05/14/2012 03:22 AM, Ian Latter wrote: > > Hello, > > > > > > I'm looking for a seek (lseek) implementation in > > one of the modules and I can't see one. > > > > Do I need to care about seeking if my module > > changes the file size (i.e. compresses) in Gluster? > > I would have thought that I did except that I believe > > that what I'm reading is that Gluster returns a > > NONSEEKABLE flag on file open (fuse_kernel.h at > > line 149). Does this mitigate the need to correct > > the user seeks? > > > > > > Cheers, > > > > > > > > -- > > Ian Latter > > Late night coder .. 
> > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 10:29:54 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 12:29:54 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205140951.q4E9px5H001754@singularity.tronunltd.com> References: <201205140951.q4E9px5H001754@singularity.tronunltd.com> Message-ID: <4FB0DEA2.3030805@datalab.es> Hello Ian, lseek calls are handled internally by the kernel and they never reach the user land for fuse calls. lseek only updates the current file offset that is stored inside the kernel file's structure. This value is what is passed to read/write fuse calls as an absolute offset. There isn't any problem in this behavior as long as you hide all size manipulations from fuse. If you write a translator that compresses a file, you should do so in a transparent manner. This means, basically, that: 1. Whenever you are asked to return the file size, you must return the size of the uncompressed file 2. Whenever you receive an offset, you must translate that offset to the corresponding offset in the compressed file and work with that 3. Whenever you are asked to read or write data, you must return the number of uncompressed bytes read or written (even if you have compressed the chunk of data to a smaller size and you have physically written less bytes). 4. All read requests must return uncompressed data (this seems obvious though) This guarantees that your manipulations are not seen in any way by any upper translator or even fuse, thus everything should work smoothly. If you respect these rules, lseek (and your translator) will work as expected. In particular, when a user calls lseek with SEEK_END, the kernel takes the size of the file from the internal kernel inode's structure. This size is obtained through a previous call to lookup or updated using the result of write operations. If you respect points 1 and 3, this value will be correct. In gluster there are a lot of fops that return a iatt structure. You must guarantee that all these functions return the correct size of the file in the field ia_size to be sure that everything works as expected. Xavi On 05/14/2012 11:51 AM, Ian Latter wrote: > Hello Xavi, > > > Ok - thanks. I was hoping that this was how read > and write were working (i.e. with absolute offsets > and not just getting relative offsets from the current > seek point), however what of the raw seek > command? > > len = lseek(fd, 0, SEEK_END); > > Upon successful completion, lseek() returns > the resulting offset location as measured in > bytes from the beginning of the file. > > Any idea on where the return value comes from? > I will need to fake up a file size for this command .. > > > > ----- Original Message ----- >> From: "Xavier Hernandez" >> To: >> Subject: Re: [Gluster-devel] lseek >> Date: Mon, 14 May 2012 09:48:17 +0200 >> >> Hello Ian, >> >> there is no such thing as an explicit seek in glusterfs. > Each readv, >> writev, (f)truncate and rchecksum have an offset parameter > that tells >> you the position where the operation must be performed. 
>> >> If you make something that changes the size of the file > you must make it >> in a way that it is transparent to upper translators. This > means that >> all offsets you will receive are "real" (in your case, > offsets in the >> uncompressed version of the file). You should calculate in > some way the >> equivalent offset in the compressed version of the file > and send it to >> the correspoding fop of the lower translators. >> >> In the same way, you must return in all iatt structures > the real size of >> the file (not the compressed size). >> >> I'm not sure what is the intended use of NONSEEKABLE, but > I think it is >> for special file types, like devices or similar that are > sequential in >> nature. Anyway, this is a fuse flag that you can't return > from a regular >> translator open fop. >> >> Xavi >> >> On 05/14/2012 03:22 AM, Ian Latter wrote: >>> Hello, >>> >>> >>> I'm looking for a seek (lseek) implementation in >>> one of the modules and I can't see one. >>> >>> Do I need to care about seeking if my module >>> changes the file size (i.e. compresses) in Gluster? >>> I would have thought that I did except that I believe >>> that what I'm reading is that Gluster returns a >>> NONSEEKABLE flag on file open (fuse_kernel.h at >>> line 149). Does this mitigate the need to correct >>> the user seeks? >>> >>> >>> Cheers, >>> >>> >>> >>> -- >>> Ian Latter >>> Late night coder .. >>> http://midnightcode.org/ >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at nongnu.org >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at nongnu.org >> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Mon May 14 11:18:22 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 21:18:22 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205141118.q4EBIMku002113@singularity.tronunltd.com> Hello Xavier, I don't have a problem with the principles, these were effectively how I was traveling (the notable difference is statfs which I want to pass-through unaffected, reporting the true file system capacity such that a du [stat] may sum to a greater value than a df [statfs]). In 2009 I had a mostly- functional hashing write function and a dubious read function (I stumbled when I had to open a file from within a fop). But I think what you're telling/showing me is that I have no deep understanding of the mapping of the system calls to their Fuse->Gluster fops - which is expected :) And, this is a better outcome than learning that Gluster has gaps in its framework with regard to my objective. I.e. I didn't know that lseek mapped to lookup. And the examples aren't comprehensive enough (rot-13 is the only one that really manipulates content, and it only plays with read and write, obviously because it has a 1:1 relationship with the data). This is the key, and not something that I was expecting; > In gluster there are a lot of fops that return a iatt > structure. You must guarantee that all these > functions return the correct size of the file in > the field ia_size to be sure that everything works > as expected. 
I'll do my best to build a comprehensive list of iatt returning fops from the examples ... but I'd say it'll take a solid peer review to get this hammered out properly. Thanks for steering me straight Xavi, appreciate it. ----- Original Message ----- >From: "Xavier Hernandez" >To: "Ian Latter" >Subject: Re: [Gluster-devel] lseek >Date: Mon, 14 May 2012 12:29:54 +0200 > > Hello Ian, > > lseek calls are handled internally by the kernel and they never reach > the user land for fuse calls. lseek only updates the current file offset > that is stored inside the kernel file's structure. This value is what is > passed to read/write fuse calls as an absolute offset. > > There isn't any problem in this behavior as long as you hide all size > manipulations from fuse. If you write a translator that compresses a > file, you should do so in a transparent manner. This means, basically, that: > > 1. Whenever you are asked to return the file size, you must return the > size of the uncompressed file > 2. Whenever you receive an offset, you must translate that offset to the > corresponding offset in the compressed file and work with that > 3. Whenever you are asked to read or write data, you must return the > number of uncompressed bytes read or written (even if you have > compressed the chunk of data to a smaller size and you have physically > written less bytes). > 4. All read requests must return uncompressed data (this seems obvious > though) > > This guarantees that your manipulations are not seen in any way by any > upper translator or even fuse, thus everything should work smoothly. > > If you respect these rules, lseek (and your translator) will work as > expected. > > In particular, when a user calls lseek with SEEK_END, the kernel takes > the size of the file from the internal kernel inode's structure. This > size is obtained through a previous call to lookup or updated using the > result of write operations. If you respect points 1 and 3, this value > will be correct. > > In gluster there are a lot of fops that return a iatt structure. You > must guarantee that all these functions return the correct size of the > file in the field ia_size to be sure that everything works as expected. > > Xavi > > On 05/14/2012 11:51 AM, Ian Latter wrote: > > Hello Xavi, > > > > > > Ok - thanks. I was hoping that this was how read > > and write were working (i.e. with absolute offsets > > and not just getting relative offsets from the current > > seek point), however what of the raw seek > > command? > > > > len = lseek(fd, 0, SEEK_END); > > > > Upon successful completion, lseek() returns > > the resulting offset location as measured in > > bytes from the beginning of the file. > > > > Any idea on where the return value comes from? > > I will need to fake up a file size for this command .. > > > > > > > > ----- Original Message ----- > >> From: "Xavier Hernandez" > >> To: > >> Subject: Re: [Gluster-devel] lseek > >> Date: Mon, 14 May 2012 09:48:17 +0200 > >> > >> Hello Ian, > >> > >> there is no such thing as an explicit seek in glusterfs. > > Each readv, > >> writev, (f)truncate and rchecksum have an offset parameter > > that tells > >> you the position where the operation must be performed. > >> > >> If you make something that changes the size of the file > > you must make it > >> in a way that it is transparent to upper translators. This > > means that > >> all offsets you will receive are "real" (in your case, > > offsets in the > >> uncompressed version of the file). 
You should calculate in > > some way the > >> equivalent offset in the compressed version of the file > > and send it to > >> the correspoding fop of the lower translators. > >> > >> In the same way, you must return in all iatt structures > > the real size of > >> the file (not the compressed size). > >> > >> I'm not sure what is the intended use of NONSEEKABLE, but > > I think it is > >> for special file types, like devices or similar that are > > sequential in > >> nature. Anyway, this is a fuse flag that you can't return > > from a regular > >> translator open fop. > >> > >> Xavi > >> > >> On 05/14/2012 03:22 AM, Ian Latter wrote: > >>> Hello, > >>> > >>> > >>> I'm looking for a seek (lseek) implementation in > >>> one of the modules and I can't see one. > >>> > >>> Do I need to care about seeking if my module > >>> changes the file size (i.e. compresses) in Gluster? > >>> I would have thought that I did except that I believe > >>> that what I'm reading is that Gluster returns a > >>> NONSEEKABLE flag on file open (fuse_kernel.h at > >>> line 149). Does this mitigate the need to correct > >>> the user seeks? > >>> > >>> > >>> Cheers, > >>> > >>> > >>> > >>> -- > >>> Ian Latter > >>> Late night coder .. > >>> http://midnightcode.org/ > >>> > >>> _______________________________________________ > >>> Gluster-devel mailing list > >>> Gluster-devel at nongnu.org > >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel > >> _______________________________________________ > >> Gluster-devel mailing list > >> Gluster-devel at nongnu.org > >> https://lists.nongnu.org/mailman/listinfo/gluster-devel > >> > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 11:47:10 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 13:47:10 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205141118.q4EBIMku002113@singularity.tronunltd.com> References: <201205141118.q4EBIMku002113@singularity.tronunltd.com> Message-ID: <4FB0F0BE.9030009@datalab.es> Hello Ian, I didn't thought in statfs. In this special case things are a bit harder for a compression translator. I think it's impossible to return accurate data without a considerable amount of work. Maybe some estimation of the available space based on the current achieved mean compression ratio would be sufficient, but never accurate. With more work you could even be able to say exactly how much space have been used, but the best you can do with the remaining space is an estimation. Regarding lseek, there isn't a map with lookup. Probably I haven't explained it as well as I wanted. There are basically two kinds of user mode calls. Those that use a string containing a filename to operate with (stat, unlink, open, creat, ...), and those that use a file descriptor (fstat, read, write, ...). The kernel does not work with names to handle files, so it has to translate the names to inodes to work with them. This means that any call that uses a string will need to make a "lookup" to get the associated inode (the only exception is creat, that creates a new inode without using lookup). This means that every filename based operation can generate a lookup request (although some caching mechanism may reduce the number of calls). 
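Seen from a translator, the two kinds of calls Xavier describes show up directly in the fop arguments: path-based fops carry a loc_t that the resolver has already looked up, while descriptor-based fops carry an fd_t. A minimal, hypothetical pass-through pair (demo_* names invented, signatures as in the 3.3-era tree) looks like this:

#include "xlator.h"
#include "defaults.h"

/* Path-based: 'loc' was filled in by a prior lookup, so loc->path and
 * loc->inode are already resolved when this fop runs. */
int32_t
demo_stat (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
{
        STACK_WIND (frame, default_stat_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->stat,
                    loc, xdata);
        return 0;
}

/* Descriptor-based: 'fd' carries the inode binding directly. */
int32_t
demo_fstat (call_frame_t *frame, xlator_t *this, fd_t *fd, dict_t *xdata)
{
        STACK_WIND (frame, default_fstat_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->fstat,
                    fd, xdata);
        return 0;
}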
All operations that work with a file descriptor do not generate a lookup request, because the file descriptor is already bound to an inode. In your particular case, to do an lseek you must have made a previous call to open (that would have generated a lookup request) or creat. Hope this better explains how kernel and gluster are bound... Xavi On 05/14/2012 01:18 PM, Ian Latter wrote: > Hello Xavier, > > > I don't have a problem with the principles, these > were effectively how I was traveling (the notable > difference is statfs which I want to pass-through > unaffected, reporting the true file system capacity > such that a du [stat] may sum to a greater value > than a df [statfs]). In 2009 I had a mostly- > functional hashing write function and a dubious > read function (I stumbled when I had to open a > file from within a fop). > > But I think what you're telling/showing me is that > I have no deep understanding of the mapping of > the system calls to their Fuse->Gluster fops - > which is expected :) And, this is a better outcome > than learning that Gluster has gaps in its > framework with regard to my objective. I.e. I > didn't know that lseek mapped to lookup. And > the examples aren't comprehensive enough > (rot-13 is the only one that really manipulates > content, and it only plays with read and write, > obviously because it has a 1:1 relationship with > the data). > > This is the key, and not something that I was > expecting; > >> In gluster there are a lot of fops that return a iatt >> structure. You must guarantee that all these >> functions return the correct size of the file in >> the field ia_size to be sure that everything works >> as expected. > I'll do my best to build a comprehensive list of iatt > returning fops from the examples ... but I'd say it'll > take a solid peer review to get this hammered out > properly. > > Thanks for steering me straight Xavi, appreciate > it. > > > > ----- Original Message ----- >> From: "Xavier Hernandez" >> To: "Ian Latter" >> Subject: Re: [Gluster-devel] lseek >> Date: Mon, 14 May 2012 12:29:54 +0200 >> >> Hello Ian, >> >> lseek calls are handled internally by the kernel and they > never reach >> the user land for fuse calls. lseek only updates the > current file offset >> that is stored inside the kernel file's structure. This > value is what is >> passed to read/write fuse calls as an absolute offset. >> >> There isn't any problem in this behavior as long as you > hide all size >> manipulations from fuse. If you write a translator that > compresses a >> file, you should do so in a transparent manner. This > means, basically, that: >> 1. Whenever you are asked to return the file size, you > must return the >> size of the uncompressed file >> 2. Whenever you receive an offset, you must translate that > offset to the >> corresponding offset in the compressed file and work with that >> 3. Whenever you are asked to read or write data, you must > return the >> number of uncompressed bytes read or written (even if you > have >> compressed the chunk of data to a smaller size and you > have physically >> written less bytes). >> 4. All read requests must return uncompressed data (this > seems obvious >> though) >> >> This guarantees that your manipulations are not seen in > any way by any >> upper translator or even fuse, thus everything should work > smoothly. >> If you respect these rules, lseek (and your translator) > will work as >> expected. 
>> >> In particular, when a user calls lseek with SEEK_END, the > kernel takes >> the size of the file from the internal kernel inode's > structure. This >> size is obtained through a previous call to lookup or > updated using the >> result of write operations. If you respect points 1 and 3, > this value >> will be correct. >> >> In gluster there are a lot of fops that return a iatt > structure. You >> must guarantee that all these functions return the correct > size of the >> file in the field ia_size to be sure that everything works > as expected. >> Xavi >> >> On 05/14/2012 11:51 AM, Ian Latter wrote: >>> Hello Xavi, >>> >>> >>> Ok - thanks. I was hoping that this was how read >>> and write were working (i.e. with absolute offsets >>> and not just getting relative offsets from the current >>> seek point), however what of the raw seek >>> command? >>> >>> len = lseek(fd, 0, SEEK_END); >>> >>> Upon successful completion, lseek() returns >>> the resulting offset location as measured in >>> bytes from the beginning of the file. >>> >>> Any idea on where the return value comes from? >>> I will need to fake up a file size for this command .. >>> >>> >>> >>> ----- Original Message ----- >>>> From: "Xavier Hernandez" >>>> To: >>>> Subject: Re: [Gluster-devel] lseek >>>> Date: Mon, 14 May 2012 09:48:17 +0200 >>>> >>>> Hello Ian, >>>> >>>> there is no such thing as an explicit seek in glusterfs. >>> Each readv, >>>> writev, (f)truncate and rchecksum have an offset parameter >>> that tells >>>> you the position where the operation must be performed. >>>> >>>> If you make something that changes the size of the file >>> you must make it >>>> in a way that it is transparent to upper translators. This >>> means that >>>> all offsets you will receive are "real" (in your case, >>> offsets in the >>>> uncompressed version of the file). You should calculate in >>> some way the >>>> equivalent offset in the compressed version of the file >>> and send it to >>>> the correspoding fop of the lower translators. >>>> >>>> In the same way, you must return in all iatt structures >>> the real size of >>>> the file (not the compressed size). >>>> >>>> I'm not sure what is the intended use of NONSEEKABLE, but >>> I think it is >>>> for special file types, like devices or similar that are >>> sequential in >>>> nature. Anyway, this is a fuse flag that you can't return >>> from a regular >>>> translator open fop. >>>> >>>> Xavi >>>> >>>> On 05/14/2012 03:22 AM, Ian Latter wrote: >>>>> Hello, >>>>> >>>>> >>>>> I'm looking for a seek (lseek) implementation in >>>>> one of the modules and I can't see one. >>>>> >>>>> Do I need to care about seeking if my module >>>>> changes the file size (i.e. compresses) in Gluster? >>>>> I would have thought that I did except that I believe >>>>> that what I'm reading is that Gluster returns a >>>>> NONSEEKABLE flag on file open (fuse_kernel.h at >>>>> line 149). Does this mitigate the need to correct >>>>> the user seeks? >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> >>>>> >>>>> -- >>>>> Ian Latter >>>>> Late night coder .. >>>>> http://midnightcode.org/ >>>>> >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at nongnu.org >>>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at nongnu.org >>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >>>> >>> -- >>> Ian Latter >>> Late night coder .. 
>>> http://midnightcode.org/ >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at nongnu.org >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From kkeithle at redhat.com Mon May 14 14:17:12 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 10:17:12 -0400 Subject: [Gluster-devel] python version In-Reply-To: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> Message-ID: <4FB113E8.0@redhat.com> On 05/13/2012 10:42 AM, Emmanuel Dreyfus wrote: > Hi > > There is a problem with python version detection in the configure > script. The machine on which autotools is ran prior releasing glusterfs > expands AM_PATH_PYTHON into a script that fails to accept python> 2.4. > > As I understand, a solution is to concatenate latest > automake-1.12/m4/python.m4 into glusterfs' aclocal.m4. That way python > up to 3.1 should be accepted. Opinions? The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked by ./autogen.sh file in preparation for building gluster. (You have to run autogen.sh to produce the ./configure file.) aclocal uses whatever python.m4 file you have on your system, e.g. /usr/share/aclocal-1.11/python.m4, which is also from the automake package. I presume whoever packages automake for a particular system is taking into consideration what other packages and versions are standard for the system and picks right version of automake. IOW picks the version of automake that has all the (hard-coded) versions of python to match the python they have on their system. If someone has installed a later version of python and not also updated to a compatible version of automake, that's not a problem that gluster should have to solve, or even try to solve. I don't believe we want to require our build process to download the latest-and-greatest version of automake. As a side note, I sampled a few currently shipping systems and see that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the appearances of supporting python 2.5 (and 3.0). Finally, after all that, note that the configure.ac file appears to be hard-coded to require python 2.x, so if anyone is trying to use python 3.x, that's doomed to fail until configure.ac is "fixed." Do we even know why python 2.x is required and why python 3.x can't be used? -- Kaleb From manu at netbsd.org Mon May 14 14:23:47 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 14:23:47 +0000 Subject: [Gluster-devel] python version In-Reply-To: <4FB113E8.0@redhat.com> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: <20120514142347.GA3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 10:17:12AM -0400, Kaleb S. KEITHLEY wrote: > The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked > by ./autogen.sh file in preparation for building gluster. (You have > to run autogen.sh to produce the ./configure file.) Right, then my plan will not work, and the only way to fix the problem is to upgrade automake on the machine that produces the gluterfs releases. 
> As a side note, I sampled a few currently shipping systems and see > that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and > 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the > appearances of supporting python 2.5 (and 3.0). You seem to take for granted that people building a glusterfs release will run autotools before running configure. This is not the way it should work: a released tarball should contain a configure script that works anywhere. The tarballs released up to at least 3.3.0qa40 have a configure script that cannot detect python > 2.4 -- Emmanuel Dreyfus manu at netbsd.org From kkeithle at redhat.com Mon May 14 14:31:32 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 10:31:32 -0400 Subject: [Gluster-devel] python version In-Reply-To: <20120514142347.GA3985@homeworld.netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> <20120514142347.GA3985@homeworld.netbsd.org> Message-ID: <4FB11744.1040907@redhat.com> On 05/14/2012 10:23 AM, Emmanuel Dreyfus wrote: > On Mon, May 14, 2012 at 10:17:12AM -0400, Kaleb S. KEITHLEY wrote: >> The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked >> by ./autogen.sh file in preparation for building gluster. (You have >> to run autogen.sh to produce the ./configure file.) > > Right, then my plan will not work, and the only way to fix the problem > is to upgrade automake on the machine that produces the glusterfs > releases. > >> As a side note, I sampled a few currently shipping systems and see >> that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and >> 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the >> appearances of supporting python 2.5 (and 3.0). > > You seem to take for granted that people building a glusterfs > release will run autotools before running configure. This is not > the way it should work: a released tarball should contain a > configure script that works anywhere. The tarballs released up to > at least 3.3.0qa40 have a configure script that cannot detect python> 2.4 > I looked at what I get when I checkout the source from the git repo and what I have to do to build from a freshly checked out source tree. And yes, we need to upgrade the build machines were we package the release tarballs. Right now is not a good time to do that. -- Kaleb From yknev.shankar at gmail.com Mon May 14 15:31:56 2012 From: yknev.shankar at gmail.com (Venky Shankar) Date: Mon, 14 May 2012 21:01:56 +0530 Subject: [Gluster-devel] python version In-Reply-To: <4FB113E8.0@redhat.com> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: [snip] > Finally, after all that, note that the configure.ac file appears to be > hard-coded to require python 2.x, so if anyone is trying to use python 3.x, > that's doomed to fail until configure.ac is "fixed." Do we even know why > python 2.x is required and why python 3.x can't be used? > python 2.x is required by geo-replication. Although geo-replication is code ready for python 3.x, it's not functionally tested with it. That's the reason configure.ac has 2.x hard-coded. > > -- > > Kaleb > > > ______________________________**_________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/**mailman/listinfo/gluster-devel > Thanks, -Venky -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From manu at netbsd.org Mon May 14 15:45:48 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 15:45:48 +0000 Subject: [Gluster-devel] python version In-Reply-To: References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: <20120514154548.GB3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 09:01:56PM +0530, Venky Shankar wrote: > python 2.x is required by geo-replication. Although geo-replication is code > ready for python 3.x, it's not functionally tested with it. That's the > reason configure.ac has 2.x hard-coded. Well, my problem is that python 2.5, python 2.6 and python 2.7 are not detected by configure. One need to patch configure in order to build with python 2.x (x > 4) installed. -- Emmanuel Dreyfus manu at netbsd.org From kkeithle at redhat.com Mon May 14 16:30:12 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 12:30:12 -0400 Subject: [Gluster-devel] python version In-Reply-To: <20120514154548.GB3985@homeworld.netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> <20120514154548.GB3985@homeworld.netbsd.org> Message-ID: <4FB13314.3060708@redhat.com> On 05/14/2012 11:45 AM, Emmanuel Dreyfus wrote: > On Mon, May 14, 2012 at 09:01:56PM +0530, Venky Shankar wrote: >> python 2.x is required by geo-replication. Although geo-replication is code >> ready for python 3.x, it's not functionally tested with it. That's the >> reason configure.ac has 2.x hard-coded. > > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > detected by configure. One need to patch configure in order to build > with python 2.x (x> 4) installed. > Seems like it would be easier to get autoconf and automake from the NetBSD packages and just run `./autogen.sh && ./configure` (Which, FWIW, is how glusterfs RPMs are built for the Fedora distributions. I'd wager for much the same reason.) -- Kaleb From manu at netbsd.org Mon May 14 18:46:07 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 20:46:07 +0200 Subject: [Gluster-devel] python version In-Reply-To: <4FB13314.3060708@redhat.com> Message-ID: <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Kaleb S. KEITHLEY wrote: > > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > > detected by configure. One need to patch configure in order to build > > with python 2.x (x> 4) installed. > > Seems like it would be easier to get autoconf and automake from the > NetBSD packages and just run `./autogen.sh && ./configure` I prefer patching the configure script. Running autogen introduce build dependencies on perl just to substitute a string on a single line: that's overkill. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From abperiasamy at gmail.com Mon May 14 19:25:20 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Mon, 14 May 2012 12:25:20 -0700 Subject: [Gluster-devel] python version In-Reply-To: <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: On Mon, May 14, 2012 at 11:46 AM, Emmanuel Dreyfus wrote: > Kaleb S. KEITHLEY wrote: > >> > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not >> > detected by configure. One need to patch configure in order to build >> > with python 2.x (x> ?4) installed. 
>> >> Seems like it would be easier to get autoconf and automake from the >> NetBSD packages and just run `./autogen.sh && ./configure` > > I prefer patching the configure script. Running autogen introduce build > dependencies on perl just to substitute a string on a single line: > that's overkill. > Who ever builds from source is required to run autogen.sh to produce env specific configure and build files. "configure" script should not be checked into git repository. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Mon May 14 23:58:18 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 14 May 2012 16:58:18 -0700 Subject: [Gluster-devel] python version In-Reply-To: References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: On Mon, May 14, 2012 at 12:25 PM, Anand Babu Periasamy < abperiasamy at gmail.com> wrote: > On Mon, May 14, 2012 at 11:46 AM, Emmanuel Dreyfus > wrote: > > Kaleb S. KEITHLEY wrote: > > > >> > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > >> > detected by configure. One need to patch configure in order to build > >> > with python 2.x (x> 4) installed. > >> > >> Seems like it would be easier to get autoconf and automake from the > >> NetBSD packages and just run `./autogen.sh && ./configure` > > > > I prefer patching the configure script. Running autogen introduce build > > dependencies on perl just to substitute a string on a single line: > > that's overkill. > > > > Who ever builds from source is required to run autogen.sh to produce > env specific configure and build files. Not quite. That's the whole point of having a configure script in the first place - to detect the environment at build time. One who builds from source should not require to run autogen.sh, just configure should be sufficient. Since configure itself is a generated script, and can possibly have mistakes and requirements change (like the one being discussed), that's when autogen.sh must be used to re-generate configure script. In this case however, the simplest approach would actually be to run autogen.sh till either: a) we upgrade the release build machine to use newer aclocal macros b) qualify geo-replication to work on python 3 and remove the check. Emmanuel, since the problem is not going to be a long lasting one (either of the two should fix your problem), I suggest you find a solution local to you in the interim. Even better, if someone can actually test and qualify geo-replication to work on python 3 it would ease solution "b" sooner. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 15 01:30:21 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 03:30:21 +0200 Subject: [Gluster-devel] python version In-Reply-To: Message-ID: <1kk4971.wh86xo1gypeoiM%manu@netbsd.org> Anand Avati wrote: > a) we upgrade the release build machine to use newer aclocal macros > > b) qualify geo-replication to work on python 3 and remove the check. Solution b is not enough: even if the configure script does not claim a specific version of python, it will still be unable to detect an installed python > 2.4 because it contains that: for am_cv_pathless_PYTHON in python python2 python2.4 python2.3 python2.2 python2.1 python2.0 none; do What about solution c? 
c) Tweak autogen.sh so that it patches generated configure and add the checks for python > 2.4 if they are missing: --- autogen.sh.orig 2012-05-15 03:22:48.000000000 +0200 +++ autogen.sh 2012-05-15 03:24:28.000000000 +0200 @@ -5,4 +5,6 @@ (libtoolize --automake --copy --force || glibtoolize --automake --copy --force) autoconf automake --add-missing --copy --foreign cd argp-standalone;./autogen.sh + +sed 's/for am_cv_pathless_PYTHON in python python2 python2.4/for am_cv_pathless_PYTHON in python python2 python3 python3.2 python3.1 python3.0 python2.7 2.6 python2.5 python2.4/' configure > configure.new && mv configure.new configure -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 15 04:20:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 06:20:29 +0200 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: Message-ID: <1kk4hl3.1qjswd01knbbvqM%manu@netbsd.org> Anand Babu Periasamy wrote: > AF_UNSPEC is should be be taken as IPv4/IPv6. It is named > appropriately. Default should be ipv4. > > I have not tested the patch. I did test it and it fixed the problem at mine. Here it is in gerrit: http://review.gluster.com/#change,3319 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 15 04:27:26 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 06:27:26 +0200 Subject: [Gluster-devel] NetBSD support in 3.3? Message-ID: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> Hi I still have a few pending submissions for NetBSD support in latest sources: http://review.gluster.com/3319 Use inet as default transport http://review.gluster.com/3320 Add missing (base|dir)name_r http://review.gluster.com/3321 NetBSD build fixes I would like to have 3.3 building without too many unintegrated patches on NetBSD. Is it worth working on pushing the changes above or is release-3.3 too close to release to expect such changes to get into it now? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From amarts at redhat.com Tue May 15 05:51:55 2012 From: amarts at redhat.com (Amar Tumballi) Date: Tue, 15 May 2012 11:21:55 +0530 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> Message-ID: <4FB1EEFB.2020509@redhat.com> On 05/15/2012 09:57 AM, Emmanuel Dreyfus wrote: > Hi > > I still have a few pending submissions for NetBSD support in latest > sources: > http://review.gluster.com/3319 Use inet as default transport > http://review.gluster.com/3320 Add missing (base|dir)name_r > http://review.gluster.com/3321 NetBSD build fixes > > I would like to have 3.3 building without too many unintegrated patches > on NetBSD. Is it worth working on pushing the changes above or is > release-3.3 too close to release to expect such changes to get into it > now? > Emmanuel, I understand your concerns, but I suspect we are very close to 3.3.0 release at this point of time, and hence it may be tight for taking these patches in. What we are planing is for a quicker 3.3.1 depending on the community feedback of 3.3.0 release, which should surely have your patches included. Hope that makes sense. Regards, Amar From manu at netbsd.org Tue May 15 10:13:07 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 10:13:07 +0000 Subject: [Gluster-devel] NetBSD support in 3.3? 
In-Reply-To: <4FB1EEFB.2020509@redhat.com> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> <4FB1EEFB.2020509@redhat.com> Message-ID: <20120515101307.GD3985@homeworld.netbsd.org> On Tue, May 15, 2012 at 11:21:55AM +0530, Amar Tumballi wrote: > I understand your concerns, but I suspect we are very close to 3.3.0 > release at this point of time, and hence it may be tight for taking > these patches in. Riht, I will therefore not request pullups to release-3.3 for theses changes, but I would appreciate if people could review them so that they have a chance to go in master. Will 3.3.1 be based on release-3.3, or will a new branch be forked? -- Emmanuel Dreyfus manu at netbsd.org From vbellur at redhat.com Tue May 15 10:14:38 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 15 May 2012 15:44:38 +0530 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <20120515101307.GD3985@homeworld.netbsd.org> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> <4FB1EEFB.2020509@redhat.com> <20120515101307.GD3985@homeworld.netbsd.org> Message-ID: <4FB22C8E.1@redhat.com> On 05/15/2012 03:43 PM, Emmanuel Dreyfus wrote: > Riht, I will therefore not request pullups to release-3.3 for theses > changes, but I would appreciate if people could review them so that they > have a chance to go in master. > > Will 3.3.1 be based on release-3.3, or will a new branch be forked? All 3.3.x releases will be based on release-3.3. It might be a good idea to rebase these changes to release-3.3 after they have been accepted in master. Vijay From manu at netbsd.org Tue May 15 11:51:36 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 13:51:36 +0200 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <4FB22C8E.1@redhat.com> Message-ID: <1kk51xf.8p0t3l1viyp1mM%manu@netbsd.org> Vijay Bellur wrote: > All 3.3.x releases will be based on release-3.3. It might be a good idea > to rebase these changes to release-3.3 after they have been accepted in > master. But after 3.3 release, as I understand. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ej1515.park at samsung.com Wed May 16 12:23:12 2012 From: ej1515.park at samsung.com (=?euc-kr?B?udrAusHY?=) Date: Wed, 16 May 2012 12:23:12 +0000 (GMT) Subject: [Gluster-devel] Asking about Gluster Performance Factors Message-ID: <0M44007MX7QO1Z40@mailout1.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201205162123598_1LI1H0JV.jpg Type: image/jpeg Size: 72722 bytes Desc: not available URL: From johnmark at redhat.com Wed May 16 14:38:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 10:38:50 -0400 (EDT) Subject: [Gluster-devel] Asking about Gluster Performance Factors In-Reply-To: <0M44007MX7QO1Z40@mailout1.samsung.com> Message-ID: <931185f2-f1b7-431f-96a0-1e7cb476b7d7@zmail01.collab.prod.int.phx2.redhat.com> Hi Ethan, ----- Original Message ----- > Dear Gluster Dev Team : > I'm Ethan, Assistant engineer in Samsung electronics. Reviewing your > paper, I have some questions of performance factors in gluster. Which paper? Can you provide a link? Also, please note that this is a community mailing list, and we cannot guarantee quick response times here - if you need a fast response, I'm happy to put you through to the right people. Thanks, John Mark Walker Gluster Community Guy > First, what does it mean the option "performance.cache-*"? Does it > mean read cache? 
If does, what's difference between the options > "prformance.cache-max-file-size" and "performance.cache-size" ? > I read your another paper("performance in a gluster system, versions > 3.1.x") and it says as below on Page 12, > (Gluster Native protocol does not implement write caching, as we > believe that the modest performance improvements from rite caching > do not justify the risk of cache coherency issues.) > Second, how much is the read throughput improved as configuring 2-way > replication? we need any statistics or something like that. > ("performance in a gluster system, versions 3.1.x") and it says as > below on Page 12, > (However, read throughput is generally improved by replication, as > reads can be delivered from either storage node) > I would ask you to return ASAP. From johnmark at redhat.com Wed May 16 15:56:32 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 11:56:32 -0400 (EDT) Subject: [Gluster-devel] Reminder: community.gluster.org In-Reply-To: <4b117086-34aa-4d8b-aede-ffae2e3abfbd@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1bb98699-b028-4f92-b8fd-603056aef57c@zmail01.collab.prod.int.phx2.redhat.com> Greetings all, Just a friendly reminder that we could use your help on community.gluster.org (hereafter 'c.g.o'). Someday in the near future, we will have 2-way synchronization between our mailing lists and c.g.o, but as of now, there are 2 places to ask and answer questions. I ask that for things with definite answers, even if they start out here on the mailing lists, please provide the question and answer on c.g.o. For lengthy conversations about using or developing GlusterFS, including ideas for new ideas, roadmaps, etc., the mailing lists are ideal for that. Why do we prefer c.g.o? Because it's Google-friendly :) So, if you see any existing questions over there that you are qualified to answer, please do weigh in with an answer. And as always, for quick "real-time" help, you're best served by visiting #gluster on the freenode IRC network. This has been a public service announcement from your friendly community guy. -JM From ndevos at redhat.com Wed May 16 19:56:04 2012 From: ndevos at redhat.com (Niels de Vos) Date: Wed, 16 May 2012 21:56:04 +0200 Subject: [Gluster-devel] Updated Wireshark packages for RHEL-6 and Fedora-17 available for testing Message-ID: <4FB40654.60703@redhat.com> Hi all, today I have merged support for GlusterFS 3.2 and 3.3 into one Wireshark 'dissector'. The packages with date 20120516 in the version support both the current stable 3.2.x version, and the latest 3.3.0qa41. Older 3.3.0 versions will likely have issues due to some changes in the RPC-AUTH protocol used. Updating to the latest qa41 release (or newer) is recommended anyway. I do not expect that we'll add support for earlier 3.3.0 releases. My repository with packages for RHEL-6 and Fedora-17 contains a .repo file for yum (save it in /etc/yum.repos.d): - http://repos.fedorapeople.org/repos/devos/wireshark-gluster/ RPMs for other Fedora or RHEL versions can be provided on request. Let me know if you need an other version (or architecture). Single patches for some different Wireshark versions are available from https://github.com/nixpanic/gluster-wireshark. A full history of commits can be found here: - https://github.com/nixpanic/gluster-wireshark-1.4/commits/master/ (Support for GlusterFS 3.3 was added by Akhila and Shree, thanks!) 
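For anyone who wants to poke at the dissector source, the following is a minimal sketch of how a Wireshark 1.x C dissector registers itself. It is illustration only and is not code from the repositories above: the demo protocol name, the hard-coded TCP port 24007 (glusterd's management port) and the function names are all assumptions made for this example, and the real Gluster dissector decodes GlusterFS RPC traffic (note the RPC-AUTH remark above) instead of just claiming a fixed port. The registration call shown is the Wireshark 1.4-era API; later releases renamed it dissector_add_uint().

/* Minimal demo dissector, for illustration only -- not the gluster
 * dissector.  Meant to be built inside a Wireshark 1.4-era source tree. */
#include "config.h"
#include <epan/packet.h>

static int proto_gluster_demo = -1;

/* Old-style dissector callback (returns void in Wireshark 1.x). */
static void
dissect_gluster_demo(tvbuff_t *tvb, packet_info *pinfo, proto_tree *tree)
{
    col_set_str(pinfo->cinfo, COL_PROTOCOL, "GLUSTER-DEMO");
    col_clear(pinfo->cinfo, COL_INFO);

    if (tree) {
        /* Claim the whole TCP payload for this protocol. */
        proto_tree_add_item(tree, proto_gluster_demo, tvb, 0, -1, FALSE);
    }
}

void
proto_register_gluster_demo(void)
{
    proto_gluster_demo = proto_register_protocol("Gluster Demo Protocol",
                                                 "GLUSTER-DEMO",
                                                 "glusterdemo");
}

void
proto_reg_handoff_gluster_demo(void)
{
    dissector_handle_t handle;

    handle = create_dissector_handle(dissect_gluster_demo, proto_gluster_demo);
    dissector_add("tcp.port", 24007, handle);   /* demo port, see note above */
}
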
Please test and report success and problems, file a issues on github: https://github.com/nixpanic/gluster-wireshark-1.4/issues Some functionality is still missing, but with the current status, it should be good for most analysing already. With more issues filed, it makes it easier to track what items are important. Of course, you can also respond to this email and give feedback :-) After some more cleanup of the code, this dissector will be passed on for review and inclusion in the upstream Wireshark project. Some more testing results is therefore much appreciated. Thanks, Niels From johnmark at redhat.com Wed May 16 21:12:41 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 17:12:41 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! For GlusterFS 3.3 Beta 4 In-Reply-To: Message-ID: <5456de9c-6c8b-4995-ad1e-720c9c52c74f@zmail01.collab.prod.int.phx2.redhat.com> Greetings, We are planning to have one more beta release tomorrow. If all goes as planned, this will be the release candidate. In conjunction with the beta, I thought we should have a 24-hour GlusterFest, starting tomorrow at 8pm - http://www.gluster.org/community/documentation/index.php/GlusterFest 'What's a GlusterFest?' you may be asking. Well, it's all of the below: - Testing the software. Install the new beta (when it's released tomorrow) and put it through its paces. We will put some basic testing procedures on the GlusterFest page here - http://www.gluster.org/community/documentation/index.php/GlusterFest - Feel free to create your own testing procedures and link to it from the GlusterFest page - Finding bugs. See the current list of bugs targeted for this release: http://bit.ly/beta4bugs - Fixing bugs. If you're the kind of person who wants to submit patches, see our development workflow doc: http://www.gluster.org/community/documentation/index.php/Development_Work_Flow - and then get to know Gerritt: http://review.gluster.com/ The GlusterFest page will be updated with some basic testing procedures tomorrow, and GlusterFest will officially begin at 8pm PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. If you need assistance, see #gluster on Freenode for "real-time" questions, gluster-users and community.gluster.org for general usage questions, and gluster-devel for anything related to building, patching, and bug-fixing. To keep up with GlusterFest activity, I'll be sending updates from the @glusterorg account on Twitter, and I'm sure there will be traffic on the mailing lists, as well. Happy testing and bug-hunting! -JM From ej1515.park at samsung.com Thu May 17 01:08:50 2012 From: ej1515.park at samsung.com (=?euc-kr?B?udrAusHY?=) Date: Thu, 17 May 2012 01:08:50 +0000 (GMT) Subject: [Gluster-devel] Asking about Gluster Performance Factors Message-ID: <0M4500FX676Q1150@mailout4.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201205171008201_QKNMBDIF.jpg Type: image/jpeg Size: 72722 bytes Desc: not available URL: From johnmark at redhat.com Thu May 17 04:28:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 00:28:50 -0400 (EDT) Subject: [Gluster-devel] Fwd: Asking about Gluster Performance Factors In-Reply-To: Message-ID: <153525d7-fe8c-4f5c-aa06-097fcb4b0980@zmail01.collab.prod.int.phx2.redhat.com> See response below from Ben England. Also, note that this question should probably go in gluster-users. 
-JM ----- Forwarded Message ----- From: "Ben England" To: "John Mark Walker" Sent: Wednesday, May 16, 2012 8:23:30 AM Subject: Re: [Gluster-devel] Asking about Gluster Performance Factors JM, see comments marked with ben>>> below. ----- Original Message ----- From: "???" To: gluster-devel at nongnu.org Sent: Wednesday, May 16, 2012 5:23:12 AM Subject: [Gluster-devel] Asking about Gluster Performance Factors Samsung Enterprise Portal mySingle May 16, 2012 Dear Gluster Dev Team : I'm Ethan, Assistant engineer in Samsung electronics. Reviewing your paper, I have some questions of performance factors in gluster. First, what does it mean the option "performance.cache-*"? Does it mean read cache? If does, what's difference between the options "prformance.cache-max-file-size" and "performance.cache-size" ? I read your another paper("performance in a gluster system, versions 3.1.x") and it says as below on Page 12, (Gluster Native protocol does not implement write caching, as we believe that the modest performance improvements from rite caching do not justify the risk of cache coherency issues.) ben>>> While gluster processes do not implement write caching internally, there are at least 3 ways to improve write performance in a Gluster system. - If you use a RAID controller with a non-volatile writeback cache, the RAID controller can buffer writes on behalf of the Gluster server and thereby reduce latency. - XFS or any other local filesystem used within the server "bricks" can do "write-thru" caching, meaning that the writes can be aggregated and can be kept in the Linux buffer cache so that subsequent read requests can be satisfied from this cache, transparent to Gluster processes. - there is a "write-behind" translator in the native client that will aggregate small sequential write requests at the FUSE layer into larger network-level write requests. If the smallest possible application I/O size is a requirement, sequential writes can also be efficiently aggregated by an NFS client. Second, how much is the read throughput improved as configuring 2-way replication? we need any statistics or something like that. ("performance in a gluster system, versions 3.1.x") and it says as below on Page 12, (However, read throughput is generally improved by replication, as reads can be delivered from either storage node) ben>>> Yes, reads can be satisfied by either server in a replication pair. Since the gluster native client only reads one of the two replicas, read performance should be approximately the same for 2-replica file system as it would be for a 1-replica file system. The difference in performance is with writes, as you would expect. Sincerely yours, Ethan Eunjun Park Assistant Engineer, Solution Development Team, Media Solution Center 416, Maetan 3-dong, Yeongtong-gu, Suwon-si, Gyeonggi-do 443-742, Korea Mobile : 010-8609-9532 E-mail : ej1515.park at samsung.com http://www.samsung.com/sec _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From johnmark at redhat.com Thu May 17 06:35:10 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 02:35:10 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: Message-ID: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? -JM ----- Forwarded Message ----- From: "Kaushal M (Code Review)" Sent: Wednesday, May 16, 2012 11:32:26 PM Subject: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() Kaushal M has uploaded a new change for review. Change subject: client/protocol : Changes in client3_1_getxattr() ...................................................................... client/protocol : Changes in client3_1_getxattr() Backporting change 1d02db63ae from master. Copy args->loc to local->loc in client3_1_getxattr(). This prevents logs with "(null) (--)" in client3_1_getxattr_cbk(). Also save args->name in local->name and print it in the log as well. BUG: 812199 Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Signed-off-by: Kaushal M --- M xlators/protocol/client/src/client-helpers.c M xlators/protocol/client/src/client.h M xlators/protocol/client/src/client3_1-fops.c 3 files changed, 11 insertions(+), 2 deletions(-) git pull ssh://*/glusterfs refs/changes/50/3350/1 -- To view, visit http://review.gluster.com/3350 To unsubscribe, visit http://review.gluster.com/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Gerrit-PatchSet: 1 Gerrit-Project: glusterfs Gerrit-Branch: release-3.3 Gerrit-Owner: Kaushal M From rajesh at redhat.com Thu May 17 06:42:56 2012 From: rajesh at redhat.com (Rajesh Amaravathi) Date: Thu, 17 May 2012 02:42:56 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: +1 Regards, Rajesh Amaravathi, Software Engineer, GlusterFS RedHat Inc. ----- Original Message ----- From: "John Mark Walker" To: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 12:05:10 PM Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? -JM ----- Forwarded Message ----- From: "Kaushal M (Code Review)" Sent: Wednesday, May 16, 2012 11:32:26 PM Subject: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() Kaushal M has uploaded a new change for review. Change subject: client/protocol : Changes in client3_1_getxattr() ...................................................................... client/protocol : Changes in client3_1_getxattr() Backporting change 1d02db63ae from master. Copy args->loc to local->loc in client3_1_getxattr(). This prevents logs with "(null) (--)" in client3_1_getxattr_cbk(). Also save args->name in local->name and print it in the log as well. 
BUG: 812199 Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Signed-off-by: Kaushal M --- M xlators/protocol/client/src/client-helpers.c M xlators/protocol/client/src/client.h M xlators/protocol/client/src/client3_1-fops.c 3 files changed, 11 insertions(+), 2 deletions(-) git pull ssh://*/glusterfs refs/changes/50/3350/1 -- To view, visit http://review.gluster.com/3350 To unsubscribe, visit http://review.gluster.com/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Gerrit-PatchSet: 1 Gerrit-Project: glusterfs Gerrit-Branch: release-3.3 Gerrit-Owner: Kaushal M _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From vijay at gluster.com Thu May 17 06:55:42 2012 From: vijay at gluster.com (Vijay Bellur) Date: Thu, 17 May 2012 12:25:42 +0530 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> References: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FB4A0EE.40102@gluster.com> On 05/17/2012 12:05 PM, John Mark Walker wrote: > I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? Gerrit automatically sends out a notification to all registered users who are watching the project. Do we need an additional notification to gluster-devel if there's a considerable overlap between registered users of gluster-devel and gerrit? -Vijay From johnmark at redhat.com Thu May 17 07:26:23 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 03:26:23 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4A0EE.40102@gluster.com> Message-ID: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> There are close to 600 people now subscribed to gluster-devel - how many of them actually have an account on Gerritt? I honestly have no idea. Another thing this would do is send a subtle message to subscribers that this is not the place to discuss user issues, but perhaps there are better ways to do that. I've seen many projects do this - as well as send all bugzilla and github notifications, but I could also see some people getting annoyed. -JM ----- Original Message ----- > On 05/17/2012 12:05 PM, John Mark Walker wrote: > > I was thinking about sending these gerritt notifications to > > gluster-devel by default - what do y'all think? > > Gerrit automatically sends out a notification to all registered users > who are watching the project. Do we need an additional notification > to > gluster-devel if there's a considerable overlap between registered > users > of gluster-devel and gerrit? 
> > > -Vijay > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From ashetty at redhat.com Thu May 17 07:35:27 2012 From: ashetty at redhat.com (Anush Shetty) Date: Thu, 17 May 2012 13:05:27 +0530 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> References: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FB4AA3F.1090700@redhat.com> On 05/17/2012 12:56 PM, John Mark Walker wrote: > There are close to 600 people now subscribed to gluster-devel - how many of them actually have an account on Gerritt? I honestly have no idea. Another thing this would do is send a subtle message to subscribers that this is not the place to discuss user issues, but perhaps there are better ways to do that. > > I've seen many projects do this - as well as send all bugzilla and github notifications, but I could also see some people getting annoyed. > How about a weekly digest of the same. - Anush From manu at netbsd.org Thu May 17 09:02:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 17 May 2012 11:02:32 +0200 Subject: [Gluster-devel] Crashes with latest git code Message-ID: <1kk8js0.b6kp732ejixeM%manu@netbsd.org> Hi I get a lot of crashes on NetBSD with latest git code. looking at core dumps, it is obvious I get memory corruption, as I find various structure overwritten by texts (file path or content). Linking with electric fence produces a much earlier crash, always at the same place. Here is how it looks: Program terminated with signal 11, Segmentation fault. #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 (gdb) bt #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 #1 0xbaa5ec1e in gf_strdup (src=0x0) at ../../../../libglusterfs/src/mem-pool.h:119 #2 0xbaa76dbf in client3_1_getxattr (frame=0xbb77f5c0, this=0xba3cd000, data=0xbfbfe18c) at client3_1-fops.c:4641 #3 0xbaa59ab8 in client_getxattr (frame=0xbb77f5c0, this=0xba3cd000, loc=0xb9402dd0, name=0x0, xdata=0x0) at client.c:1452 #4 0xb9ac3c7d in afr_sh_metadata_sync_prepare (frame=0xba8026bc, this=0xba3ce000) at afr-self-heal-metadata.c:419 #5 0xb9ac428b in afr_sh_metadata_fix (frame=0xba8026bc, this=0xba3ce000, op_ret=0, op_errno=0) at afr-self-heal-metadata.c:522 #6 0xb9abeb2b in afr_sh_common_lookup_cbk (frame=0xba8026bc, cookie=0x1, this=0xba3ce000, op_ret=0, op_errno=0, inode=0xb8b001a0, buf=0xbfbfe424, xattr=0xba401394, postparent=0xbfbfe3bc) at afr-self-heal-common.c:1311 #7 0xbaa6dc10 in client3_1_lookup_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77f550) at client3_1-fops.c:2636 Frame 4 is this: STACK_WIND (frame, afr_sh_metadata_getxattr_cbk, priv->children[source], priv->children[source]->fops->getxattr, &local->loc, NULL, NULL); Then in frame 3, I get args.name = NULL client_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc, const char *name, dict_t *xdata) (...) args.name = name; (...) ret = proc->fn (frame, this, &args); In frame 2, args->name = NULL client3_1_getxattr (call_frame_t *frame, xlator_t *this, void *data) (...) args = data; (...) local->name = gf_strdup (args->name); And there we will crash in gf_strdup(). The root cause is afr_sh_metadata_sync_prepare() calling client_getxattr with NULL arguments. 
The fix is beyond my knowledge of glusterfs internals, but I am sure that some folks here will be able to comment. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From rabhat at redhat.com Thu May 17 09:11:55 2012 From: rabhat at redhat.com (Raghavendra Bhat) Date: Thu, 17 May 2012 05:11:55 -0400 (EDT) Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <1kk8js0.b6kp732ejixeM%manu@netbsd.org> Message-ID: <3044c1d3-8b15-4a9d-9d18-7343cf8a33f4@zmail01.collab.prod.int.phx2.redhat.com> Hi Emmanuel, A bug has already been filed for this (822385) and patch has been sent for the review (http://review.gluster.com/#change,3353). Regards, Raghavendra Bhat ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 2:32:32 PM Subject: [Gluster-devel] Crashes with latest git code Hi I get a lot of crashes on NetBSD with latest git code. looking at core dumps, it is obvious I get memory corruption, as I find various structure overwritten by texts (file path or content). Linking with electric fence produces a much earlier crash, always at the same place. Here is how it looks: Program terminated with signal 11, Segmentation fault. #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 (gdb) bt #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 #1 0xbaa5ec1e in gf_strdup (src=0x0) at ../../../../libglusterfs/src/mem-pool.h:119 #2 0xbaa76dbf in client3_1_getxattr (frame=0xbb77f5c0, this=0xba3cd000, data=0xbfbfe18c) at client3_1-fops.c:4641 #3 0xbaa59ab8 in client_getxattr (frame=0xbb77f5c0, this=0xba3cd000, loc=0xb9402dd0, name=0x0, xdata=0x0) at client.c:1452 #4 0xb9ac3c7d in afr_sh_metadata_sync_prepare (frame=0xba8026bc, this=0xba3ce000) at afr-self-heal-metadata.c:419 #5 0xb9ac428b in afr_sh_metadata_fix (frame=0xba8026bc, this=0xba3ce000, op_ret=0, op_errno=0) at afr-self-heal-metadata.c:522 #6 0xb9abeb2b in afr_sh_common_lookup_cbk (frame=0xba8026bc, cookie=0x1, this=0xba3ce000, op_ret=0, op_errno=0, inode=0xb8b001a0, buf=0xbfbfe424, xattr=0xba401394, postparent=0xbfbfe3bc) at afr-self-heal-common.c:1311 #7 0xbaa6dc10 in client3_1_lookup_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77f550) at client3_1-fops.c:2636 Frame 4 is this: STACK_WIND (frame, afr_sh_metadata_getxattr_cbk, priv->children[source], priv->children[source]->fops->getxattr, &local->loc, NULL, NULL); Then in frame 3, I get args.name = NULL client_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc, const char *name, dict_t *xdata) (...) args.name = name; (...) ret = proc->fn (frame, this, &args); In frame 2, args->name = NULL client3_1_getxattr (call_frame_t *frame, xlator_t *this, void *data) (...) args = data; (...) local->name = gf_strdup (args->name); And there we will crash in gf_strdup(). The root cause is afr_sh_metadata_sync_prepare() calling client_getxattr with NULL arguments. The fix is beyond my knowledge of glusterfs internals, but I am sure that some folks here will be able to comment. 
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From manu at netbsd.org Thu May 17 09:18:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 17 May 2012 11:18:29 +0200 Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <3044c1d3-8b15-4a9d-9d18-7343cf8a33f4@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kk8kq6.1m2qnzf1so7qgfM%manu@netbsd.org> Raghavendra Bhat wrote: > A bug has already been filed for this (822385) and patch has been sent for > the review (http://review.gluster.com/#change,3353). I looked at the patch, it does not fix the problem I reported: args->name is still NULL. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From rabhat at redhat.com Thu May 17 09:46:20 2012 From: rabhat at redhat.com (Raghavendra Bhat) Date: Thu, 17 May 2012 05:46:20 -0400 (EDT) Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <1kk8kq6.1m2qnzf1so7qgfM%manu@netbsd.org> Message-ID: In getxattr name is NULL means its equivalent listxattr. So args->name being NULL is ok. Process was crashing because it tried to do strdup (actually strlen in the gf_strdup) of the NULL pointer to a string. On wire we will send it as a null string with namelen set to 0 and protocol/server will understand it. On client side: req.name = (char *)args->name; if (!req.name) { req.name = ""; req.namelen = 0; } On server side: if (args.namelen) state->name = gf_strdup (args.name); ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Raghavendra Bhat" Cc: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 2:48:29 PM Subject: Re: [Gluster-devel] Crashes with latest git code Raghavendra Bhat wrote: > A bug has already been filed for this (822385) and patch has been sent for > the review (http://review.gluster.com/#change,3353). I looked at the patch, it does not fix the problem I reported: args->name is still NULL. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From jdarcy at redhat.com Thu May 17 11:47:52 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 17 May 2012 07:47:52 -0400 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4AA3F.1090700@redhat.com> References: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> <4FB4AA3F.1090700@redhat.com> Message-ID: <4FB4E568.8050601@redhat.com> On 05/17/2012 03:35 AM, Anush Shetty wrote: > > On 05/17/2012 12:56 PM, John Mark Walker wrote: >> There are close to 600 people now subscribed to gluster-devel - how many >> of them actually have an account on Gerritt? I honestly have no idea. >> Another thing this would do is send a subtle message to subscribers that >> this is not the place to discuss user issues, but perhaps there are better >> ways to do that. >> >> I've seen many projects do this - as well as send all bugzilla and github >> notifications, but I could also see some people getting annoyed. > > How about a weekly digest of the same. Excellent idea. 
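To illustrate Raghavendra's point from the getxattr thread above with something compilable: a NULL name is the listxattr case, so the missing name is legitimate and only the unguarded strdup() is the problem. The sketch below is demonstration code, not the glusterfs patch under review; the struct and function names are invented for the example, and it simply mirrors the client-side and server-side snippets quoted above (empty string plus namelen 0 on the wire).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Invented for this example; stands in for the RPC request fields. */
struct demo_getxattr_req {
    const char *name;     /* "" on the wire when no name was given  */
    size_t      namelen;  /* 0 tells the server "list all xattrs"   */
};

static int
demo_prepare_getxattr(struct demo_getxattr_req *req,
                      char **local_name, const char *name)
{
    *local_name = NULL;

    if (name != NULL) {
        /* Only copy when a specific xattr was requested;
         * strdup(NULL) is exactly the crash seen in the backtrace. */
        *local_name = strdup(name);
        if (*local_name == NULL)
            return -1;                /* ENOMEM */
        req->name    = name;
        req->namelen = strlen(name);
    } else {
        req->name    = "";            /* matches the client snippet above */
        req->namelen = 0;             /* server side checks namelen first */
    }
    return 0;
}

int
main(void)
{
    struct demo_getxattr_req req;
    char *local = NULL;

    demo_prepare_getxattr(&req, &local, NULL);           /* listxattr case */
    printf("listxattr: namelen=%zu\n", req.namelen);     /* 0, no crash    */

    demo_prepare_getxattr(&req, &local, "user.demo");
    printf("getxattr:  name=%s namelen=%zu\n", req.name, req.namelen);
    free(local);
    return 0;
}
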
From johnmark at redhat.com Thu May 17 16:15:59 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 12:15:59 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4E568.8050601@redhat.com> Message-ID: ----- Original Message ----- > On 05/17/2012 03:35 AM, Anush Shetty wrote: > > > > How about a weekly digest of the same. Sounds reasonable. Now we just have to figure out how to implement :) -JM From vijay at build.gluster.com Thu May 17 16:51:43 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Thu, 17 May 2012 09:51:43 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0beta4 released Message-ID: <20120517165144.1BB041803EB@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0beta4/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0beta4.tar.gz This release is made off From johnmark at redhat.com Thu May 17 18:08:01 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 14:08:01 -0400 (EDT) Subject: [Gluster-devel] glusterfs-3.3.0beta4 released In-Reply-To: <20120517165144.1BB041803EB@build.gluster.com> Message-ID: <864fe250-bfd3-49ca-9310-2fc601411b83@zmail01.collab.prod.int.phx2.redhat.com> Reminder: GlusterFS 3.3 has been branched on GitHub, so you can pull the latest code from this branch if you want to test new fixes after the beta was released: https://github.com/gluster/glusterfs/tree/release-3.3 Also, note that this release features a license change in some files. We noted that some developers could not contribute code to the project because of compatibility issues around GPLv3. So, as a compromise, we changed the licensing in files that we deemed client-specific to allow for more contributors and a stronger developer community. Those files are now dual-licensed under the LGPLv3 and the GPLv2. For text of both of these license, see these URLs: http://www.gnu.org/licenses/lgpl.html http://www.gnu.org/licenses/old-licenses/gpl-2.0.html To see the list of files we modified with the new licensing, see this patchset from Kaleb: http://review.gluster.com/#change,3304 If you have questions or comments about this change, please do reach out to me. Thanks, John Mark ----- Original Message ----- > > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0beta4/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0beta4.tar.gz > > This release is made off > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From johnmark at redhat.com Thu May 17 20:34:56 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 16:34:56 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! For GlusterFS 3.3 Beta 4 In-Reply-To: <5456de9c-6c8b-4995-ad1e-720c9c52c74f@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <88ecc073-688f-4edc-8ff3-ccba3b6142a3@zmail01.collab.prod.int.phx2.redhat.com> An update: Kaleb was kind enough to port his HekaFS testing page for Fedora to GlusterFS. If you're looking for a series of things to test, see this URL: http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests By tonight, I'll have a handy form for reporting your results. We are at T-6:30 hours and counting until GlusterFest begins in earnest. 
For all updates related to GlusterFest, see this page: http://www.gluster.org/community/documentation/index.php/GlusterFest Please do post any series of tests that you would like to run. In particular, we're looking to test some of the new features of GlusterFS 3.3: - Object storage - HDFS compatibility library - Granular locking - More proactive self-heal Happy hacking, JM ----- Original Message ----- > Greetings, > > We are planning to have one more beta release tomorrow. If all goes > as planned, this will be the release candidate. In conjunction with > the beta, I thought we should have a 24-hour GlusterFest, starting > tomorrow at 8pm - > http://www.gluster.org/community/documentation/index.php/GlusterFest > > 'What's a GlusterFest?' you may be asking. Well, it's all of the > below: > > > - Testing the software. Install the new beta (when it's released > tomorrow) and put it through its paces. We will put some basic > testing procedures on the GlusterFest page here - > http://www.gluster.org/community/documentation/index.php/GlusterFest > > - Feel free to create your own testing procedures and link to it > from the GlusterFest page > > > - Finding bugs. See the current list of bugs targeted for this > release: http://bit.ly/beta4bugs > > > - Fixing bugs. If you're the kind of person who wants to submit > patches, see our development workflow doc: > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > > - and then get to know Gerritt: http://review.gluster.com/ > > > The GlusterFest page will be updated with some basic testing > procedures tomorrow, and GlusterFest will officially begin at 8pm > PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup > tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. > > > If you need assistance, see #gluster on Freenode for "real-time" > questions, gluster-users and community.gluster.org for general usage > questions, and gluster-devel for anything related to building, > patching, and bug-fixing. > > > To keep up with GlusterFest activity, I'll be sending updates from > the @glusterorg account on Twitter, and I'm sure there will be > traffic on the mailing lists, as well. > > > Happy testing and bug-hunting! > > -JM > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From manu at netbsd.org Fri May 18 07:49:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 18 May 2012 07:49:29 +0000 Subject: [Gluster-devel] python version In-Reply-To: References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: <20120518074929.GJ3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 04:58:18PM -0700, Anand Avati wrote: > Emmanuel, since the problem is not going to be a long lasting one (either > of the two should fix your problem), I suggest you find a solution local to > you in the interim. I submitted a tiny hack that solves the problem for everyone until automake is upgraded on glusterfs build system: http://review.gluster.com/3360 -- Emmanuel Dreyfus manu at netbsd.org From johnmark at redhat.com Fri May 18 15:02:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Fri, 18 May 2012 11:02:50 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! 
For GlusterFS 3.3 Beta 4 In-Reply-To: <88ecc073-688f-4edc-8ff3-ccba3b6142a3@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: Looks like we have a few testers who have reported their results already: http://www.gluster.org/community/documentation/index.php/GlusterFest 12 more hours! -JM ----- Original Message ----- > An update: > > Kaleb was kind enough to port his HekaFS testing page for Fedora to > GlusterFS. If you're looking for a series of things to test, see > this URL: > http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests > > > By tonight, I'll have a handy form for reporting your results. We are > at T-6:30 hours and counting until GlusterFest begins in earnest. > For all updates related to GlusterFest, see this page: > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > Please do post any series of tests that you would like to run. In > particular, we're looking to test some of the new features of > GlusterFS 3.3: > > - Object storage > - HDFS compatibility library > - Granular locking > - More proactive self-heal > > > Happy hacking, > JM > > > ----- Original Message ----- > > Greetings, > > > > We are planning to have one more beta release tomorrow. If all goes > > as planned, this will be the release candidate. In conjunction with > > the beta, I thought we should have a 24-hour GlusterFest, starting > > tomorrow at 8pm - > > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > > 'What's a GlusterFest?' you may be asking. Well, it's all of the > > below: > > > > > > - Testing the software. Install the new beta (when it's released > > tomorrow) and put it through its paces. We will put some basic > > testing procedures on the GlusterFest page here - > > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > > - Feel free to create your own testing procedures and link to it > > from the GlusterFest page > > > > > > - Finding bugs. See the current list of bugs targeted for this > > release: http://bit.ly/beta4bugs > > > > > > - Fixing bugs. If you're the kind of person who wants to submit > > patches, see our development workflow doc: > > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > > > > - and then get to know Gerritt: http://review.gluster.com/ > > > > > > The GlusterFest page will be updated with some basic testing > > procedures tomorrow, and GlusterFest will officially begin at 8pm > > PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup > > tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. > > > > > > If you need assistance, see #gluster on Freenode for "real-time" > > questions, gluster-users and community.gluster.org for general > > usage > > questions, and gluster-devel for anything related to building, > > patching, and bug-fixing. > > > > > > To keep up with GlusterFest activity, I'll be sending updates from > > the @glusterorg account on Twitter, and I'm sure there will be > > traffic on the mailing lists, as well. > > > > > > Happy testing and bug-hunting! 
> > > > -JM > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > > From manu at netbsd.org Fri May 18 16:15:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 18 May 2012 16:15:20 +0000 Subject: [Gluster-devel] memory corruption in release-3.3 Message-ID: <20120518161520.GL3985@homeworld.netbsd.org> Hi I still get crashes caused by memory corruption with latest release-3.3. My test case is a rm -Rf on a large tree. It seems I crash in two places: First crash flavor (trav is sometimes unmapped memory, sometimes NULL) #0 0xbbbb60ad in __iobuf_select_arena (iobuf_pool=0xbb70d400, page_size=128) at iobuf.c:453 453 if (trav->passive_cnt) { (gdb) print trav $1 = (struct iobuf_arena *) 0x414d202c (gdb) bt #0 0xbbbb60ad in __iobuf_select_arena (iobuf_pool=0xbb70d400, page_size=128) at iobuf.c:453 #1 0xbbbb655a in iobuf_get2 (iobuf_pool=0xbb70d400, page_size=24) at iobuf.c:604 #2 0xbaa549c7 in client_submit_request () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #3 0xbaa732c5 in client3_1_open () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #4 0xbaa574e6 in client_open () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #5 0xb9abac10 in afr_sh_data_open () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #6 0xb9abacb9 in afr_self_heal_data () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #7 0xb9ac2751 in afr_sh_metadata_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #8 0xb9ac457a in afr_self_heal_metadata () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #9 0xb9abd93f in afr_sh_missing_entries_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #10 0xb9ac169b in afr_self_heal () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #11 0xb9ae2e5b in afr_launch_self_heal () #12 0xb9ae3de9 in afr_lookup_perform_self_heal () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #13 0xb9ae4804 in afr_lookup_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #14 0xb9ae4fab in afr_lookup_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #15 0xbaa6dc10 in client3_1_lookup_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #16 0xbbb69716 in rpc_clnt_handle_reply () from /usr/local/lib/libgfrpc.so.0 #17 0xbbb699b3 in rpc_clnt_notify () from /usr/local/lib/libgfrpc.so.0 #18 0xbbb65989 in rpc_transport_notify () from /usr/local/lib/libgfrpc.so.0 #19 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #20 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #21 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #22 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #23 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #24 0x08050078 in main () Second crash flavor (it looks more like a double free) Program terminated with signal 11, Segmentation fault. #0 0xbb92661e in ?? () from /lib/libc.so.12 (gdb) bt #0 0xbb92661e in ?? 
() from /lib/libc.so.12 #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb376f in __gf_free (free_ptr=0xbb70d160) at mem-pool.c:258 #3 0xbbb7e17d in data_destroy (data=0xba301d4c) at dict.c:135 #4 0xbbb7ee18 in data_unref (this=0xba301d4c) at dict.c:470 #5 0xbbb7eb6b in dict_destroy (this=0xba4022d0) at dict.c:395 #6 0xbbb7ecab in dict_unref (this=0xba4022d0) at dict.c:432 #7 0xbaa164ba in __qr_inode_free () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #8 0xbaa27164 in qr_forget () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #9 0xbbb9b221 in __inode_destroy (inode=0xb8b017e4) at inode.c:320 #10 0xbbb9d0a5 in inode_table_prune (table=0xba3cc160) at inode.c:1235 #11 0xbbb9b64e in inode_unref (inode=0xb8b017e4) at inode.c:445 #12 0xbbb85249 in loc_wipe (loc=0xb9402dd0) at xlator.c:530 #13 0xb9ae126e in afr_local_cleanup () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #14 0xb9a9c66b in afr_unlink_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #15 0xb9ad2d5b in afr_unlock_common_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #16 0xb9ad38a2 in afr_unlock_entrylk_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so ---Type to continue, or q to quit--- #17 0xbaa68370 in client3_1_entrylk_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #18 0xbbb69716 in rpc_clnt_handle_reply () from /usr/local/lib/libgfrpc.so.0 #19 0xbbb699b3 in rpc_clnt_notify () from /usr/local/lib/libgfrpc.so.0 #20 0xbbb65989 in rpc_transport_notify () from /usr/local/lib/libgfrpc.so.0 #21 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #22 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #23 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #24 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #25 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #26 0x08050078 in main () (gdb) frame 2 #2 0xbbbb376f in __gf_free (free_ptr=0xbb70d160) at mem-pool.c:258 258 FREE (free_ptr); (gdb) x/1w free_ptr 0xbb70d160: 538978863 -- Emmanuel Dreyfus manu at netbsd.org From amarts at redhat.com Sat May 19 06:15:09 2012 From: amarts at redhat.com (Amar Tumballi) Date: Sat, 19 May 2012 11:45:09 +0530 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <20120518161520.GL3985@homeworld.netbsd.org> References: <20120518161520.GL3985@homeworld.netbsd.org> Message-ID: <4FB73A6D.9050601@redhat.com> On 05/18/2012 09:45 PM, Emmanuel Dreyfus wrote: > Hi > > I still get crashes caused by memory corruption with latest release-3.3. > My test case is a rm -Rf on a large tree. It seems I crash in two places: > Emmanuel, Can you please file bug report? different bugs corresponding to different crash dumps will help us. That helps in tracking development internally. Regards, Amar From manu at netbsd.org Sat May 19 10:29:55 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 12:29:55 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <20120518161520.GL3985@homeworld.netbsd.org> Message-ID: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Second crash flavor (it looks more like a double free) Here it is again at a different place. This is in loc_wipe, where loc->path is free'ed. 
Looking at the code, I see that there are places where loc->path is allocated by gf_strdup(). I see other places where it is copied from another buffer. Since this is done without reference counts, it seems likely that there is a double free somewhere. Opinions? (gdb) bt #0 0xbb92652a in ?? () from /lib/libc.so.12 #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb376f in __gf_free (free_ptr=0xb8250040) at mem-pool.c:258 #3 0xbbb85269 in loc_wipe (loc=0xba4cd010) at xlator.c:534 #4 0xbaa5e68a in client_local_wipe (local=0xba4cd010) at client-helpers.c:125 #5 0xbaa614d5 in client3_1_open_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77fa20) at client3_1-fops.c:421 #6 0xbbb69716 in rpc_clnt_handle_reply (clnt=0xba3c51c0, pollin=0xbb77d220) at rpc-clnt.c:788 #7 0xbbb699b3 in rpc_clnt_notify (trans=0xbb70ec00, mydata=0xba3c51e0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-clnt.c:907 #8 0xbbb65989 in rpc_transport_notify (this=0xbb70ec00, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-transport.c:489 #9 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #10 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #11 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #12 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #13 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #14 0x08050078 in main () -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vijay at build.gluster.com Sat May 19 12:35:21 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Sat, 19 May 2012 05:35:21 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa42 released Message-ID: <20120519123524.842501803FC@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa42/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa42.tar.gz This release is made off v3.3.0qa42 From manu at netbsd.org Sat May 19 13:50:25 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 15:50:25 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Message-ID: <1kkcml0.c7hab41bl4auaM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Looking at the code, I see that there are places where loc->path is > allocated by gf_strdup(). I see other places where it is copied from > another buffer. Since this is done without reference counts, it seems > likely that there is a double free somewhere. Opinions? I added a second argument to gf_strdup() so that the calling function can pass __func__, and I started logging gf_strdup() allocations to track a possible double free. ANd the result is... the offending free() is done on a loc->path that was not allocated by gf_strdup(). Can it be allocated by another function? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 19 15:07:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 17:07:53 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Message-ID: <1kkcpny.16h3fbd1pfhutzM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Looking at the code, I see that there are places where loc->path is > allocated by gf_strdup(). I see other places where it is copied from > another buffer. 
Since this is done without reference counts, it seems > likely that there is a double free somewhere. Opinions? I found a bug: Thou shalt not free(3) memory dirname(3) returned On Linux basename() and dirname() return a pointer with the string passed as argument. On BSD flavors, basename() and dirname() return static storage, or pthread specific storage. Both behaviour are compliant, but calling free on the result in the second case is a bug. --- xlators/cluster/afr/src/afr-dir-write.c.orig 2012-05-19 16:45:30.000000000 +0200 +++ xlators/cluster/afr/src/afr-dir-write.c 2012-05-19 17:03:17.000000000 +0200 @@ -55,14 +55,22 @@ if (op_errno) *op_errno = ENOMEM; goto out; } - parent->path = dirname (child_path); + parent->path = gf_strdup( dirname (child_path) ); + if (!parent->path) { + if (op_errno) + *op_errno = ENOMEM; + goto out; + } parent->inode = inode_ref (child->parent); uuid_copy (parent->gfid, child->pargfid); ret = 0; out: + if (child_path) + GF_FREE(child_path); + return ret; } /* {{{ create */-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 19 17:34:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 19:34:51 +0200 Subject: [Gluster-devel] mkdir race condition Message-ID: <1kkcuhi.1vqgbxy1lxb8w2M%manu@netbsd.org> On a replicated volume, mkdir quickly followed by the rename of a new directory child fails. # rm -Rf test && mkdir test && touch test/a && mv test/a test/b mv: rename test/a to test/b: No such file or directory # rm -Rf test && mkdir test && sleep 1 && touch test/a && mv test/a test/b (it works) Client log: [2012-05-19 18:49:43.933090] W [client3_1-fops.c:327:client3_1_mkdir_cbk] 0-pfs-client-0: remote operation failed: No such file or directory. Path: /test (00000000-0000-0000-0000-000000000000) [2012-05-19 18:49:43.944883] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.946265] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.961028] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.961528] W [fuse-bridge.c:1515:fuse_rename_cbk] 0-glusterfs-fuse: 27: /test/a -> /test/b => -1 (No such file or directory) Server log: [2012-05-19 18:49:58.455280] E [posix-handle.c:412:posix_handle_mkdir_hashes] 0-pfs-posix: error mkdir hash-1 /export/wd3a/.glusterfs/f6/8b (No such file or directory) [2012-05-19 18:49:58.455384] W [posix-handle.c:521:posix_handle_soft] 0-pfs-posix: mkdir /export/wd3a/.glusterfs/f6/8b/f68b2a33-a649-4705-9dfd-40a15f22589a failed (No such file or directory) [2012-05-19 18:49:58.455425] E [posix.c:968:posix_mkdir] 0-pfs-posix: setting gfid on /export/wd3a/test failed [2012-05-19 18:49:58.455558] E [posix.c:1010:posix_mkdir] 0-pfs-posix: post-operation lstat on parent of /export/wd3a/test failed: No such file or directory [2012-05-19 18:49:58.455664] I [server3_1-fops.c:529:server_mkdir_cbk] 0-pfs-server: 41: MKDIR /test (00000000-0000-0000-0000-000000000000) ==> -1 (No such file or directory) [2012-05-19 18:49:58.467548] I [server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 46: ENTRYLK (null) (--) ==> -1 (No such file or directory) [2012-05-19 18:49:58.468990] I [server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 47: ENTRYLK (null) (--) ==> -1 (No such file or directory) [2012-05-19 18:49:58.483726] I 
[server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 51: ENTRYLK (null) (--) ==> -1 (No such file or directory) It says it fails, but it seems it succeeded: silo# getextattr -x trusted.gfid /export/wd3a/test /export/wd3a/test 000 f6 8b 2a 33 a6 49 47 05 9d fd 40 a1 5f 22 58 9a ..*3.IG... at ._"X. Client is release-3.3 from yesterday. Server is master branch from may 14th. Is it a known problem? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 05:36:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 07:36:02 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / Message-ID: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> On release-3.3, glusterd_is_path_in_use() in xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has extended attribute enabled, and makes it impossible to create a volume with bricks from other filesystems (with EA enabled), if / does not support extended attributes. Is it on purpose? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 05:53:35 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 01:53:35 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> Message-ID: <1d6e3018-e614-4273-883c-1cca9efaf0b8@zmail01.collab.prod.int.phx2.redhat.com> Emmanuel, The assumption of EA being enabled in / filesystem or any prefix of brick path is an accidental side-effect of the way glusterd_is_path_in_use() is used in glusterd_brick_create_path(). The error handling should be accommodative to ENOTSUP. In short it is a bug. Will send out a patch immediately. thanks, krish ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 11:06:02 AM Subject: [Gluster-devel] 3.3 requires extended attribute on / On release-3.3, glusterd_is_path_in_use() in xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has extended attribute enabled, and makes it impossible to create a volume with bricks from other filesystems (with EA enabled), if / does not support extended attributes. Is it on purpose? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From manu at netbsd.org Sun May 20 05:56:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 07:56:53 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> Message-ID: <1kkdul5.4vmrbe1owph67M%manu@netbsd.org> Emmanuel Dreyfus wrote: > On release-3.3, glusterd_is_path_in_use() in > xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has > extended attribute enabled, and makes it impossible to create a volume > with bricks from other filesystems (with EA enabled), if / does not > support extended attributes. And even with EA enabled on root, creating a volume loops forever on reading unexistant trusted.gfid and trusted.glusterfs.volume-id on brick's parent directory. It gets ENODATA and retry forever. If I patch the function to just set in_use = 0 and return 0, I can create a volume. 
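For reference, a minimal standalone sketch of how such a probe can walk up from the brick path and treat a missing attribute (ENODATA) or a filesystem without extended attributes (ENOTSUP) as "not in use" instead of retrying. This is an illustration only, not the glusterd code; it assumes the Linux getxattr(2) interface (NetBSD would use extattr_get_file(2)), and the buffer sizes and helper name are made up:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <libgen.h>
#include <sys/xattr.h>          /* Linux xattr API; NetBSD uses extattr_get_file(2) */

/* Return 1 if any ancestor of brick_path carries one of the keys,
 * 0 if none does, -1 on an unexpected error.  The loop always
 * terminates because dirname() eventually yields "/" (or "."). */
static int path_in_use(const char *brick_path)
{
        static const char *keys[] = {
                "trusted.gfid", "trusted.glusterfs.volume-id"
        };
        char curdir[4096], tmp[4096];
        char value[64];
        size_t i;

        snprintf(curdir, sizeof(curdir), "%s", brick_path);
        for (;;) {
                for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
                        if (getxattr(curdir, keys[i], value, sizeof(value)) >= 0)
                                return 1;       /* attribute present: in use */
                        if (errno != ENODATA && errno != ENOTSUP)
                                return -1;      /* real failure: do not retry */
                        /* ENODATA/ENOTSUP: absent or unsupported, keep walking up */
                }
                if (strcmp(curdir, "/") == 0 || strcmp(curdir, ".") == 0)
                        break;
                /* copy before calling dirname(): some libcs modify the argument,
                 * others return static storage */
                snprintf(tmp, sizeof(tmp), "%s", curdir);
                snprintf(curdir, sizeof(curdir), "%s", dirname(tmp));
        }
        return 0;
}

int main(int argc, char **argv)
{
        printf("in use: %d\n", path_in_use(argc > 1 ? argv[1] : "/export/wd3a"));
        return 0;
}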
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ian.latter at midnightcode.org Sun May 20 06:12:39 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 16:12:39 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Hello, Couple of questions that might help make my module a little more sane; 0) Is there any developer docco? I've just done another quick search and I can't see any. Let me know if there is and I'll try and answer the below myself. 1) What is the difference between STACK_WIND and STACK_WIND_COOKIE? I.e. I've only ever used STACK_WIND, when should I use it versus the other? 2) Is there a way to write linearly within a single function within Gluster (or is there a reason why I wouldn't want to do that)? RE 2: This may stem from my lack of understanding of the broader Gluster internals. I am performing multiple fops per fop, which is creating structural inelegances in the code that make me think I'm heading down the wrong rabbit hole. I want to say; read() { // pull in other content while(want more) { _lookup() _open() _read() _close() } return iovec } But the way I've understood the Gluster internal structure is that I need to operate in a chain of related functions; _read_lookup_cbk_open_cbk_read_cbk() { wind _close() } _read_lookup_cbk_open_cbk() { wind _read() add to local->iovec } _lookup_cbk() { wind _open() } read() { while(want more) { wind _lookup() } return local->iovec } Am I missing something - or is there a nicer way of doing this? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From manu at netbsd.org Sun May 20 06:13:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 08:13:32 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1d6e3018-e614-4273-883c-1cca9efaf0b8@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kkdvl3.1p663u6iyul1oM%manu@netbsd.org> Krishnan Parthasarathi wrote: > Will send out a patch immediately. Great :-) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 06:13:33 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 08:13:33 +0200 Subject: [Gluster-devel] mkdir race condition In-Reply-To: <1kkcuhi.1vqgbxy1lxb8w2M%manu@netbsd.org> Message-ID: <1kkdvma.10s8o2rtrmcvpM%manu@netbsd.org> Emmanuel Dreyfus wrote: > On a replicated volume, mkdir quickly followed by the rename of a new > directory child fails. > > # rm -Rf test && mkdir test && touch test/a && mv test/a test/b > mv: rename test/a to test/b: No such file or directory > # rm -Rf test && mkdir test && sleep 1 && touch test/a && mv test/a test/b > (it works) I just reinstalled server from release-3.3 and now things make more sense. Any directory creation will report failure but will succeed: bacasel# mkdir /gfs/manu mkdir: /gfs/manu: No such file or directory bacasel# cd /gfs bacasel# ls manu Server log reports it fails because: [2012-05-20 07:59:23.775789] E [posix-handle.c:412:posix_handle_mkdir_hashes] 0-pfs-posix: error mkdir hash-1 /export/wd3a/.glusterfs/ec/e2 (No such file or directory) It seems posix_handle_mkdir_hashes() attempts to mkdir two directories at once: ec/ec2. How is it supposed to work? Should parent directory be created somewhere else? 
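For what it's worth, the handle tree is two levels deep (.glusterfs/<byte0>/<byte1> of the gfid, as the f6/8b path in the earlier log shows), so the first-level directory has to exist before the second can be created. A standalone sketch of that approach follows; it is an illustration only, not the actual posix-handle.c code, and the /tmp/brick-test path and helper names are made up:

#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

/* mkdir that treats an already-existing directory as success */
static int mkdir_ok(const char *path, mode_t mode)
{
        if (mkdir(path, mode) == -1 && errno != EEXIST)
                return -1;
        return 0;
}

/* Create .glusterfs/<gfid[0]>/<gfid[1]> under the export directory,
 * making sure the first-level hash directory exists before the second. */
static int handle_mkdir_hashes(const char *export, const unsigned char gfid[16])
{
        char path[4096];

        snprintf(path, sizeof(path), "%s/.glusterfs/%02x", export, gfid[0]);
        if (mkdir_ok(path, 0700) == -1)
                return -1;

        snprintf(path, sizeof(path), "%s/.glusterfs/%02x/%02x",
                 export, gfid[0], gfid[1]);
        return mkdir_ok(path, 0700);
}

int main(void)
{
        const unsigned char gfid[16] = { 0xec, 0xe2 }; /* only the first two bytes matter here */
        char glusterfs_dir[4096];

        mkdir_ok("/tmp/brick-test", 0700);
        snprintf(glusterfs_dir, sizeof(glusterfs_dir), "/tmp/brick-test/.glusterfs");
        mkdir_ok(glusterfs_dir, 0700);

        if (handle_mkdir_hashes("/tmp/brick-test", gfid) == -1)
                perror("handle_mkdir_hashes");
        return 0;
}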
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 06:36:44 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 02:36:44 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdul5.4vmrbe1owph67M%manu@netbsd.org> Message-ID: <80c2c170-133e-4509-9ac5-062293a199ad@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 11:26:53 AM Subject: Re: [Gluster-devel] 3.3 requires extended attribute on / Emmanuel Dreyfus wrote: > On release-3.3, glusterd_is_path_in_use() in > xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has > extended attribute enabled, and makes it impossible to create a volume > with bricks from other filesystems (with EA enabled), if / does not > support extended attributes. > And even with EA enabled on root, creating a volume loops forever on > reading unexistant trusted.gfid and trusted.glusterfs.volume-id on > brick's parent directory. It gets ENODATA and retry forever. If I patch > the function to just set in_use = 0 and return 0, I can create a volume. It is strange that the you see glusterd_path_in_use() loop forever. If I am not wrong, the inner loop checks for presence of trusted.gfid and trusted.glusterfs.volume-id and should exit after that, and the outer loop performs dirname on the path repeatedly and dirname(3) guarantees such an operation should return "/" eventually, which we check. It would be great if you could provide values of local variables, "used" and "curdir" when you see the looping forever. I dont have a setup to check this immediately. thanks, krish -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Sun May 20 06:47:57 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 16:47:57 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205200647.q4K6lvdN009529@singularity.tronunltd.com> > > And I am sick of the word-wrap on this client .. I think > > you've finally convinced me to fix it ... what's normal > > these days - still 80 chars? > > I used to line-wrap (gnus and cool emacs extensions). It doesn't make > sense to line wrap any more. Let the email client handle it depending > on the screen size of the device (mobile / tablet / desktop). FYI found this; an hour of code parsing in the mail software and it turns out that it had no wrapping .. it came from the stupid textarea tag in the browser (wrap="hard"). Same principle (server side coded, non client savvy) - now set to "soft". So hopefully fixed :) Cheers. -- Ian Latter Late night coder .. http://midnightcode.org/ From kparthas at redhat.com Sun May 20 06:54:54 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 02:54:54 -0400 (EDT) Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Message-ID: Couple of questions that might help make my module a little more sane; 0) Is there any developer docco? I've just done another quick search and I can't see any. Let me know if there is and I'll try and answer the below myself. 1) What is the difference between STACK_WIND and STACK_WIND_COOKIE? I.e. 
I've only ever used STACK_WIND, when should I use it versus the other? STACK_WIND_COOKIE is used when we need to 'tie' the call wound with its corresponding callback. You can see this variant being used extensively in cluster xlators where it is used to identify the callback with the subvolume no. it is coming from. 2) Is there a way to write linearly within a single function within Gluster (or is there a reason why I wouldn't want to do that)? RE 2: This may stem from my lack of understanding of the broader Gluster internals. I am performing multiple fops per fop, which is creating structural inelegances in the code that make me think I'm heading down the wrong rabbit hole. I want to say; read() { // pull in other content while(want more) { _lookup() _open() _read() _close() } return iovec } But the way I've understood the Gluster internal structure is that I need to operate in a chain of related functions; _read_lookup_cbk_open_cbk_read_cbk() { wind _close() } _read_lookup_cbk_open_cbk() { wind _read() add to local->iovec } _lookup_cbk() { wind _open() } read() { while(want more) { wind _lookup() } return local->iovec } Am I missing something - or is there a nicer way of doing this? The above method you are trying to use is the "continuation passing style" that is extensively used in afr-inode-read.c and afr-transaction.c to perform multiple internal fops on the trigger of a single fop from the application. cluster/afr may give you some ideas on how you could structure it if you like that more. The other method I can think of (not sure if it would suit your needs) is to use the syncop framework (see libglusterfs/src/syncop.c). This allows one to make a 'synchronous' glusterfs fop. inside a xlator. The downside is that you can only make one call at a time. This may not be acceptable for cluster xlators (ie, xlator with more than one child xlator). Hope that helps, krish _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Sun May 20 07:23:12 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 17:23:12 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200723.q4K7NCO3009706@singularity.tronunltd.com> > 1) What is the difference between STACK_WIND > and STACK_WIND_COOKIE? I.e. I've only > ever used STACK_WIND, when should I use > it versus the other? > > STACK_WIND_COOKIE is used when we need to 'tie' the call > wound with its corresponding callback. You can see this > variant being used extensively in cluster xlators where it > is used to identify the callback with the subvolume no. it > is coming from. Ok - thanks. I will take a closer look at the examples for this .. this may help me ... > 2) Is there a way to write linearly within a single > function within Gluster (or is there a reason > why I wouldn't want to do that)? > > > RE 2: > > This may stem from my lack of understanding > of the broader Gluster internals. I am performing > multiple fops per fop, which is creating structural > inelegances in the code that make me think I'm > heading down the wrong rabbit hole. 
I want to > say; > > read() { > // pull in other content > while(want more) { > _lookup() > _open() > _read() > _close() > } > return iovec > } > > > But the way I've understood the Gluster internal > structure is that I need to operate in a chain of > related functions; > > _read_lookup_cbk_open_cbk_read_cbk() { > wind _close() > } > > _read_lookup_cbk_open_cbk() { > wind _read() > add to local->iovec > } > > _lookup_cbk() { > wind _open() > } > > read() { > while(want more) { > wind _lookup() > } > return local->iovec > } > > > > Am I missing something - or is there a nicer way of > doing this? > > The above method you are trying to use is the "continuation passing style" that > is extensively used in afr-inode-read.c and afr-transaction.c to perform multiple > internal fops on the trigger of a single fop from the application. cluster/afr may > give you some ideas on how you could structure it if you like that more. These may have been where I got that code style from originally .. I will go back to these two programs, thanks for the reference. I'm currently working my way through the afr-heal programs .. > The other method I can think of (not sure if it would suit your needs) > is to use the syncop framework (see libglusterfs/src/syncop.c). > This allows one to make a 'synchronous' glusterfs fop. inside a xlator. > The downside is that you can only make one call at a time. This may not > be acceptable for cluster xlators (ie, xlator with more than one child xlator). In the syncop framework, how much gets affected when I use it in my xlator. Does it mean that there's only one call at a time in the whole xlator (so the current write will stop all other reads) or is the scope only the fop (so that within this write, my child->fops are serial, but neighbouring reads on my xlator will continue in other threads)? And does that then restrict what can go above and below my xlator? I mean that my xlator isn't a cluster xlator but I would like it to be able to be used on top of (or underneath) a cluster xlator, will that no longer be possible? > Hope that helps, > krish Thanks Krish, every bit helps! -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Sun May 20 07:40:54 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 17:40:54 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200740.q4K7esfl009777@singularity.tronunltd.com> > > The other method I can think of (not sure if it would suit your needs) > > is to use the syncop framework (see libglusterfs/src/syncop.c). > > This allows one to make a 'synchronous' glusterfs fop. inside a xlator. > > The downside is that you can only make one call at a time. This may not > > be acceptable for cluster xlators (ie, xlator with more than one child xlator). > > In the syncop framework, how much gets affected when I > use it in my xlator. Does it mean that there's only one call > at a time in the whole xlator (so the current write will stop > all other reads) or is the scope only the fop (so that within > this write, my child->fops are serial, but neighbouring reads > on my xlator will continue in other threads)? And does that > then restrict what can go above and below my xlator? I > mean that my xlator isn't a cluster xlator but I would like it > to be able to be used on top of (or underneath) a cluster > xlator, will that no longer be possible? 
> I've just taken a look at xlators/cluster/afr/src/pump.c for some syncop usage examples and I really like what I see there. If syncop only serialises/syncs activity that I code within a given fop of my xlator and doesn't impose serial/ sync limits on the parents or children of my xlator then this looks like the right path. I want to be sure that it won't result in a globally syncronous outcome though (like ignoring a cache xlator under mine to get a true disk read) - I just need the internals of my calls to be linear. -- Ian Latter Late night coder .. http://midnightcode.org/ From manu at netbsd.org Sun May 20 08:11:04 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:11:04 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <80c2c170-133e-4509-9ac5-062293a199ad@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kke141.ffp9fr1meqkgbM%manu@netbsd.org> Krishnan Parthasarathi wrote: > It is strange that the you see glusterd_path_in_use() loop forever. If I > am not wrong, the inner loop checks for presence of trusted.gfid and > trusted.glusterfs.volume-id and should exit after that, and the outer loop > performs dirname on the path repeatedly and dirname(3) guarantees such an > operation should return "/" eventually, which we check. Here is my setup when I tried that: / with EA enabled /export/wd3a ibrick with EA enabled But I may have been testing with an untintended patch in glusterd_path_in_use(). I will retry with the right fix once it will be available. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 08:30:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:30:53 +0200 Subject: [Gluster-devel] mkdir race condition In-Reply-To: <1kkdvma.10s8o2rtrmcvpM%manu@netbsd.org> Message-ID: <1kke28c.rugeav1w049sdM%manu@netbsd.org> Emmanuel Dreyfus wrote: > It seems posix_handle_mkdir_hashes() attempts to mkdir two directories > at once: ec/ec2. How is it supposed to work? Should parent directory be > created somewhere else? This fixes the problem. Any comment? --- xlators/storage/posix/src/posix-handle.c.orig +++ xlators/storage/posix/src/posix-handle.c @@ -405,8 +405,16 @@ parpath = dirname (duppath); parpath = dirname (duppath); ret = mkdir (parpath, 0700); + if (ret == -1 && errno == ENOENT) { + char *tmppath = NULL; + + tmppath = strdupa(parpath); + ret = mkdir (dirname (tmppath), 0700); + if (ret == 0) + ret = mkdir (parpath, 0700); + } if (ret == -1 && errno != EEXIST) { gf_log (this->name, GF_LOG_ERROR, "error mkdir hash-1 %s (%s)", parpath, strerror (errno)); -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 08:47:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:47:02 +0200 Subject: [Gluster-devel] rename(2) race condition Message-ID: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> After I patched to fix the mkdir issue, I now encounter a race in rename(2). 
Most of the time it works, but sometimes: 3548 1 tar CALL open(0xbb9010e0,0xa02,0x180) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET open 8 3548 1 tar CALL __fstat50(8,0xbfbfe69c) 3548 1 tar RET __fstat50 0 3548 1 tar CALL write(8,0x8067880,0x16) 3548 1 tar GIO fd 8 wrote 22 bytes "Nnetbsd-5-1-2-RELEASE\n" 3548 1 tar RET write 22/0x16 3548 1 tar CALL close(8) 3548 1 tar RET close 0 3548 1 tar CALL lchmod(0xbb9010e0,0x1a4) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET lchmod 0 3548 1 tar CALL __lutimes50(0xbb9010e0,0xbfbfe6d8) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET __lutimes50 0 3548 1 tar CALL rename(0xbb9010e0,0x8071584) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET rename -1 errno 13 Permission denied I can reproduce it with the command below. It runs fine for a few seconds and then hit permission denied. It needs a level of hierarchy to exhibit the hebavior: just install a b will not fail. mkdir test && echo "xxx" > tmp/a while [ 1 ] ; do rm -f test/b && install test/a test/b ; done -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From mihai at patchlog.com Sun May 20 09:19:34 2012 From: mihai at patchlog.com (Mihai Secasiu) Date: Sun, 20 May 2012 12:19:34 +0300 Subject: [Gluster-devel] glusterfs on MacOSX Message-ID: <4FB8B726.10500@patchlog.com> Hello, I am trying to get glusterfs ( 3.2.6, server ) to work on MacOSX ( Lion - I think , darwin kernel 11.3 ). So far I've been able to make it compile with a few patches and --disable-fuse-client. I want to create a volume on a MacMini that will be a replica of another volume stored on a linux server in a different location. The volume stored on the MacMini would also have to be mounted on the macmini. Since the fuse client is broken because it's built to use macfuse and that doesn't work anymore on the latest MacOSX I want to mount the volume over nfs and I've been able to do that ( with a small patch to the xdr code ) but it's really really slow. It's so slow that mounting the volume through a remote node is a lot faster. Also mounting the same volume on a remote node is fast so the problem is definitely in the nfs server on the MacOSX. I did a strace ( dtruss ) on it and it seems like it's doing a lot of polling. Could this be the cause of the slowness ? If anyone wants to try this you can fetch it from https://github.com/mihaisecasiu/glusterfs/tree/release-3.2 Thanks From manu at netbsd.org Sun May 20 12:43:52 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 14:43:52 +0200 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> Message-ID: <1kkee8d.8hdhfs177z5zdM%manu@netbsd.org> Emmanuel Dreyfus wrote: > After I patched to fix the mkdir issue, I now encounter a race in > rename(2). Most of the time it works, but sometimes: And the problem onoy happens when running as an unprivilegied user. It works fine for root. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 14:14:10 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 10:14:10 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kke141.ffp9fr1meqkgbM%manu@netbsd.org> Message-ID: Emmanuel, I have submitted the fix for review: http://review.gluster.com/3380 I have not tested the fix with "/" having EA disabled. It would be great if you could confirm the looping forever doesn't happen with this fix. 
thanks, krish ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Krishnan Parthasarathi" Cc: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 1:41:04 PM Subject: Re: [Gluster-devel] 3.3 requires extended attribute on / Krishnan Parthasarathi wrote: > It is strange that the you see glusterd_path_in_use() loop forever. If I > am not wrong, the inner loop checks for presence of trusted.gfid and > trusted.glusterfs.volume-id and should exit after that, and the outer loop > performs dirname on the path repeatedly and dirname(3) guarantees such an > operation should return "/" eventually, which we check. Here is my setup when I tried that: / with EA enabled /export/wd3a ibrick with EA enabled But I may have been testing with an untintended patch in glusterd_path_in_use(). I will retry with the right fix once it will be available. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 21 04:51:59 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 06:51:59 +0200 Subject: [Gluster-devel] zero'ed local data in mdc_setattr_cbk Message-ID: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> Hi Here is a backtrace for a SIGSEGV in md-cache code. Note inode = NULL when mdc_inode_iatt_set() is called by mdc_setattr_cbk(). It seems that local got corupted in the later function #0 0xbbb3a7c9 in pthread_spin_lock () from /usr/lib/libpthread.so.1 #1 0xbaa09d8c in mdc_inode_prep (this=0xba3e5000, inode=0x0) at md-cache.c:267 #2 0xbaa0a1bf in mdc_inode_iatt_set (this=0xba3e5000, inode=0x0, iatt=0xb9401d40) at md-cache.c:384 #3 0xbaa0ee16 in mdc_setattr_cbk (frame=0xbb77f400, cookie=0xbb77f470, this=0xba3e5000, op_ret=0, op_errno=0, prebuf=0xb9401cd8, postbuf=0xb9401d40, xdata=0x0) at md-cache.c:1423 #4 0xbaa1d0ec in qr_fsetattr_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #5 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f0f0, cookie=0xbb77f160, this=0xba3e3000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #6 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f160, cookie=0xbb77f1d0, this=0xba3e2000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #7 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f1d0, cookie=0xbb77f240, this=0xba3e1000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #8 0xb9aa9d23 in afr_fsetattr_unwind (frame=0xba801ee8, this=0xba3d1000) at afr-inode-write.c:1160 #9 0xb9aa9f01 in afr_fsetattr_wind_cbk (frame=0xba801ee8, cookie=0x0, this=0xba3d1000, op_ret=0, op_errno=0, preop=0xbfbfe880, postop=0xbfbfe818, xdata=0x0) at afr-inode-write.c:1221 #10 0xbaa6a099 in client3_1_fsetattr_cbk (req=0xb90010d8, iov=0xb90010f8, count=1, myframe=0xbb77f010) at client3_1-fops.c:1897 #11 0xbbb6975e in rpc_clnt_handle_reply (clnt=0xba3c5270, pollin=0xbb77d220) at rpc-clnt.c:788 #12 0xbbb699fb in rpc_clnt_notify (trans=0xbb70f000, mydata=0xba3c5290, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-clnt.c:907 #13 0xbbb659c7 in rpc_transport_notify (this=0xbb70f000, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-transport.c:489 #14 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #15 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #16 0xbbbb281f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=2) at event.c:357 #17 0xbbbb2a8b in 
event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #18 0xbbbb2db7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #19 0x0805015e in main () (gdb) frame 3 #3 0xbaa0ee16 in mdc_setattr_cbk (frame=0xbb77f400, cookie=0xbb77f470, this=0xba3e5000, op_ret=0, op_errno=0, prebuf=0xb9401cd8, postbuf=0xb9401d40, xdata=0x0) at md-cache.c:1423 1423 mdc_inode_iatt_set (this, local->loc.inode, postbuf); (gdb) print *local $2 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, fd = 0xb8f9d054, linkname = 0x0, xattr = 0x0} -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 21 10:14:24 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 10:14:24 +0000 Subject: [Gluster-devel] zero'ed local data in mdc_setattr_cbk In-Reply-To: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> References: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> Message-ID: <20120521101424.GA10504@homeworld.netbsd.org> On Mon, May 21, 2012 at 06:51:59AM +0200, Emmanuel Dreyfus wrote: > Here is a backtrace for a SIGSEGV in md-cache code. Note inode = NULL > when mdc_inode_iatt_set() is called by mdc_setattr_cbk(). I submitted a patch to fix it, please review http://review.gluster.com/3383 -- Emmanuel Dreyfus manu at netbsd.org From jdarcy at redhat.com Mon May 21 12:24:30 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 21 May 2012 08:24:30 -0400 Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> References: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Message-ID: <4FBA33FE.3050602@redhat.com> On 05/20/2012 02:12 AM, Ian Latter wrote: > Hello, > > > Couple of questions that might help make my > module a little more sane; > > 0) Is there any developer docco? I've just done > another quick search and I can't see any. Let > me know if there is and I'll try and answer the > below myself. Your best bet right now (if I may say so) is the stuff I've posted on hekafs.org - the "Translator 101" articles plus the API overview at http://hekafs.org/dist/xlator_api_2.html > 1) What is the difference between STACK_WIND > and STACK_WIND_COOKIE? I.e. I've only > ever used STACK_WIND, when should I use > it versus the other? I see Krishnan has already covered this. > 2) Is there a way to write linearly within a single > function within Gluster (or is there a reason > why I wouldn't want to do that)? Any blocking ops would have to be built on top of async ops plus semaphores etc. because (unlike e.g. an HTTP server) the underlying sockets etc. are shared/multiplexed between users and activities. Thus you'd get much more context switching that way than if you stay within the async/continuation style. Some day in the distant future, I'd like to work some more on a preprocessor that turns linear code into async code so that it's easier to write but retains the performance and resource-efficiency advantages of an essentially async style. I did some work (http://pl.atyp.us/ripper/UserGuide.html) in this area several years ago, but it has probably bit-rotted to hell since then. With more recent versions of gcc and LLVM it should be possible to overcome some of the limitations that version had. 
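To make the "async ops plus semaphores" point concrete, here is a small self-contained illustration of the pattern; the asynchronous operation is simulated with a thread, whereas in a translator the wake-up would happen in the STACK_WIND callback. The extra sleep/wake-up per call is where the additional context switching comes from:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct sync_args {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             done;
        int             op_ret;
};

/* "callback": records the result and wakes the blocked caller */
static void op_cbk(struct sync_args *args, int op_ret)
{
        pthread_mutex_lock(&args->lock);
        args->op_ret = op_ret;
        args->done = 1;
        pthread_cond_signal(&args->cond);
        pthread_mutex_unlock(&args->lock);
}

/* stand-in for an async fop completing on some other thread */
static void *async_op(void *data)
{
        sleep(1);               /* pretend the real work happens elsewhere */
        op_cbk(data, 42);
        return NULL;
}

/* synchronous wrapper built on top of the async op */
static int op_sync(void)
{
        struct sync_args args = {
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, -1
        };
        pthread_t tid;

        pthread_create(&tid, NULL, async_op, &args);

        pthread_mutex_lock(&args.lock);
        while (!args.done)
                pthread_cond_wait(&args.cond, &args.lock);
        pthread_mutex_unlock(&args.lock);

        pthread_join(tid, NULL);
        return args.op_ret;
}

int main(void)
{
        printf("op_sync() returned %d\n", op_sync());
        return 0;
}

(Build with -pthread.)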
From manu at netbsd.org Mon May 21 16:27:21 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 18:27:21 +0200 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> Message-ID: <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> Emmanuel Dreyfus wrote: > 3548 1 tar CALL rename(0xbb9010e0,0x8071584) > 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" > 3548 1 tar RET rename -1 errno 13 Permission denied I tracked this down to FUSE LOOKUP operation that do not set fuse_entry's attr.uid correctly (it is left set to 0). Here is the summary of my findings so far: - as un unprivilegied user, I create and delete files like crazy - most of the time everything is fine - sometime a LOOKUP for a file I created (as an unprivilegied user) will return a fuse_entry with uid set to 0, which cause the kernel to raise EACCESS when I try to delete the file. Here is an example of a FUSE trace, produced by the test case while [ 1 ] ; do cp /etc/fstab test/foo1 ; rm test/foo1 ; done > unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1) < unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1), error = -2 > unique = 1436, nodeid = 3098542296, opcode = CREATE (35) < unique = 1436, nodeid = 3098542296, opcode = CREATE (35), error = 0 > unique = 1437, nodeid = 3098542396, opcode = SETATTR (4) < unique = 1437, nodeid = 3098542396, opcode = SETATTR (4), error = 0 > unique = 1438, nodeid = 3098542396, opcode = WRITE (16) < unique = 1438, nodeid = 3098542396, opcode = WRITE (16), error = 0 > unique = 1439, nodeid = 3098542396, opcode = FSYNC (20) < unique = 1439, nodeid = 3098542396, opcode = FSYNC (20), error = 0 > unique = 1440, nodeid = 3098542396, opcode = RELEASE (18) < unique = 1440, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > unique = 1441, nodeid = 3098542396, opcode = GETATTR (3) < unique = 1441, nodeid = 3098542396, opcode = GETATTR (3), error = 0 > unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1) < unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 --> here I sometimes get fuse_entry's attr.uid incorrectly set to 0 --> When this happens, LOOKUP fails and returns EACCESS. > unique = 1443, nodeid = 3098542296, opcode = UNLINK (10) < unique = 1443, nodeid = 3098542296, opcode = UNLINK (10), error = 0 > unique = 1444, nodeid = 3098542396, opcode = FORGET (2) Is it possible that metadata writes are now so asynchronous that a subsequent lookup cannot retreive the up to date value? If that is the problem, how can I fix it? There is nothing telling the FUSE implementation that a CREATE or SETATTR has just partially completed and has metadata pending. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ian.latter at midnightcode.org Mon May 21 23:02:44 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 09:02:44 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205212302.q4LN2idg017478@singularity.tronunltd.com> > > 0) Is there any developer docco? I've just done > > another quick search and I can't see any. Let > > me know if there is and I'll try and answer the > > below myself. > > Your best bet right now (if I may say so) is the stuff I've posted on > hekafs.org - the "Translator 101" articles plus the API overview at > > http://hekafs.org/dist/xlator_api_2.html You must say so - there is so little docco. 
Actually before I posted I went and re-read your Translator 101 docs as you referred them to me on 10 May, but I hadn't found your API overview - thanks (for both)! > > 2) Is there a way to write linearly within a single > > function within Gluster (or is there a reason > > why I wouldn't want to do that)? > > Any blocking ops would have to be built on top of async ops plus semaphores > etc. because (unlike e.g. an HTTP server) the underlying sockets etc. are > shared/multiplexed between users and activities. Thus you'd get much more > context switching that way than if you stay within the async/continuation style. Interesting - I haven't ever done semaphore coding, but it may not be needed. The syncop framework that Krish referred too seems to do this via a mutex lock (synctask_yawn) and a context switch (synctask_yield). What's the drawback with increased context switching? After my email thread with Krish I decided against syncop, but the flow without was going to be horrific. The only way I could bring it back to anything even half as sane as the afr code (which can cleverly loop through its own _cbk's recursively - I like that, whoever put that together) was to have the last cbk in a chain (say the "close_cbk") call the original function with an index or stepper increment. But after sitting on the idea for a couple of days I actually came to the same conclusion as Manu did in the last message. I.e. without docco I have been writing to what seems to work, and in my 2009 code (I saw last night) a "mkdir" wind followed by "create" code in the same function - which I believe, now, is probably a race condition (because of the threaded/async structure forced through the wind/call macro model). In that case I *do* want a synchronous write - but only within my xlator (which, if I'm reading this right, *is* what syncop does) - as opposed to an end-to-end synchronous write (being sync'd through the full stack of xlators: ignoring caching, waiting for replication to be validated, etc). Although, the same synchronous outcome comes from the chained async calls ... but then we get back to the readability/ fixability of the code. > Some day in the distant future, I'd like to work some more on a preprocessor > that turns linear code into async code so that it's easier to write but retains > the performance and resource-efficiency advantages of an essentially async > style. I did some work (http://pl.atyp.us/ripper/UserGuide.html) in this area > several years ago, but it has probably bit-rotted to hell since then. With > more recent versions of gcc and LLVM it should be possible to overcome some of > the limitations that version had. Yes, I had a very similar thought - a C pre-processor isn't in my experience or time scale though; I considered writing up a script that would chain it out in C for me. I was going to borrow from a script that I wrote which builds one of the libMidnightCode header files but even that seemed impractical .. would anyone be able to debug it? Would I even understand in 2yrs from now - lol So I think the long and the short of it is that anything I do here won't be pretty .. or perhaps: one will look pretty and the other will run pretty :) -- Ian Latter Late night coder .. 
http://midnightcode.org/ From anand.avati at gmail.com Mon May 21 23:59:07 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 16:59:07 -0700 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> References: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> Message-ID: Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the chown() or chmod() syscall issued by the application strictly block till GlusterFS's fuse_setattr_cbk() is called? Avati On Mon, May 21, 2012 at 9:27 AM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > > > 3548 1 tar CALL rename(0xbb9010e0,0x8071584) > > 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" > > 3548 1 tar RET rename -1 errno 13 Permission denied > > I tracked this down to FUSE LOOKUP operation that do not set > fuse_entry's attr.uid correctly (it is left set to 0). > > Here is the summary of my findings so far: > - as un unprivilegied user, I create and delete files like crazy > - most of the time everything is fine > - sometime a LOOKUP for a file I created (as an unprivilegied user) will > return a fuse_entry with uid set to 0, which cause the kernel to raise > EACCESS when I try to delete the file. > > Here is an example of a FUSE trace, produced by the test case > while [ 1 ] ; do cp /etc/fstab test/foo1 ; rm test/foo1 ; done > > > unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1), error = -2 > > unique = 1436, nodeid = 3098542296, opcode = CREATE (35) > < unique = 1436, nodeid = 3098542296, opcode = CREATE (35), error = 0 > > unique = 1437, nodeid = 3098542396, opcode = SETATTR (4) > < unique = 1437, nodeid = 3098542396, opcode = SETATTR (4), error = 0 > > unique = 1438, nodeid = 3098542396, opcode = WRITE (16) > < unique = 1438, nodeid = 3098542396, opcode = WRITE (16), error = 0 > > unique = 1439, nodeid = 3098542396, opcode = FSYNC (20) > < unique = 1439, nodeid = 3098542396, opcode = FSYNC (20), error = 0 > > unique = 1440, nodeid = 3098542396, opcode = RELEASE (18) > < unique = 1440, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > > unique = 1441, nodeid = 3098542396, opcode = GETATTR (3) > < unique = 1441, nodeid = 3098542396, opcode = GETATTR (3), error = 0 > > unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 > > --> here I sometimes get fuse_entry's attr.uid incorrectly set to 0 > --> When this happens, LOOKUP fails and returns EACCESS. > > > unique = 1443, nodeid = 3098542296, opcode = UNLINK (10) > < unique = 1443, nodeid = 3098542296, opcode = UNLINK (10), error = 0 > > unique = 1444, nodeid = 3098542396, opcode = FORGET (2) > > > Is it possible that metadata writes are now so asynchronous that a > subsequent lookup cannot retreive the up to date value? If that is the > problem, how can I fix it? There is nothing telling the FUSE > implementation that a CREATE or SETATTR has just partially completed and > has metadata pending. > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From anand.avati at gmail.com Tue May 22 00:11:47 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 17:11:47 -0700 Subject: [Gluster-devel] A healing translator In-Reply-To: <4FA8E8AB.2040604@datalab.es> References: <4FA8E8AB.2040604@datalab.es> Message-ID: On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez wrote: > Hello developers, > > I would like to expose some ideas we are working on to create a new kind > of translator that should be able to unify and simplify to some extent the > healing procedures of complex translators. > > Currently, the only translator with complex healing capabilities that we > are aware of is AFR. We are developing another translator that will also > need healing capabilities, so we thought that it would be interesting to > create a new translator able to handle the common part of the healing > process and hence to simplify and avoid duplicated code in other > translators. > > The basic idea of the new translator is to handle healing tasks nearer the > storage translator on the server nodes instead to control everything from a > translator on the client nodes. Of course the heal translator is not able > to handle healing entirely by itself, it needs a client translator which > will coordinate all tasks. The heal translator is intended to be used by > translators that work with multiple subvolumes. > > I will try to explain how it works without entering into too much details. > > There is an important requisite for all client translators that use > healing: they must have exactly the same list of subvolumes and in the same > order. Currently, I think this is not a problem. > > The heal translator treats each file as an independent entity, and each > one can be in 3 modes: > > 1. Normal mode > > This is the normal mode for a copy or fragment of a file when it is > synchronized and consistent with the same file on other nodes (for example > with other replicas. It is the client translator who decides if it is > synchronized or not). > > 2. Healing mode > > This is the mode used when a client detects an inconsistency in the copy > or fragment of the file stored on this node and initiates the healing > procedures. > > 3. Provider mode (I don't like very much this name, though) > > This is the mode used by client translators when an inconsistency is > detected in this file, but the copy or fragment stored in this node is > considered good and it will be used as a source to repair the contents of > this file on other nodes. > > Initially, when a file is created, it is set in normal mode. Client > translators that make changes must guarantee that they send the > modification requests in the same order to all the servers. This should be > done using inodelk/entrylk. > > When a change is sent to a server, the client must include a bitmap mask > of the clients to which the request is being sent. Normally this is a > bitmap containing all the clients, however, when a server fails for some > reason some bits will be cleared. The heal translator uses this bitmap to > early detect failures on other nodes from the point of view of each client. > When this condition is detected, the request is aborted with an error and > the client is notified with the remaining list of valid nodes. If the > client considers the request can be successfully server with the remaining > list of nodes, it can resend the request with the updated bitmap. 
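(A toy illustration of the per-request bitmap described in the paragraph above; the names and 32-bit width are invented for the example and are not taken from the proposal:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int      child_count = 4;
        uint32_t sent_to     = (UINT32_C(1) << child_count) - 1; /* request sent to all children */
        uint32_t still_alive = sent_to & ~(UINT32_C(1) << 2);    /* child 2 went down meanwhile */

        if (still_alive != sent_to) {
                /* views disagree: abort, report the surviving set, and let the
                 * client decide whether to resend with the updated bitmap */
                printf("abort, resend with bitmap 0x%x\n", still_alive);
        }
        return 0;
}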
> > The heal translator also updates two file attributes for each change > request to mantain the "version" of the data and metadata contents of the > file. A similar task is currently made by AFR using xattrop. This would not > be needed anymore, speeding write requests. > > The version of data and metadata is returned to the client for each read > request, allowing it to detect inconsistent data. > > When a client detects an inconsistency, it initiates healing. First of > all, it must lock the entry and inode (when necessary). Then, from the data > collected from each node, it must decide which nodes have good data and > which ones have bad data and hence need to be healed. There are two > possible cases: > > 1. File is not a regular file > > In this case the reconstruction is very fast and requires few requests, so > it is done while the file is locked. In this case, the heal translator does > nothing relevant. > > 2. File is a regular file > > For regular files, the first step is to synchronize the metadata to the > bad nodes, including the version information. Once this is done, the file > is set in healing mode on bad nodes, and provider mode on good nodes. Then > the entry and inode are unlocked. > > When a file is in provider mode, it works as in normal mode, but refuses > to start another healing. Only one client can be healing a file. > > When a file is in healing mode, each normal write request from any client > are handled as if the file were in normal mode, updating the version > information and detecting possible inconsistencies with the bitmap. > Additionally, the healing translator marks the written region of the file > as "good". > > Each write request from the healing client intended to repair the file > must be marked with a special flag. In this case, the area that wants to be > written is filtered by the list of "good" ranges (if there are any > intersection with a good range, it is removed from the request). The > resulting set of ranges are propagated to the lower translator and added to > the list of "good" ranges but the version information is not updated. > > Read requests are only served if the range requested is entirely contained > into the "good" regions list. > > There are some additional details, but I think this is enough to have a > general idea of its purpose and how it works. > > The main advantages of this translator are: > > 1. Avoid duplicated code in client translators > 2. Simplify and unify healing methods in client translators > 3. xattrop is not needed anymore in client translators to keep track of > changes > 4. Full file contents are repaired without locking the file > 5. Better detection and prevention of some split brain situations as soon > as possible > > I think it would be very useful. It seems to me that it works correctly in > all situations, however I don't have all the experience that other > developers have with the healing functions of AFR, so I will be happy to > answer any question or suggestion to solve problems it may have or to > improve it. > > What do you think about it ? > > The goals you state above are all valid. What would really help (adoption) is if you can implement this as a modification of AFR by utilizing all the work already done, and you get brownie points if it is backward compatible with existing AFR. If you already have any code in a publishable state, please share it with us (github link?). Avati -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ian.latter at midnightcode.org Tue May 22 00:40:03 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 10:40:03 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> Actually, while we're at this level I'd like to bolt on another thought / query - these were my words; > But after sitting on the idea for a couple of days I actually came > to the same conclusion as Manu did in the last message. I.e. > without docco I have been writing to what seems to work, and > in my 2009 code (I saw last night) a "mkdir" wind followed by "create" > code in the same function - which I believe, now, is probably a > race condition (because of the threaded/async structure forced > through the wind/call macro model). But they include an assumption. The query is: are async writes and reads sequential? The two specific cases are; 1) Are all reads that are initiated in time after a write guaranteed to occur after that write has taken affect? 2) Are all writes that are initiated in time after a write (x) guaranteed to occur after that write (x) has taken affect? I could also appreciate that there may be a difference between the top/user layer view and the xlator internals .. if there is then can you please include that view in the explanation? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Tue May 22 01:27:41 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 18:27:41 -0700 Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> References: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> Message-ID: On Mon, May 21, 2012 at 5:40 PM, Ian Latter wrote: > > But they include an assumption. > > The query is: are async writes and reads sequential? The > two specific cases are; > > 1) Are all reads that are initiated in time after a write > guaranteed to occur after that write has taken affect? > Yes > > 2) Are all writes that are initiated in time after a write (x) > guaranteed to occur after that write (x) has taken > affect? > Only overlapping offsets/regions retain causal ordering of completion. It is write-behind which acknowledges writes pre-maturely and therefore the layer which must maintain the 'effects' for further reads and writes by making the dependent IOs (overlapping offset/regions) wait for previous write's actual completion. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 05:33:37 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 07:33:37 +0200 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: Message-ID: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Anand Avati wrote: > Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the > chown() or chmod() syscall issued by the application strictly block till > GlusterFS's fuse_setattr_cbk() is called? I have been able to narrow the test down to the code below, which does not even call chown(). 
#include #include #include #include #include #include int main(void) { int fd; (void)mkdir("subdir", 0755); do { if ((fd = open("subdir/bugc1.txt", O_CREAT|O_RDWR, 0644)) == -1) err(EX_OSERR, "open failed"); if (close(fd) == -1) err(EX_OSERR, "close failed"); if (unlink("subdir/bugc1.txt") == -1) err(EX_OSERR, "unlink failed"); } while (1 /*CONSTCOND */); /* NOTREACHED */ return EX_OK; } It produces a FUSE trace without SETATTR: > unique = 393, nodeid = 3098542496, opcode = LOOKUP (1) < unique = 393, nodeid = 3098542496, opcode = LOOKUP (1), error = -2 > unique = 394, nodeid = 3098542496, opcode = CREATE (35) < unique = 394, nodeid = 3098542496, opcode = CREATE (35), error = 0 -> I suspect (not yet checked) this is the place where I get fuse_entry_out with attr.uid = 0. This will be cached since attr_valid tells us to do so. > unique = 395, nodeid = 3098542396, opcode = RELEASE (18) < unique = 395, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > unique = 396, nodeid = 3098542296, opcode = LOOKUP (1) < unique = 396, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 >From other traces, I can tell that this last lookup is for the parent directory (subdir). The FUSE request for looking up bugc1.txt with the intent of deleting is not even sent: from cached uid we obtained from fuse_entry_out, we know that permissions shall be denied (I had a debug printf to check that). We do not even ask. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Tue May 22 05:44:30 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 22:44:30 -0700 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: On Mon, May 21, 2012 at 10:33 PM, Emmanuel Dreyfus wrote: > Anand Avati wrote: > > > Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the > > chown() or chmod() syscall issued by the application strictly block till > > GlusterFS's fuse_setattr_cbk() is called? > > I have been able to narrow the test down to the code below, which does not > even > call chown(). > > #include > #include > #include > #include > #include > #include > > int > main(void) > { > int fd; > > (void)mkdir("subdir", 0755); > > do { > if ((fd = open("subdir/bugc1.txt", O_CREAT|O_RDWR, 0644)) > == -1) > err(EX_OSERR, "open failed"); > > if (close(fd) == -1) > err(EX_OSERR, "close failed"); > > if (unlink("subdir/bugc1.txt") == -1) > err(EX_OSERR, "unlink failed"); > } while (1 /*CONSTCOND */); > > /* NOTREACHED */ > return EX_OK; > } > > It produces a FUSE trace without SETATTR: > > > unique = 393, nodeid = 3098542496, opcode = LOOKUP (1) > < unique = 393, nodeid = 3098542496, opcode = LOOKUP (1), error = -2 > > unique = 394, nodeid = 3098542496, opcode = CREATE (35) > < unique = 394, nodeid = 3098542496, opcode = CREATE (35), error = 0 > > -> I suspect (not yet checked) this is the place where I get > fuse_entry_out > with attr.uid = 0. This will be cached since attr_valid tells us to do so. > > > unique = 395, nodeid = 3098542396, opcode = RELEASE (18) > < unique = 395, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > > unique = 396, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 396, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 > > From other traces, I can tell that this last lookup is for the parent > directory > (subdir). 
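A note for anyone trying to reproduce this: the list archive stripped the angle-bracketed header names from the test program quoted above, which is why its #include lines appear empty. Below is a compilable reconstruction; the program body is unchanged, but the exact header set is an assumption (chosen so that mkdir(), open(), close(), unlink(), err() and the EX_* constants all resolve).

/* Reconstruction of the test program above.  The six header names are
 * assumed; the archive dropped everything between '<' and '>'. */
#include <sys/types.h>
#include <sys/stat.h>   /* mkdir() */
#include <err.h>        /* err() */
#include <fcntl.h>      /* open(), O_CREAT, O_RDWR */
#include <sysexits.h>   /* EX_OSERR, EX_OK */
#include <unistd.h>     /* close(), unlink() */

int
main(void)
{
        int fd;

        (void)mkdir("subdir", 0755);

        do {
                if ((fd = open("subdir/bugc1.txt", O_CREAT|O_RDWR, 0644)) == -1)
                        err(EX_OSERR, "open failed");

                if (close(fd) == -1)
                        err(EX_OSERR, "close failed");

                if (unlink("subdir/bugc1.txt") == -1)
                        err(EX_OSERR, "unlink failed");
        } while (1 /*CONSTCOND */);

        /* NOTREACHED */
        return EX_OK;
}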
The FUSE request for looking up bugc1.txt with the intent of > deleting > is not even sent: from cached uid we obtained from fuse_entry_out, we know > that > permissions shall be denied (I had a debug printf to check that). We do > not even > ask. > > Even in the case where bugc1.txt got a wrong uid returned (assuming so), it should not influence the permissibility of it getting deleted. The deletability of a file is based on the permissions on the parent directory and not the ownership of the file (unless +t sticky bit was set on the directory). Is there a way you can extend the trace code above to show the UIDs getting returned? Maybe it was the parent directory (subdir) that got a wrong UID returned? Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From aavati at redhat.com Tue May 22 07:11:36 2012 From: aavati at redhat.com (Anand Avati) Date: Tue, 22 May 2012 00:11:36 -0700 Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <7a290bd4-c833-4a35-af04-adb0052f6ff2@zmail01.collab.prod.int.phx2.redhat.com> References: <7a290bd4-c833-4a35-af04-adb0052f6ff2@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FBB3C28.2020106@redhat.com> The PARENT_DOWN_HANDLED approach will take us backwards from the current state where we are resiliant to frame losses and other class of bugs (i.e, if a frame loss happens on either server or client, it only results in prevented graph cleanup but the graph switch still happens). The root "cause" here is that we are giving up on a very important and fundamental principle of immutability on the fd object. The real solution here is to never modify fd->inode. Instead we must bring about a more native fd "migration" than just re-opening an existing fd on the new graph. Think of the inode migration analogy. The handle coming from FUSE (the address of the object) is a "hint". Usually the hint is right, if the object in the address belongs to the latest graph. If not, using the GFID we resolve a new inode on the latest graph and use it. In case of FD we can do something similar, except there are not GFIDs (which should not be a problem). We need to make the handle coming from FUSE (the address of fd_t) just a hint. If the fd->inode->table->xl->graph is the latest, then the hint was a HIT. If the graph was not the latest, we look for a previous migration attempt+result in the "base" (original) fd's context. If that does not exist or is not fresh (on the latest graph) then we do a new fd creation, open on new graph, fd_unref the old cached result in the fd context of the "base fd" and keep ref to this new result. All this must happen from fuse_resolve_fd(). The setting of the latest fd and updation of the latest fd pointer happens under the scope of the base_fd->lock() which gives it a very clear and unambiguous scope which was missing with the old scheme. [The next step will be to nuke the fd->inode swapping in fuse_create_cbk] Avati On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote: > > > ----- Original Message ----- >> From: "Pranith Kumar Karampuri" >> To: "Anand Avati" >> Cc: "Vijay Bellur", "Amar Tumballi", "Krishnan Parthasarathi" >> , "Raghavendra Gowdappa" >> Sent: Tuesday, May 22, 2012 8:42:58 AM >> Subject: Re: RFC on fix to bug #802414 >> >> Dude, >> We have already put logs yesterday in LOCK and UNLOCK and saw >> that the&fd->inode->lock address changed from LOCK to UNLOCK. > > Yes, even I too believe that the hang is because of fd->inode swap in fuse_migrate_fd and not the one in fuse_create_cbk. 
We could clearly see in the log files following race: > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this was a naive fix - hold lock on inode in old graph - to the race-condition caused by swapping fd->inode, which didn't work) > > poll-thr: tries to acquire fd->inode->lock (inode is old_inode present in old-graph) in afr_local_cleanup > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock > poll-thr: gets woken up from lock call on old_inode->lock. > poll-thr: does its work, but while unlocking, uses fd->inode where inode belongs to new graph. > > we had logs printing lock address before and after acquisition of lock and we could clearly see that lock address changed after acquiring lock in afr_local_cleanup. > >> >>>> "The hang in fuse_migrate_fd is _before_ the inode swap performed >>>> there." >> All the fds are opened on the same file. So all fds in the fd >> migration point to same inode. The race is hit by nth fd, (n+1)th fd >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and >> LOCK(fd->inode->lock) was done with one address then by the time >> UNLOCK(fd->inode->lock) is done the address changed. So the next fd >> that has to migrate hung because the prev inode lock is not >> unlocked. >> >> If after nth fd introduces the race a _cbk comes in epoll thread on >> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will >> hang. >> Which is my theory for the hang we observed on Saturday. >> >> Pranith. >> ----- Original Message ----- >> From: "Anand Avati" >> To: "Raghavendra Gowdappa" >> Cc: "Vijay Bellur", "Amar Tumballi" >> , "Krishnan Parthasarathi" >> , "Pranith Kumar Karampuri" >> >> Sent: Tuesday, May 22, 2012 2:09:33 AM >> Subject: Re: RFC on fix to bug #802414 >> >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote: >>> Avati, >>> >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new >>> inode to fd, once it looks up inode in new graph. But this >>> assignment can race with code that accesses fd->inode->lock >>> executing in poll-thread (pthr) as follows >>> >>> pthr: LOCK (fd->inode->lock); (inode in old graph) >>> rdthr: fd->inode = inode (resolved in new graph) >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph) >>> >> >> The way I see it (the backtrace output in the other mail), the swap >> happening in fuse_create_cbk() must be the one causing lock/unlock to >> land on different inode objects. The hang in fuse_migrate_fd is >> _before_ >> the inode swap performed there. Can you put some logs in >> fuse_create_cbk()'s inode swap code and confirm this? >> >> >>> Now, any lock operations on inode in old graph will block. Thanks >>> to pranith for pointing to this race-condition. >>> >>> The problem here is we don't have a single lock that can >>> synchronize assignment "fd->inode = inode" and other locking >>> attempts on fd->inode->lock. So, we are thinking that instead of >>> trying to synchronize, eliminate the parallel accesses altogether. >>> This can be done by splitting fd migration into two tasks. >>> >>> 1. Actions on old graph (like fsync to flush writes to disk) >>> 2. Actions in new graph (lookup, open) >>> >>> We can send PARENT_DOWN when, >>> 1. Task 1 is complete. >>> 2. No fop sent by fuse is pending. >>> >>> on receiving PARENT_DOWN, protocol/client will shutdown transports. >>> As part of transport cleanup, all pending frames are unwound and >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED >>> event. 
Each of the translator will pass this event to its parents >>> once it is convinced that there are no pending fops started by it >>> (like background self-heal, reads as part of read-ahead etc). Once >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there >>> will be no replies that will be racing with migration (note that >>> migration is done using syncops). At this point in time, it is >>> safe to start Task 2 (which associates fd with an inode in new >>> graph). >>> >>> Also note that reader thread will not do other operations till it >>> completes both tasks. >>> >>> As far as the implementation of this patch goes, major work is in >>> translators like read-ahead, afr, dht to provide the guarantee >>> required to send PARENT_DOWN_HANDLED event to their parents. >>> >>> Please let me know your thoughts on this. >>> >> >> All the above steps might not apply if it is caused by the swap in >> fuse_create_cbk(). Let's confirm that first. >> >> Avati >> From ian.latter at midnightcode.org Tue May 22 07:18:08 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 17:18:08 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205220718.q4M7I8sJ019827@singularity.tronunltd.com> > > But they include an assumption. > > > > The query is: are async writes and reads sequential? The > > two specific cases are; > > > > 1) Are all reads that are initiated in time after a write > > guaranteed to occur after that write has taken affect? > > > > Yes > Excellent. > > > > 2) Are all writes that are initiated in time after a write (x) > > guaranteed to occur after that write (x) has taken > > affect? > > > > Only overlapping offsets/regions retain causal ordering of completion. It > is write-behind which acknowledges writes pre-maturely and therefore the > layer which must maintain the 'effects' for further reads and writes by > making the dependent IOs (overlapping offset/regions) wait for previous > write's actual completion. > Ok, that should do the trick. Let me mull over this for a while .. Thanks for that info. > Avati > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Tue May 22 07:44:25 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 22 May 2012 09:44:25 +0200 Subject: [Gluster-devel] A healing translator In-Reply-To: References: <4FA8E8AB.2040604@datalab.es> Message-ID: <4FBB43D9.9070605@datalab.es> On 05/22/2012 02:11 AM, Anand Avati wrote: > > > On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez > > wrote: > > Hello developers, > > I would like to expose some ideas we are working on to create a > new kind of translator that should be able to unify and simplify > to some extent the healing procedures of complex translators. > > Currently, the only translator with complex healing capabilities > that we are aware of is AFR. We are developing another translator > that will also need healing capabilities, so we thought that it > would be interesting to create a new translator able to handle the > common part of the healing process and hence to simplify and avoid > duplicated code in other translators. > > The basic idea of the new translator is to handle healing tasks > nearer the storage translator on the server nodes instead to > control everything from a translator on the client nodes. Of > course the heal translator is not able to handle healing entirely > by itself, it needs a client translator which will coordinate all > tasks. 
The heal translator is intended to be used by translators > that work with multiple subvolumes. > > I will try to explain how it works without entering into too much > details. > > There is an important requisite for all client translators that > use healing: they must have exactly the same list of subvolumes > and in the same order. Currently, I think this is not a problem. > > The heal translator treats each file as an independent entity, and > each one can be in 3 modes: > > 1. Normal mode > > This is the normal mode for a copy or fragment of a file when > it is synchronized and consistent with the same file on other > nodes (for example with other replicas. It is the client > translator who decides if it is synchronized or not). > > 2. Healing mode > > This is the mode used when a client detects an inconsistency > in the copy or fragment of the file stored on this node and > initiates the healing procedures. > > 3. Provider mode (I don't like very much this name, though) > > This is the mode used by client translators when an > inconsistency is detected in this file, but the copy or > fragment stored in this node is considered good and it will be > used as a source to repair the contents of this file on other > nodes. > > Initially, when a file is created, it is set in normal mode. > Client translators that make changes must guarantee that they send > the modification requests in the same order to all the servers. > This should be done using inodelk/entrylk. > > When a change is sent to a server, the client must include a > bitmap mask of the clients to which the request is being sent. > Normally this is a bitmap containing all the clients, however, > when a server fails for some reason some bits will be cleared. The > heal translator uses this bitmap to early detect failures on other > nodes from the point of view of each client. When this condition > is detected, the request is aborted with an error and the client > is notified with the remaining list of valid nodes. If the client > considers the request can be successfully server with the > remaining list of nodes, it can resend the request with the > updated bitmap. > > The heal translator also updates two file attributes for each > change request to mantain the "version" of the data and metadata > contents of the file. A similar task is currently made by AFR > using xattrop. This would not be needed anymore, speeding write > requests. > > The version of data and metadata is returned to the client for > each read request, allowing it to detect inconsistent data. > > When a client detects an inconsistency, it initiates healing. > First of all, it must lock the entry and inode (when necessary). > Then, from the data collected from each node, it must decide which > nodes have good data and which ones have bad data and hence need > to be healed. There are two possible cases: > > 1. File is not a regular file > > In this case the reconstruction is very fast and requires few > requests, so it is done while the file is locked. In this > case, the heal translator does nothing relevant. > > 2. File is a regular file > > For regular files, the first step is to synchronize the > metadata to the bad nodes, including the version information. > Once this is done, the file is set in healing mode on bad > nodes, and provider mode on good nodes. Then the entry and > inode are unlocked. > > When a file is in provider mode, it works as in normal mode, but > refuses to start another healing. Only one client can be healing a > file. 
> > When a file is in healing mode, each normal write request from any > client are handled as if the file were in normal mode, updating > the version information and detecting possible inconsistencies > with the bitmap. Additionally, the healing translator marks the > written region of the file as "good". > > Each write request from the healing client intended to repair the > file must be marked with a special flag. In this case, the area > that wants to be written is filtered by the list of "good" ranges > (if there are any intersection with a good range, it is removed > from the request). The resulting set of ranges are propagated to > the lower translator and added to the list of "good" ranges but > the version information is not updated. > > Read requests are only served if the range requested is entirely > contained into the "good" regions list. > > There are some additional details, but I think this is enough to > have a general idea of its purpose and how it works. > > The main advantages of this translator are: > > 1. Avoid duplicated code in client translators > 2. Simplify and unify healing methods in client translators > 3. xattrop is not needed anymore in client translators to keep > track of changes > 4. Full file contents are repaired without locking the file > 5. Better detection and prevention of some split brain situations > as soon as possible > > I think it would be very useful. It seems to me that it works > correctly in all situations, however I don't have all the > experience that other developers have with the healing functions > of AFR, so I will be happy to answer any question or suggestion to > solve problems it may have or to improve it. > > What do you think about it ? > > > The goals you state above are all valid. What would really help > (adoption) is if you can implement this as a modification of AFR by > utilizing all the work already done, and you get brownie points if it > is backward compatible with existing AFR. If you already have any code > in a publishable state, please share it with us (github link?). > > Avati I've tried to understand how AFR works and, in some way, some of the ideas have been taken from it. However it is very complex and a lot of changes have been carried out in the master branch over the latest months. It's hard for me to follow them while actively working on my translator. Nevertheless, the main reason to take a separate path was that AFR is strongly bound to replication (at least from what I saw when I analyzed it more deeply. Maybe things have changed now, but haven't had time to review them). The requirements for my translator didn't fit very well with AFR, and the needed effort to understand and modify it to adapt it was too high. It also seems that there isn't any detailed developer info about internals of AFR that could have helped to be more confident to modify it (at least I haven't found it). I'm currenty working on it, but it's not ready yet. As soon as it is in a minimally stable state we will publish it, probably on github. I'll write the url to this list. Thank you -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From anand.avati at gmail.com Tue May 22 07:48:43 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 22 May 2012 00:48:43 -0700 Subject: [Gluster-devel] A healing translator In-Reply-To: <4FBB43D9.9070605@datalab.es> References: <4FA8E8AB.2040604@datalab.es> <4FBB43D9.9070605@datalab.es> Message-ID: > > > I've tried to understand how AFR works and, in some way, some of the > ideas have been taken from it. However it is very complex and a lot of > changes have been carried out in the master branch over the latest months. > It's hard for me to follow them while actively working on my translator. > Nevertheless, the main reason to take a separate path was that AFR is > strongly bound to replication (at least from what I saw when I analyzed it > more deeply. Maybe things have changed now, but haven't had time to review > them). > Have you reviewed the proactive self-heal daemon (+ changelog indexing translator) which is a potential functional replacement for what you might be attempting? Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 08:16:06 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 08:16:06 +0000 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: <20120522081606.GA3976@homeworld.netbsd.org> On Mon, May 21, 2012 at 10:44:30PM -0700, Anand Avati wrote: > Even in the case where bugc1.txt got a wrong uid returned (assuming so), it > should not influence the permissibility of it getting deleted. The > deletability of a file is based on the permissions on the parent directory > and not the ownership of the file (unless +t sticky bit was set on the > directory). This is interesting: I get the behavior you describe on Linux (ext2fs), but NetBSD (FFS) hehaves differently (these are native test, without glusterfs). Is it a grey area in standards? $ ls -la test/ total 16 drwxr-xr-x 2 root wheel 512 May 22 10:10 . drwxr-xr-x 19 manu wheel 5632 May 22 10:10 .. -rw-r--r-- 1 manu wheel 0 May 22 10:10 toto $ whoami manu $ rm -f test/toto rm: test/toto: Permission denied $ uname -sr NetBSD 5.1_STABLE -- Emmanuel Dreyfus manu at netbsd.org From rgowdapp at redhat.com Tue May 22 08:44:00 2012 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Tue, 22 May 2012 04:44:00 -0400 (EDT) Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <4FBB3C28.2020106@redhat.com> Message-ID: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > From: "Anand Avati" > To: "Raghavendra Gowdappa" > Cc: "Pranith Kumar Karampuri" , "Vijay Bellur" , "Amar Tumballi" > , "Krishnan Parthasarathi" , gluster-devel at nongnu.org > Sent: Tuesday, May 22, 2012 12:41:36 PM > Subject: Re: RFC on fix to bug #802414 > > > > The PARENT_DOWN_HANDLED approach will take us backwards from the > current > state where we are resiliant to frame losses and other class of bugs > (i.e, if a frame loss happens on either server or client, it only > results in prevented graph cleanup but the graph switch still > happens). > > The root "cause" here is that we are giving up on a very important > and > fundamental principle of immutability on the fd object. The real > solution here is to never modify fd->inode. Instead we must bring > about > a more native fd "migration" than just re-opening an existing fd on > the > new graph. > > Think of the inode migration analogy. 
The handle coming from FUSE > (the > address of the object) is a "hint". Usually the hint is right, if the > object in the address belongs to the latest graph. If not, using the > GFID we resolve a new inode on the latest graph and use it. > > In case of FD we can do something similar, except there are not GFIDs > (which should not be a problem). We need to make the handle coming > from > FUSE (the address of fd_t) just a hint. If the > fd->inode->table->xl->graph is the latest, then the hint was a HIT. > If > the graph was not the latest, we look for a previous migration > attempt+result in the "base" (original) fd's context. If that does > not > exist or is not fresh (on the latest graph) then we do a new fd > creation, open on new graph, fd_unref the old cached result in the fd > context of the "base fd" and keep ref to this new result. All this > must > happen from fuse_resolve_fd(). The setting of the latest fd and > updation > of the latest fd pointer happens under the scope of the > base_fd->lock() > which gives it a very clear and unambiguous scope which was missing > with > the old scheme. I remember discussing this solution during initial design. But, not sure why we dropped it. So, Can I go ahead with the implementation? Is this fix required post 3.3? > > [The next step will be to nuke the fd->inode swapping in > fuse_create_cbk] > > Avati > > On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote: > > > > > > ----- Original Message ----- > >> From: "Pranith Kumar Karampuri" > >> To: "Anand Avati" > >> Cc: "Vijay Bellur", "Amar > >> Tumballi", "Krishnan Parthasarathi" > >> , "Raghavendra Gowdappa" > >> Sent: Tuesday, May 22, 2012 8:42:58 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> Dude, > >> We have already put logs yesterday in LOCK and UNLOCK and saw > >> that the&fd->inode->lock address changed from LOCK to UNLOCK. > > > > Yes, even I too believe that the hang is because of fd->inode swap > > in fuse_migrate_fd and not the one in fuse_create_cbk. We could > > clearly see in the log files following race: > > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this > > was a naive fix - hold lock on inode in old graph - to the > > race-condition caused by swapping fd->inode, which didn't work) > > > > poll-thr: tries to acquire fd->inode->lock (inode is old_inode > > present in old-graph) in afr_local_cleanup > > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock > > poll-thr: gets woken up from lock call on old_inode->lock. > > poll-thr: does its work, but while unlocking, uses fd->inode where > > inode belongs to new graph. > > > > we had logs printing lock address before and after acquisition of > > lock and we could clearly see that lock address changed after > > acquiring lock in afr_local_cleanup. > > > >> > >>>> "The hang in fuse_migrate_fd is _before_ the inode swap > >>>> performed > >>>> there." > >> All the fds are opened on the same file. So all fds in the fd > >> migration point to same inode. The race is hit by nth fd, (n+1)th > >> fd > >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and > >> LOCK(fd->inode->lock) was done with one address then by the time > >> UNLOCK(fd->inode->lock) is done the address changed. So the next > >> fd > >> that has to migrate hung because the prev inode lock is not > >> unlocked. > >> > >> If after nth fd introduces the race a _cbk comes in epoll thread > >> on > >> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will > >> hang. 
> >> Which is my theory for the hang we observed on Saturday. > >> > >> Pranith. > >> ----- Original Message ----- > >> From: "Anand Avati" > >> To: "Raghavendra Gowdappa" > >> Cc: "Vijay Bellur", "Amar Tumballi" > >> , "Krishnan Parthasarathi" > >> , "Pranith Kumar Karampuri" > >> > >> Sent: Tuesday, May 22, 2012 2:09:33 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote: > >>> Avati, > >>> > >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new > >>> inode to fd, once it looks up inode in new graph. But this > >>> assignment can race with code that accesses fd->inode->lock > >>> executing in poll-thread (pthr) as follows > >>> > >>> pthr: LOCK (fd->inode->lock); (inode in old graph) > >>> rdthr: fd->inode = inode (resolved in new graph) > >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph) > >>> > >> > >> The way I see it (the backtrace output in the other mail), the > >> swap > >> happening in fuse_create_cbk() must be the one causing lock/unlock > >> to > >> land on different inode objects. The hang in fuse_migrate_fd is > >> _before_ > >> the inode swap performed there. Can you put some logs in > >> fuse_create_cbk()'s inode swap code and confirm this? > >> > >> > >>> Now, any lock operations on inode in old graph will block. Thanks > >>> to pranith for pointing to this race-condition. > >>> > >>> The problem here is we don't have a single lock that can > >>> synchronize assignment "fd->inode = inode" and other locking > >>> attempts on fd->inode->lock. So, we are thinking that instead of > >>> trying to synchronize, eliminate the parallel accesses > >>> altogether. > >>> This can be done by splitting fd migration into two tasks. > >>> > >>> 1. Actions on old graph (like fsync to flush writes to disk) > >>> 2. Actions in new graph (lookup, open) > >>> > >>> We can send PARENT_DOWN when, > >>> 1. Task 1 is complete. > >>> 2. No fop sent by fuse is pending. > >>> > >>> on receiving PARENT_DOWN, protocol/client will shutdown > >>> transports. > >>> As part of transport cleanup, all pending frames are unwound and > >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED > >>> event. Each of the translator will pass this event to its parents > >>> once it is convinced that there are no pending fops started by it > >>> (like background self-heal, reads as part of read-ahead etc). > >>> Once > >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there > >>> will be no replies that will be racing with migration (note that > >>> migration is done using syncops). At this point in time, it is > >>> safe to start Task 2 (which associates fd with an inode in new > >>> graph). > >>> > >>> Also note that reader thread will not do other operations till it > >>> completes both tasks. > >>> > >>> As far as the implementation of this patch goes, major work is in > >>> translators like read-ahead, afr, dht to provide the guarantee > >>> required to send PARENT_DOWN_HANDLED event to their parents. > >>> > >>> Please let me know your thoughts on this. > >>> > >> > >> All the above steps might not apply if it is caused by the swap in > >> fuse_create_cbk(). Let's confirm that first. 
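The race being debated here reduces to a very small pattern: a lock is taken through fd->inode, the pointer is swapped, and the unlock then lands on a different object, leaving the original one locked. The sketch below is a stand-alone toy model using plain pthreads (no GlusterFS structures; the two thread roles are collapsed into one thread and all names are invented for illustration).

/* Toy model of the lock/swap/unlock mismatch discussed above.
 * Error-checking mutexes make the broken unlock report EPERM instead
 * of being undefined behaviour. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct obj {
        pthread_mutex_t lock;
};

struct holder {                 /* plays the role of the fd             */
        struct obj *inode;      /* plays the role of fd->inode          */
};

int
main(void)
{
        pthread_mutexattr_t attr;
        struct obj old_inode, new_inode;
        struct holder fd;
        int ret;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&old_inode.lock, &attr);
        pthread_mutex_init(&new_inode.lock, &attr);
        fd.inode = &old_inode;

        /* "poll thread": LOCK through the pointer hits old_inode        */
        pthread_mutex_lock(&fd.inode->lock);

        /* "migration thread": swaps the pointer before the unlock       */
        fd.inode = &new_inode;

        /* "poll thread": UNLOCK through the pointer now hits new_inode  */
        ret = pthread_mutex_unlock(&fd.inode->lock);
        printf("unlock of new inode: %s\n", strerror(ret)); /* EPERM     */

        /* old_inode.lock is still held, so the next locker would hang,
         * which is the symptom reported in this thread. */
        ret = pthread_mutex_trylock(&old_inode.lock);
        printf("trylock of old inode: %s\n", strerror(ret)); /* EBUSY    */

        return 0;
}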
> >> > >> Avati > >> > > From xhernandez at datalab.es Tue May 22 08:51:22 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 22 May 2012 10:51:22 +0200 Subject: [Gluster-devel] A healing translator In-Reply-To: References: <4FA8E8AB.2040604@datalab.es> <4FBB43D9.9070605@datalab.es> Message-ID: <4FBB538A.70201@datalab.es> On 05/22/2012 09:48 AM, Anand Avati wrote: > >> > I've tried to understand how AFR works and, in some way, some of > the ideas have been taken from it. However it is very complex and > a lot of changes have been carried out in the master branch over > the latest months. It's hard for me to follow them while actively > working on my translator. Nevertheless, the main reason to take a > separate path was that AFR is strongly bound to replication (at > least from what I saw when I analyzed it more deeply. Maybe things > have changed now, but haven't had time to review them). > > > Have you reviewed the proactive self-heal daemon (+ changelog indexing > translator) which is a potential functional replacement for what you > might be attempting? > > Avati I must admit that I've read something about it but I haven't had time to explore it in detail. If I understand it correctly, the self-heal daemon works as a client process but can be executed on server nodes. I suppose that multiple self-heal daemons can be running on different nodes. Then, each daemon detects invalid files (not sure exactly how) and replicates the changes from one good node to the bad nodes. The problem is that in the translator I'm working on, the information is dispersed among multiple nodes, so there isn't a single server node that contains the whole data. To repair a node, data must be read from at least two other nodes (it depends on configuration). From what I've read from AFR and the self-healing daemon, it's not straightforward to adapt them to this mechanism because they would need to know a subset of nodes with consistent data, not only one. Each daemon would have to contact all other nodes, read data from each one, determine which ones are valid, rebuild the data and send it to the bad nodes. This means that the daemon will have to be as complex as the clients. My impression (but I may be wrong) is that AFR and the self-healing daemon are closely bound to the replication schema, so it is very hard to try to use them for other purposes. The healing translator I'm writing tries to offer generic server side helpers for the healing process, but it is the client side who really manages the healing operation (though heavily simplified) and could use it to replicate data, to disperse data, or some other schema. Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 09:08:48 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 09:08:48 +0000 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: <20120522090848.GC3976@homeworld.netbsd.org> On Mon, May 21, 2012 at 10:44:30PM -0700, Anand Avati wrote: > Is there a way you can extend the trace code above to show the UIDs getting > returned? Maybe it was the parent directory (subdir) that got a wrong UID > returned? Further investigation shows you are right. 
I traced the struct fuse_entry_out returned by glusterfs on LOOKUP; "/subdir", uid = 500, gid = 500, mode = 040755, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 bugc1.txt is looked up many times as I loop creating/deleting it subdir is not looked up often since it is cached for 1 second. New subdir lookups will return correct uid/gid/mode. After some time, though, it will return incorrect information: "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir", uid = 0, gid = 0, mode = 040700, attr_valid = 1 -- Emmanuel Dreyfus manu at netbsd.org From aavati at redhat.com Tue May 22 17:47:49 2012 From: aavati at redhat.com (Anand Avati) Date: Tue, 22 May 2012 10:47:49 -0700 Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> References: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FBBD145.3030303@redhat.com> On 05/22/2012 01:44 AM, Raghavendra Gowdappa wrote: > > > ----- Original Message ----- >> From: "Anand Avati" >> To: "Raghavendra Gowdappa" >> Cc: "Pranith Kumar Karampuri", "Vijay Bellur", "Amar Tumballi" >> , "Krishnan Parthasarathi", gluster-devel at nongnu.org >> Sent: Tuesday, May 22, 2012 12:41:36 PM >> Subject: Re: RFC on fix to bug #802414 >> >> >> >> The PARENT_DOWN_HANDLED approach will take us backwards from the >> current >> state where we are resiliant to frame losses and other class of bugs >> (i.e, if a frame loss happens on either server or client, it only >> results in prevented graph cleanup but the graph switch still >> happens). >> >> The root "cause" here is that we are giving up on a very important >> and >> fundamental principle of immutability on the fd object. The real >> solution here is to never modify fd->inode. Instead we must bring >> about >> a more native fd "migration" than just re-opening an existing fd on >> the >> new graph. >> >> Think of the inode migration analogy. The handle coming from FUSE >> (the >> address of the object) is a "hint". Usually the hint is right, if the >> object in the address belongs to the latest graph. If not, using the >> GFID we resolve a new inode on the latest graph and use it. >> >> In case of FD we can do something similar, except there are not GFIDs >> (which should not be a problem). We need to make the handle coming >> from >> FUSE (the address of fd_t) just a hint. If the >> fd->inode->table->xl->graph is the latest, then the hint was a HIT. >> If >> the graph was not the latest, we look for a previous migration >> attempt+result in the "base" (original) fd's context. If that does >> not >> exist or is not fresh (on the latest graph) then we do a new fd >> creation, open on new graph, fd_unref the old cached result in the fd >> context of the "base fd" and keep ref to this new result. All this >> must >> happen from fuse_resolve_fd(). The setting of the latest fd and >> updation >> of the latest fd pointer happens under the scope of the >> base_fd->lock() >> which gives it a very clear and unambiguous scope which was missing >> with >> the old scheme. > > I remember discussing this solution during initial design. But, not sure why we dropped it. So, Can I go ahead with the implementation? Is this fix required post 3.3? 
> The solution you are probably referring to was dropped because there we were talking about chaining FDs to the one on the "next graph" as graphs keep getting changed. The one described above is different because here there will one base fd (the original one on which open() by fuse was performed) and new graphs result in creation of an internal new fd directly referred by the base fd (and naturally unref the previous "new fd") thereby keeping things quite trim. Avati From anand.avati at gmail.com Tue May 22 20:09:52 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 22 May 2012 13:09:52 -0700 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: <20120522090848.GC3976@homeworld.netbsd.org> References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> <20120522090848.GC3976@homeworld.netbsd.org> Message-ID: On Tue, May 22, 2012 at 2:08 AM, Emmanuel Dreyfus wrote: > > Further investigation shows you are right. I traced the > struct fuse_entry_out returned by glusterfs on LOOKUP; > > "/subdir", uid = 500, gid = 500, mode = 040755, attr_valid = 1 > ... > "/subdir", uid = 0, gid = 0, mode = 040700, attr_valid = 1 > Note that even mode has changed, not just the uid/gid. It will probably help if you can put a breakpoint in this case and inspect the stack about where these attribute fields are fetched from (some cache? from posix?) Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Wed May 23 02:04:25 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 23 May 2012 04:04:25 +0200 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: Message-ID: <1kkj4ca.1knxmw01kr7wlgM%manu@netbsd.org> Anand Avati wrote: > Note that even mode has changed, not just the uid/gid. It will probably > help if you can put a breakpoint in this case and inspect the stack about > where these attribute fields are fetched from (some cache? from posix?) My tests shows that the garbage is introduced by mdc_inode_iatt_get() in mdc_lookup(). -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vijay at build.gluster.com Wed May 23 13:57:15 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Wed, 23 May 2012 06:57:15 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa43 released Message-ID: <20120523135718.0E6111008C@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa43/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz This release is made off v3.3.0qa43 From manu at netbsd.org Wed May 23 16:58:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 23 May 2012 16:58:02 +0000 Subject: [Gluster-devel] preparent and postparent? Message-ID: <20120523165802.GC17268@homeworld.netbsd.org> Hi in the protocol/server xlator, there are many occurences where callbacks have a struct iatt for preparent and postparent. What are these for? Is it a normal behavior to have different things in preparent and postparent? -- Emmanuel Dreyfus manu at netbsd.org From jdarcy at redhat.com Wed May 23 17:03:41 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Wed, 23 May 2012 13:03:41 -0400 Subject: [Gluster-devel] preparent and postparent? 
In-Reply-To: <20120523165802.GC17268@homeworld.netbsd.org> References: <20120523165802.GC17268@homeworld.netbsd.org> Message-ID: <20120523130341.1ee693a3@jdarcy-dt.usersys.redhat.com> On Wed, 23 May 2012 16:58:02 +0000 Emmanuel Dreyfus wrote: > in the protocol/server xlator, there are many occurences where > callbacks have a struct iatt for preparent and postparent. What are > these for? NFS needs them to support its style of caching. From manu at netbsd.org Thu May 24 01:31:18 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 24 May 2012 03:31:18 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <20120523130341.1ee693a3@jdarcy-dt.usersys.redhat.com> Message-ID: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> Jeff Darcy wrote: > > in the protocol/server xlator, there are many occurences where > > callbacks have a struct iatt for preparent and postparent. What are > > these for? > > NFS needs them to support its style of caching. Let me rephrase: what information is stored in preparent and postparent? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Thu May 24 04:29:39 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 24 May 2012 06:29:39 +0200 Subject: [Gluster-devel] gerrit Message-ID: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> Hi In gerrit, if I sign it and look at the Download field in a patchset, I see this: git fetch ssh://manu@/glusterfs refs/changes/13/3413/2 && git format-patch -1 --stdout FETCH_HEAD It would be nice if I had ssh://manu at git.gluster.com/glusterfs.git so that the line can be copy/pasted without the need to edit each time. Is it something I need to configure (where?), or is it a global setting beyond my reach (in that case, please someone fix it!) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Thu May 24 06:30:20 2012 From: anand.avati at gmail.com (Anand Avati) Date: Wed, 23 May 2012 23:30:20 -0700 Subject: [Gluster-devel] gerrit In-Reply-To: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> References: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> Message-ID: fixed! On Wed, May 23, 2012 at 9:29 PM, Emmanuel Dreyfus wrote: > Hi > > In gerrit, if I sign it and look at the Download field in a patchset, I > see this: > > git fetch ssh://manu@/glusterfs refs/changes/13/3413/2 && git > format-patch -1 --stdout FETCH_HEAD > > It would be nice if I had ssh://manu at git.gluster.com/glusterfs.git > so that the line can be copy/pasted without the need to edit each time. > Is it something I need to configure (where?), or is it a global setting > beyond my reach (in that case, please someone fix it!) > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at datalab.es Thu May 24 07:10:59 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Thu, 24 May 2012 09:10:59 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> Message-ID: <4FBDDF03.8080203@datalab.es> On 05/24/2012 03:31 AM, Emmanuel Dreyfus wrote: > Jeff Darcy wrote: > >>> in the protocol/server xlator, there are many occurences where >>> callbacks have a struct iatt for preparent and postparent. 
What are >>> these for? >> NFS needs them to support its style of caching. > Let me rephrase: what information is stored in preparent and postparent? preparent and postparent have the attributes (modification time, size, permissions, ...) of the parent directory of the file being modified before and after the modification is done. Xavi From jdarcy at redhat.com Thu May 24 13:05:08 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 24 May 2012 09:05:08 -0400 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <4FBDDF03.8080203@datalab.es> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> <4FBDDF03.8080203@datalab.es> Message-ID: <4FBE3204.7050005@redhat.com> On 05/24/2012 03:10 AM, Xavier Hernandez wrote: > preparent and postparent have the attributes (modification time, size, > permissions, ...) of the parent directory of the file being modified > before and after the modification is done. Thank you, Xavi. :) If you really want to have some fun, you can take a look at the rename callback, which has pre- and post-attributes for both the old and new parent. From johnmark at redhat.com Thu May 24 19:21:22 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 24 May 2012 15:21:22 -0400 (EDT) Subject: [Gluster-devel] glusterfs-3.3.0qa43 released In-Reply-To: <20120523135718.0E6111008C@build.gluster.com> Message-ID: <7c8ea685-d794-451e-820a-25f784e7873d@zmail01.collab.prod.int.phx2.redhat.com> A reminder: As we come down to the final days, it is vitally important that we test these last few qa releases. This one, in particular, contains fixes added to the 3.3 branch after beta 4 was release last week: http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz Please consider using the testing page when evaluating: http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests Also, if someone would like to test the object storage as well as the HDFS piece, please report here, or create another test page on the wiki. Finally, you can track all commits to the master and 3.3 branches on Twitter (@glusterdev) ...and via Atom/Rss - https://github.com/gluster/glusterfs/commits/release-3.3.atom https://github.com/gluster/glusterfs/commits/master.atom -JM ----- Original Message ----- > > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa43/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz > > This release is made off v3.3.0qa43 > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From xhernandez at datalab.es Fri May 25 07:28:43 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Fri, 25 May 2012 09:28:43 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <4FBE3204.7050005@redhat.com> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> <4FBDDF03.8080203@datalab.es> <4FBE3204.7050005@redhat.com> Message-ID: <4FBF34AB.6070606@datalab.es> On 05/24/2012 03:05 PM, Jeff Darcy wrote: > On 05/24/2012 03:10 AM, Xavier Hernandez wrote: >> preparent and postparent have the attributes (modification time, size, >> permissions, ...) of the parent directory of the file being modified >> before and after the modification is done. > Thank you, Xavi. :) If you really want to have some fun, you can take a look > at the rename callback, which has pre- and post-attributes for both the old and > new parent. Yes, I've had some "fun" with them. 
Without them almost all callbacks would seem too short to me now... hehehe From fernando.frediani at qubenet.net Fri May 25 09:44:10 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 09:44:10 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> Hi, I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 and the new type of volume striped + replicated. My go is to use it to run Virtual Machines (.vmdk files). Volume is created fine and the ESXi server mountw the Datastore using Gluster built-in NFS, however when trying to use the Datastore or even read, it hangs. Looking at the Gluster NFS logs I see: "[socket.c:195:__socket_rwv] 0-socket.nfs-server: readv failed (Connection reset by peer)" In order to get the rpm files installed I had first to install these two because of the some libraries: "compat-readline5-5.2-17.1.el6.x86_64".rpm and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has anything to do with that. Has anyone ever used Gluster as a backend storage for ESXi ? Does it actually work ? Regards, Fernando Frediani Lead Systems Engineer Qube Managed Services Limited 260-266 Goswell Road, London, EC1V 7EB, United Kingdom sales: +44 (0) 20 7150 3800 ddi: +44 (0) 20 7150 3803 fax: +44 (0) 20 7336 8420 web: http://www.qubenet.net/ P Please consider the environment before printing this email -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.frediani at qubenet.net Fri May 25 11:36:55 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 11:36:55 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Actually, even on another Linux machine mounting NFS has the same behaviour. I am able to mount it with "mount -t nfs ..." but when I try "ls" it hangs as well. One particular thing of the Gluster servers is that they have two networks, one for management with default gateway and another only for storage. I am only able to mount on the storage network. The hosts file has all nodes' names with the ips on the storage network. I tried to use this but didn't work either. 
gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* Watching the nfs logs when I try a "ls" from the remote client it shows: pending frames: patchset: git://git.gluster.com/glusterfs.git signal received: 11 time of crash: 2012-05-25 11:38:09 configuration details: argp 1 backtrace 1 dlfcn 1 fdatasync 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 3.3.0beta4 /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup+0xa5)[0x7f1c8e7a6ac5] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdirp_cbk+0x536)[0x7f1c8e543346] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdirp_cbk+0x1ca)[0x7f1c8e76269a] /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_readdirp_cbk+0x170)[0x7f1c8e9dbbe0] /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x7f1c8f818c8f] /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_handler+0x188)[0x7f1c8f818e38] /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] /usr/sbin/glusterfs(main+0x502)[0x406612] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] /usr/sbin/glusterfs[0x404399] Thanks Fernando From: Fernando Frediani (Qube) Sent: 25 May 2012 10:44 To: 'gluster-devel at nongnu.org' Subject: Can't use NFS with VMware ESXi Hi, I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 and the new type of volume striped + replicated. My go is to use it to run Virtual Machines (.vmdk files). Volume is created fine and the ESXi server mountw the Datastore using Gluster built-in NFS, however when trying to use the Datastore or even read, it hangs. Looking at the Gluster NFS logs I see: "[socket.c:195:__socket_rwv] 0-socket.nfs-server: readv failed (Connection reset by peer)" In order to get the rpm files installed I had first to install these two because of the some libraries: "compat-readline5-5.2-17.1.el6.x86_64".rpm and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has anything to do with that. Has anyone ever used Gluster as a backend storage for ESXi ? Does it actually work ? Regards, Fernando Frediani Lead Systems Engineer Qube Managed Services Limited 260-266 Goswell Road, London, EC1V 7EB, United Kingdom sales: +44 (0) 20 7150 3800 ddi: +44 (0) 20 7150 3803 fax: +44 (0) 20 7336 8420 web: http://www.qubenet.net/ P Please consider the environment before printing this email -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Fri May 25 13:35:19 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 25 May 2012 13:35:19 +0000 Subject: [Gluster-devel] mismatching ino/dev between file Message-ID: <20120525133519.GC19383@homeworld.netbsd.org> Hi Here is a bug with release-3.3. It happens on a 2 way replicated. 
Here is what I have in one brick: [2012-05-25 15:03:09.463446] W [posix-handle.c:487:posix_handle_hard] 0-pfs-posix: mismatching ino/dev between file /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (57943061/16) and handle /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce (57943060/16) [2012-05-25 15:03:09.463552] E [posix.c:1277:posix_symlink] 0-pfs-posix: setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 failed On the other one: [2012-05-25 15:03:09.447682] W [posix-handle.c:487:posix_handle_hard] 0-pfs-posix: mismatching ino/dev between file /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (50557989/24) and handle /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce (50557988/24) [2012-05-25 15:03:09.447774] E [posix.c:1277:posix_symlink] 0-pfs-posix: setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 failed Someone can give me a hint of what happens, and how to track it down? -- Emmanuel Dreyfus manu at netbsd.org From abperiasamy at gmail.com Fri May 25 17:09:09 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Fri, 25 May 2012 10:09:09 -0700 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with ?mount ?t nfs ?? but when I try ?ls? it hangs as > well. > > One particular thing of the Gluster servers is that they have two networks, > one for management with default gateway and another only for storage. I am > only able to mount on the storage network. > > The hosts file has all nodes? names with the ips on the storage network. > > > > I tried to use this but didn?t work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a ?ls? 
from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup+0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdirp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdirp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_readdirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_handler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I?ve setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 > and the new type of volume striped + replicated. My go is to use it to run > Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or even > read, it hangs. > > > > Looking at the Gluster NFS logs I see: ????[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)? > > > > In order to get the rpm files installed I had first to install these two > because of the some libraries: ?compat-readline5-5.2-17.1.el6.x86_64?.rpm > and ?openssl098e-0.9.8e-17.el6.centos.x86_64.rpm?.Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From pmatthaei at debian.org Fri May 25 18:56:37 2012 From: pmatthaei at debian.org (=?ISO-8859-1?Q?Patrick_Matth=E4i?=) Date: Fri, 25 May 2012 20:56:37 +0200 Subject: [Gluster-devel] glusterfs-3.2.7qa1 released In-Reply-To: <20120412172933.6A2A8102E6@build.gluster.com> References: <20120412172933.6A2A8102E6@build.gluster.com> Message-ID: <4FBFD5E5.1060901@debian.org> Am 12.04.2012 19:29, schrieb Vijay Bellur: > > http://bits.gluster.com/pub/gluster/glusterfs/3.2.7qa1/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.2.7qa1.tar.gz > > This release is made off v3.2.7qa1 Hey, I have tested this qa release and could not find any regression/problem. It would be realy nice to have a 3.2.7 release in the next days (max 2 weeks from now on) so that we could ship glusterfs 3.2.7 instead of 3.2.6 with our next release Debian Wheezy! -- /* Mit freundlichem Gru? / With kind regards, Patrick Matth?i GNU/Linux Debian Developer E-Mail: pmatthaei at debian.org patrick at linux-dev.org */ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From fernando.frediani at qubenet.net Fri May 25 19:33:37 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 19:33:37 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. 
> > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From fernando.frediani at qubenet.net Fri May 25 20:32:25 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 20:32:25 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. 
I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
--
Anand Babu Periasamy
Blog [http://www.unlocksmith.org]

Imagination is more important than knowledge --Albert Einstein

From manu at netbsd.org  Sat May 26 05:37:51 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Sat, 26 May 2012 07:37:51 +0200
Subject: [Gluster-devel] NULL loc in posix_acl_truncate
Message-ID: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org>

Here is a bug in release-3.3:

./xinstall -c -p -r -m 555 xinstall
/pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/i386--netbsdelf-instal
xinstall: /pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/inst.00033a:
chmod: Permission denied

Kernel trace, client side:
 33      1 xinstall CALL  open(0xbfbfd8e0,0xa02,0x180)
 33      1 xinstall NAMI  "/pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/inst.00033a"
 33      1 xinstall RET   open 3
 33      1 xinstall CALL  open(0x (...)
 33      1 xinstall CALL  fchmod(3,0x16d)
 33      1 xinstall RET   fchmod -1 errno 13 Permission denied

I tracked this down to posix_acl_truncate() on the server, where loc->inode
and loc->path are NULL. This code goes to the red label and raises EACCES:

        if (acl_permits (frame, loc->inode, POSIX_ACL_WRITE))
                goto green;
        else
                goto red;

Here is the relevant backtrace:

#9  0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000,
    loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898
#10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0,
    this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0)
    at posix.c:204
#11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710,
    this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0)
    at defaults.c:47
#12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000,
    loc=0xba60091c, xdata=0x0) at posix.c:231

In frame 12, loc->inode is not NULL, and loc->path makes sense:
"/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/inst.01911a"

In frame 10, loc->path and loc->inode are NULL.

I note that xlators/features/locks/src/posix.c:pl_ftruncate() sets
truncate_stat_cbk() as the callback, and not ftruncate_stat_cbk(). That
latter function does not even exist. f-style functions not calling f-style
callbacks have been the root of various bugs so far, is it one more of them?
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From vbellur at redhat.com  Sat May 26 07:44:52 2012
From: vbellur at redhat.com (Vijay Bellur)
Date: Sat, 26 May 2012 13:14:52 +0530
Subject: [Gluster-devel] NULL loc in posix_acl_truncate
In-Reply-To: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org>
References: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org>
Message-ID: <4FC089F4.3070004@redhat.com>

On 05/26/2012 11:07 AM, Emmanuel Dreyfus wrote:
> Here is a bug in release-3.3:
>
> I tracked this down to posix_acl_truncate() on the server, where loc->inode
> and loc->path are NULL.
This code goes red and raise EACCESS: > > if (acl_permits (frame, loc->inode, POSIX_ACL_WRITE)) > goto green; > else > goto red; > > Here is the relevant baccktrace: > > #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, > loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 > #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, > this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) > at posix.c:204 > #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, > this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) > at defaults.c:47 > #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, > loc=0xba60091c, xdata=0x0) at posix.c:231 > > In frame 12, loc->inode is not NULL, and loc->path makes sense: > "/netbsd/usr/src/tooldir.NetBSD-6.9 > 9.4-i386/bin/inst.01911a" > > In frame 10, loc->path and loc->inode are NULL. > > In note that xlators/features/locks/src/posix.c:pl_ftruncate() sets > truncate_stat_cbk() as the callback, and not ftruncate_stat_cbk(). That later > function does not even exist. f-style functions not calling f-style callbacks > have been the root of various bugs so far, is it one more of them? I don't think it is a f-style problem. I do not get a EPERM with the testcase that you posted for qa39. Can you please provide a bigger bt? Thanks, Vijay > > From manu at netbsd.org Sat May 26 09:00:22 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 11:00:22 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkp7w9.1a5c4mz1tiqw8rM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. Can you please provide a bigger bt? 
#3 0xb99414c4 in server_truncate_cbk (frame=0xba901714, cookie=0xbb77f010, this=0xb9d27000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at server3_1-fops.c:1218 #4 0xb9968bd6 in io_stats_truncate_cbk (frame=0xbb77f010, cookie=0xbb77f080, this=0xb9d26000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at io-stats.c:1600 #5 0xb998036e in marker_truncate_cbk (frame=0xbb77f080, cookie=0xbb77f0f0, this=0xb9d25000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at marker.c:1535 #6 0xbbb87a85 in default_truncate_cbk (frame=0xbb77f0f0, cookie=0xbb77f160, this=0xb9d24000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at defaults.c:58 #7 0xb99a8fa2 in iot_truncate_cbk (frame=0xbb77f160, cookie=0xbb77f400, this=0xb9d23000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at io-threads.c:1270 #8 0xb99b9fe0 in pl_truncate_cbk (frame=0xbb77f400, cookie=0xbb77f780, this=0xb9d22000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at posix.c:119 #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at posix.c:204 #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at defaults.c:47 #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, loc=0xba60091c, xdata=0x0) at posix.c:231 #13 0xbbb94d76 in default_stat (frame=0xbb77f6a0, this=0xb9d20000, loc=0xba60091c, xdata=0x0) at defaults.c:1231 #14 0xb99babb0 in pl_truncate (frame=0xbb77f400, this=0xb9d22000, loc=0xba60091c, offset=48933, xdata=0x0) at posix.c:249 #15 0xb99a91ac in iot_truncate_wrapper (frame=0xbb77f160, this=0xb9d23000, loc=0xba60091c, offset=48933, xdata=0x0) at io-threads.c:1280 #16 0xbbba76d8 in call_resume_wind (stub=0xba6008fc) at call-stub.c:2474 #17 0xbbbae729 in call_resume (stub=0xba6008fc) at call-stub.c:4151 #18 0xb99a22a3 in iot_worker (data=0xb9d12110) at io-threads.c:131 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 11:51:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 13:51:46 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkpd53.bn09pz1v8qmwtM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. I wonder if the bug can occur because some mess in the .glusterfs directory cause by an earlier problem. Is it possible? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 12:55:08 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 14:55:08 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkpd53.bn09pz1v8qmwtM%manu@netbsd.org> Message-ID: <1kkpirc.geu5yvq0165fM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I wonder if the bug can occur because some mess in the .glusterfs > directory cause by an earlier problem. Is it possible? That is not the problem: I nuked .glusterfs on all bricks and the problem remain. 
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From manu at netbsd.org  Sat May 26 14:20:10 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Sat, 26 May 2012 16:20:10 +0200
Subject: [Gluster-devel] NULL loc in posix_acl_truncate
In-Reply-To: <4FC089F4.3070004@redhat.com>
Message-ID: <1kkpmmr.rrgubdjz6w9fM%manu@netbsd.org>

Vijay Bellur wrote:

> I don't think it is a f-style problem. I do not get a EPERM with the
> testcase that you posted for qa39. Can you please provide a bigger bt?

Here is a minimal test case that reproduces the problem on my setup. Run it
as an unprivileged user in a directory on which you have write access:

$ pwd
/pfs/manu/xinstall
$ ls -ld .
drwxr-xr-x  4 manu  manu  512 May 26 16:17 .
$ id
uid=500(manu) gid=500(manu) groups=500(manu),0(wheel)
$ ./test
test: fchmod failed: Permission denied

#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <sysexits.h>
#include <unistd.h>

#define TESTFILE "testfile"

int main(void)
{
        int fd;
        char buf[16384];

        if ((unlink(TESTFILE) == -1) && (errno != ENOENT))
                err(EX_OSERR, "unlink failed");

        if ((fd = open(TESTFILE, O_CREAT|O_EXCL|O_RDWR, 0600)) == -1)
                err(EX_OSERR, "open failed");

        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                err(EX_OSERR, "write failed");

        if (fchmod(fd, 0555) == -1)
                err(EX_OSERR, "fchmod failed");

        if (close(fd) == -1)
                err(EX_OSERR, "close failed");

        return EX_OK;
}

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From manu at netbsd.org  Sun May 27 05:17:51 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Sun, 27 May 2012 07:17:51 +0200
Subject: [Gluster-devel] NULL loc in posix_acl_truncate
In-Reply-To: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org>
Message-ID: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org>

Emmanuel Dreyfus wrote:

> In frame 10, loc->path and loc->inode are NULL.

Here is the investigation so far:
xlators/features/locks/src/posix.c:truncate_stat_cbk() has a NULL
loc->inode, and this leads to the acl check that fails.

As I understand it, this is a FUSE implementation problem. fchmod() produces
a FUSE SETATTR. If the file is being written, NetBSD FUSE will set mode,
size, atime, mtime, and fh in this operation. I suspect Linux FUSE only sets
mode and fh, and this is why the bug does not appear on Linux: the truncate
code path is probably not involved.

Can someone confirm? If this is the case, it suggests the code path may have
never been tested. I suspect there are bugs there, for instance, in
pl_truncate_cbk, local is erased after being retrieved, which does not look
right:

        local = frame->local;

        local = mem_get0 (this->local_pool);

        if (local->op == TRUNCATE)
                loc_wipe (&local->loc);

I tried fixing that one without much improvement. There may be other
problems.

About fchmod() setting size: is it a reasonable behavior? FUSE does not
specify what must happen, so if glusterfs relies on the Linux kernel not
doing it, it may be begging for future bugs if that behavior changes.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From vbellur at redhat.com  Sun May 27 06:54:43 2012
From: vbellur at redhat.com (Vijay Bellur)
Date: Sun, 27 May 2012 12:24:43 +0530
Subject: [Gluster-devel] NULL loc in posix_acl_truncate
In-Reply-To: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org>
References: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org>
Message-ID: <4FC1CFB3.7050808@redhat.com>

On 05/27/2012 10:47 AM, Emmanuel Dreyfus wrote:
> Emmanuel Dreyfus wrote:
>
>> In frame 10, loc->path and loc->inode are NULL.
>
> As I understand it, this is a FUSE implementation problem. fchmod()
> produces a FUSE SETATTR.
This might probably related to the dirname() differences in BSD? Have you noticed this after the GNU dirname usage? Avati On Fri, May 25, 2012 at 6:35 AM, Emmanuel Dreyfus wrote: > Hi > > Here is a bug with release-3.3. It happens on a 2 way replicated. Here is > what I have in one brick: > > [2012-05-25 15:03:09.463446] W [posix-handle.c:487:posix_handle_hard] > 0-pfs-posix: mismatching ino/dev between file > /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (57943061/16) > and handle > /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce > (57943060/16) > [2012-05-25 15:03:09.463552] E [posix.c:1277:posix_symlink] 0-pfs-posix: > setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 > failed > > > On the other one: > > [2012-05-25 15:03:09.447682] W [posix-handle.c:487:posix_handle_hard] > 0-pfs-posix: mismatching ino/dev between file > /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (50557989/24) > and handle > /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce > (50557988/24) > [2012-05-25 15:03:09.447774] E [posix.c:1277:posix_symlink] 0-pfs-posix: > setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 > failed > > Someone can give me a hint of what happens, and how to track it down? > -- > Emmanuel Dreyfus > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Mon May 28 01:52:41 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 03:52:41 +0200 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: Message-ID: <1kkscxr.1k0ou1xcxcd7rM%manu@netbsd.org> Anand Avati wrote: > Can you give some more steps how you reproduced this? This has never > happened in any of our testing. This might probably related to the > dirname() differences in BSD? Have you noticed this after the GNU dirname > usage? I will investigate further. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 02:08:19 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 04:08:19 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkqxw0.1smapd1jsih9iM%manu@netbsd.org> Message-ID: <1kkscze.1y0ip7wj3y9uoM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I modified by FUSE implementation to send FATTR_SIZE|FATTR_FH in one > request, and FATTR_MODE|FATTR_FH|FATTR_MTIME|FATTR_ATIME in a separate > one, and the test passes fine. Um, I spoke too fast. Please disreagard the previous post. The problem was not setting size, and mode in the same request. That works fine. The bug appear when setting size, atime and mtime. It also appear when setting mode, atime and mtime. So here is the summary so far: ATTR_SIZE|FATTR_FH -> ok ATTR_SIZE|FATTR_FH|FATTR_ATIME|FATTR_MTIME -> breaks (*) ATTR_MODE|FATTR_FH -> ok ATTR_MODE|FATTR_FH|FATTR_ATIME|FATTR_MTIME -> breaks ATTR_MODE|FATTR_SIZE|FATTR_FH -> ok (I was wrong here) (*) I noticed that one long time ago, and NetBSD FUSE already strips atime and mtime if ATTR_SIZE is set without ATTR_MODE|ATTR_UID|ATTR_GID. 
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 05:07:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 07:07:46 +0200 Subject: [Gluster-devel] Testing server down in replicated volume Message-ID: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> Hi everybody After the last fix in NetBSD FUSE (cf NULL loc in posix_acl_truncate), glusterfs release-3.3 now behaves quite nicely on NetBSD. I have been able to build stuff in a replicated glusterfs volume for a few hours, and it seems much faster than 3.2.6. However things turn badly when I tried to kill glusterfsd on a server. Since the volume is replicated, I would have expected the build to carry on unaffected. but this is now what happens: a ENOTCONN is raised up to the processes using the glusterfs volume: In file included from /pfs/manu/netbsd/usr/src/sys/sys/signal.h:114, from /pfs/manu/netbsd/usr/src/sys/sys/param.h:150, from /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/net/__cmsg_align bytes.c:40: /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string /machine/signal.h: Socket is not connected Is it the intended behavior? Here is the client log: [2012-05-28 05:48:27.440017] W [socket.c:195:__socket_rwv] 0-pfs-client-1: writev failed (Broken pipe) [2012-05-28 05:48:27.440989] W [socket.c:195:__socket_rwv] 0-pfs-client-1: readv failed (Connection reset by peer) [2012-05-28 05:48:27.441496] W [socket.c:1512:__socket_proto_state_machine] 0-pfs-client-1: reading from socket failed. Error (Connection reset by peer), peer (193.54.82.98:24011) [2012-05-28 05:48:27.441825] E [rpc-clnt.c:373:saved_frames_unwind] 0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(INODELK(29)) called at 2012-05-28 05:48:27.439249 (xid=0x1715867x) [2012-05-28 05:48:27.442222] W [client3_1-fops.c:1495:client3_1_inodelk_cbk] 0-pfs-client-1: remote operation failed: Socket is not connected [2012-05-28 05:48:27.442528] E [rpc-clnt.c:373:saved_frames_unwind] 0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(SETATTR(38)) called at 2012-05-28 05:48:27.440397 (xid=0x1715868x) [2012-05-28 05:48:27.442971] W [client3_1-fops.c:1954:client3_1_setattr_cbk] 0-pfs-client-1: remote operation failed: Socket is not connected (and so on with other saved_frames_unwind) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 05:08:36 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 07:08:36 +0200 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: <1kkscxr.1k0ou1xcxcd7rM%manu@netbsd.org> Message-ID: <1kksmhc.zfnn6i6bllp8M%manu@netbsd.org> Emmanuel Dreyfus wrote: > > Can you give some more steps how you reproduced this? This has never > > happened in any of our testing. This might probably related to the > > dirname() differences in BSD? Have you noticed this after the GNU dirname > > usage? > I will investigate further. It does not happen anymore. I think it was a consequence of the other bug I fixed. 
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From manu at netbsd.org  Mon May 28 05:07:46 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Mon, 28 May 2012 07:07:46 +0200
Subject: [Gluster-devel] Testing server down in replicated volume
Message-ID: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org>

Hi everybody

After the last fix in NetBSD FUSE (cf. NULL loc in posix_acl_truncate),
glusterfs release-3.3 now behaves quite nicely on NetBSD. I have been able
to build stuff in a replicated glusterfs volume for a few hours, and it
seems much faster than 3.2.6.

However, things turned bad when I tried to kill glusterfsd on a server.
Since the volume is replicated, I would have expected the build to carry on
unaffected, but this is not what happens: an ENOTCONN is raised up to the
processes using the glusterfs volume:

In file included from /pfs/manu/netbsd/usr/src/sys/sys/signal.h:114,
  from /pfs/manu/netbsd/usr/src/sys/sys/param.h:150,
  from /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/net/__cmsg_alignbytes.c:40:
/pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error:
/pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string/machine/signal.h: Socket is not connected

Is it the intended behavior? Here is the client log:

[2012-05-28 05:48:27.440017] W [socket.c:195:__socket_rwv] 0-pfs-client-1:
writev failed (Broken pipe)
[2012-05-28 05:48:27.440989] W [socket.c:195:__socket_rwv] 0-pfs-client-1:
readv failed (Connection reset by peer)
[2012-05-28 05:48:27.441496] W [socket.c:1512:__socket_proto_state_machine]
0-pfs-client-1: reading from socket failed. Error (Connection reset by
peer), peer (193.54.82.98:24011)
[2012-05-28 05:48:27.441825] E [rpc-clnt.c:373:saved_frames_unwind]
0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(INODELK(29))
called at 2012-05-28 05:48:27.439249 (xid=0x1715867x)
[2012-05-28 05:48:27.442222] W [client3_1-fops.c:1495:client3_1_inodelk_cbk]
0-pfs-client-1: remote operation failed: Socket is not connected
[2012-05-28 05:48:27.442528] E [rpc-clnt.c:373:saved_frames_unwind]
0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(SETATTR(38))
called at 2012-05-28 05:48:27.440397 (xid=0x1715868x)
[2012-05-28 05:48:27.442971] W [client3_1-fops.c:1954:client3_1_setattr_cbk]
0-pfs-client-1: remote operation failed: Socket is not connected

(and so on with other saved_frames_unwind)
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From manu at netbsd.org  Mon May 28 05:08:36 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Mon, 28 May 2012 07:08:36 +0200
Subject: [Gluster-devel] mismatching ino/dev between file
In-Reply-To: <1kkscxr.1k0ou1xcxcd7rM%manu@netbsd.org>
Message-ID: <1kksmhc.zfnn6i6bllp8M%manu@netbsd.org>

Emmanuel Dreyfus wrote:

> > Can you give some more steps how you reproduced this? This has never
> > happened in any of our testing. This might probably related to the
> > dirname() differences in BSD? Have you noticed this after the GNU dirname
> > usage?
> I will investigate further.

It does not happen anymore. I think it was a consequence of the other bug I
fixed.
----- Original Message ----- From: "Emmanuel Dreyfus" To: "Emmanuel Dreyfus" Cc: gluster-devel at nongnu.org Sent: Tuesday, May 29, 2012 1:25:09 PM Subject: Re: [Gluster-devel] Testing server down in replicated volume On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From bfoster at redhat.com Wed May 30 15:16:16 2012 From: bfoster at redhat.com (Brian Foster) Date: Wed, 30 May 2012 11:16:16 -0400 Subject: [Gluster-devel] glusterfs client and page cache Message-ID: <4FC639C0.6020503@redhat.com> Hi all, I've been playing with a little hack recently to add a gluster mount option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts on whether there's value to find an intelligent way to support this functionality. To provide some context: Our current behavior with regard to fuse is that page cache is utilized by fuse, from what I can tell, just about in the same manner as a typical local fs. The primary difference is that by default, the address space mapping for an inode is completely invalidated on open. So for example, if process A opens and reads a file in a loop, subsequent reads are served from cache (bypassing fuse and gluster). If process B steps in and opens the same file, the cache is flushed and the next reads from either process are passed down through fuse. The FOPEN_KEEP_CACHE option simply disables this cache flash on open behavior. The following are some notes on my experimentation thus far: - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size changes. This is a problem in that I can rewrite some or all of a file from another client and the cached client wouldn't notice. I've sent a patch to fuse-devel to also invalidate on mtime changes (similar to nfsv3 or cifs), so we'll see how well that is received. fuse also supports a range based invalidation notification that we could take advantage of if necessary. - I reproduce a measurable performance benefit in the local/cached read situation. For example, running a kernel compile against a source tree in a gluster volume (no other xlators and build output to local storage) improves to 6 minutes from just under 8 minutes with the default graph (9.5 minutes with only the client xlator and 1:09 locally). - Some of the specific differences from current io-cache caching: - io-cache supports time based invalidation and tunables such as cache size and priority. The page cache has no such controls. - io-cache invalidates more frequently on various fops. It also looks like we invalidate on writes and don't take advantage of the write data most recently sent, whereas page cache writes are cached (errors notwithstanding). - Page cache obviously has tighter integration with the system (i.e., drop_caches controls, more specific reporting, ability to drop cache when memory is needed). All in all, I'm curious what people think about enabling the cache behavior in gluster. 
We could support anything from the basic mount option I'm currently using (i.e., similar to attribute/dentry caching) to something integrated with io-cache (doing invalidations when necessary), or maybe even something eventually along the lines of the nfs weak cache consistency model where it validates the cache after every fop based on file attributes. In general, are there other big issues/questions that would need to be explored before this is useful (i.e., the size invalidation issue)? Are there other performance tests that should be explored? Thoughts appreciated. Thanks. Brian From fernando.frediani at qubenet.net Wed May 30 16:19:36 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Wed, 30 May 2012 16:19:36 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> Does anyone have an idea of this problem of not being able to power up the virtual machines on that NFS mount ? Also what do those logs mean that Anand say that there is a problem with the Repstr model. Is it something isn't finished yet ? Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 21:32 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. 
Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. 
> > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Wed May 30 19:32:50 2012 From: anand.avati at gmail.com (Anand Avati) Date: Wed, 30 May 2012 12:32:50 -0700 Subject: [Gluster-devel] glusterfs client and page cache In-Reply-To: <4FC639C0.6020503@redhat.com> References: <4FC639C0.6020503@redhat.com> Message-ID: Brian, You are right, today we hardly leverage the page cache in the kernel. When Gluster started and performance translators were implemented, the fuse invalidation support did not exist, and since that support was brought in upstream fuse we haven't leveraged that effectively. We can actually do a lot more smart things using the invalidation changes. For the consistency concerns where an open fd continues to refer to local page cache - if that is a problem, today you need to mount with --enable-direct-io-mode to bypass the page cache altogether (this is very different from O_DIRECT open() support). On the other hand, to utilize the fuse invalidation APIs and promote using the page cache and still be consistent, we need to gear up glusterfs framework by first implementing server originated messaging support, then build some kind of opportunistic locking or leases to notify glusterfs clients about modifications from a second client, and third implement hooks in the client side listener to do things like sending fuse invalidations or purge pages in io-cache or flush pending writes in write-behind etc. This needs to happen, but we're short on resources to prioritize this sooner :-) Avati On Wed, May 30, 2012 at 8:16 AM, Brian Foster wrote: > Hi all, > > I've been playing with a little hack recently to add a gluster mount > option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts > on whether there's value to find an intelligent way to support this > functionality. To provide some context: > > Our current behavior with regard to fuse is that page cache is utilized > by fuse, from what I can tell, just about in the same manner as a > typical local fs. The primary difference is that by default, the address > space mapping for an inode is completely invalidated on open. So for > example, if process A opens and reads a file in a loop, subsequent reads > are served from cache (bypassing fuse and gluster). If process B steps > in and opens the same file, the cache is flushed and the next reads from > either process are passed down through fuse. The FOPEN_KEEP_CACHE option > simply disables this cache flash on open behavior. 
> > The following are some notes on my experimentation thus far: > > - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size > changes. This is a problem in that I can rewrite some or all of a file > from another client and the cached client wouldn't notice. I've sent a > patch to fuse-devel to also invalidate on mtime changes (similar to > nfsv3 or cifs), so we'll see how well that is received. fuse also > supports a range based invalidation notification that we could take > advantage of if necessary. > > - I reproduce a measurable performance benefit in the local/cached read > situation. For example, running a kernel compile against a source tree > in a gluster volume (no other xlators and build output to local storage) > improves to 6 minutes from just under 8 minutes with the default graph > (9.5 minutes with only the client xlator and 1:09 locally). > > - Some of the specific differences from current io-cache caching: > - io-cache supports time based invalidation and tunables such as > cache > size and priority. The page cache has no such controls. > - io-cache invalidates more frequently on various fops. It also > looks > like we invalidate on writes and don't take advantage of the write data > most recently sent, whereas page cache writes are cached (errors > notwithstanding). > - Page cache obviously has tighter integration with the system > (i.e., > drop_caches controls, more specific reporting, ability to drop cache > when memory is needed). > > All in all, I'm curious what people think about enabling the cache > behavior in gluster. We could support anything from the basic mount > option I'm currently using (i.e., similar to attribute/dentry caching) > to something integrated with io-cache (doing invalidations when > necessary), or maybe even something eventually along the lines of the > nfs weak cache consistency model where it validates the cache after > every fop based on file attributes. > > In general, are there other big issues/questions that would need to be > explored before this is useful (i.e., the size invalidation issue)? Are > there other performance tests that should be explored? Thoughts > appreciated. Thanks. > > Brian > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bfoster at redhat.com Wed May 30 23:10:58 2012 From: bfoster at redhat.com (Brian Foster) Date: Wed, 30 May 2012 19:10:58 -0400 Subject: [Gluster-devel] glusterfs client and page cache In-Reply-To: References: <4FC639C0.6020503@redhat.com> Message-ID: <4FC6A902.9010406@redhat.com> On 05/30/2012 03:32 PM, Anand Avati wrote: > Brian, > You are right, today we hardly leverage the page cache in the kernel. > When Gluster started and performance translators were implemented, the > fuse invalidation support did not exist, and since that support was > brought in upstream fuse we haven't leveraged that effectively. We can > actually do a lot more smart things using the invalidation changes. > > For the consistency concerns where an open fd continues to refer to > local page cache - if that is a problem, today you need to mount with > --enable-direct-io-mode to bypass the page cache altogether (this is > very different from O_DIRECT open() support). 
On the other hand, to > utilize the fuse invalidation APIs and promote using the page cache and > still be consistent, we need to gear up glusterfs framework by first > implementing server originated messaging support, then build some kind > of opportunistic locking or leases to notify glusterfs clients about > modifications from a second client, and third implement hooks in the > client side listener to do things like sending fuse invalidations or > purge pages in io-cache or flush pending writes in write-behind etc. > This needs to happen, but we're short on resources to prioritize this > sooner :-) > Thanks for the context Avati. The fuse patch I sent lead to a similar thought process with regard to finer grained invalidation. So far it seems well received, and as I understand it, we can also utilize that mechanism to do full invalidations from gluster on older fuse modules that wouldn't have that fix. I'll look into incorporating that into what I have so far and making it available for review. Brian > Avati > > On Wed, May 30, 2012 at 8:16 AM, Brian Foster > wrote: > > Hi all, > > I've been playing with a little hack recently to add a gluster mount > option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts > on whether there's value to find an intelligent way to support this > functionality. To provide some context: > > Our current behavior with regard to fuse is that page cache is utilized > by fuse, from what I can tell, just about in the same manner as a > typical local fs. The primary difference is that by default, the address > space mapping for an inode is completely invalidated on open. So for > example, if process A opens and reads a file in a loop, subsequent reads > are served from cache (bypassing fuse and gluster). If process B steps > in and opens the same file, the cache is flushed and the next reads from > either process are passed down through fuse. The FOPEN_KEEP_CACHE option > simply disables this cache flash on open behavior. > > The following are some notes on my experimentation thus far: > > - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size > changes. This is a problem in that I can rewrite some or all of a file > from another client and the cached client wouldn't notice. I've sent a > patch to fuse-devel to also invalidate on mtime changes (similar to > nfsv3 or cifs), so we'll see how well that is received. fuse also > supports a range based invalidation notification that we could take > advantage of if necessary. > > - I reproduce a measurable performance benefit in the local/cached read > situation. For example, running a kernel compile against a source tree > in a gluster volume (no other xlators and build output to local storage) > improves to 6 minutes from just under 8 minutes with the default graph > (9.5 minutes with only the client xlator and 1:09 locally). > > - Some of the specific differences from current io-cache caching: > - io-cache supports time based invalidation and tunables such > as cache > size and priority. The page cache has no such controls. > - io-cache invalidates more frequently on various fops. It > also looks > like we invalidate on writes and don't take advantage of the write data > most recently sent, whereas page cache writes are cached (errors > notwithstanding). > - Page cache obviously has tighter integration with the > system (i.e., > drop_caches controls, more specific reporting, ability to drop cache > when memory is needed). 
> > All in all, I'm curious what people think about enabling the cache > behavior in gluster. We could support anything from the basic mount > option I'm currently using (i.e., similar to attribute/dentry caching) > to something integrated with io-cache (doing invalidations when > necessary), or maybe even something eventually along the lines of the > nfs weak cache consistency model where it validates the cache after > every fop based on file attributes. > > In general, are there other big issues/questions that would need to be > explored before this is useful (i.e., the size invalidation issue)? Are > there other performance tests that should be explored? Thoughts > appreciated. Thanks. > > Brian > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > From johnmark at redhat.com Thu May 31 16:33:20 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 31 May 2012 12:33:20 -0400 (EDT) Subject: [Gluster-devel] A very special announcement from Gluster.org In-Reply-To: <344ab6e5-d6de-48d9-bfe8-e2727af7b45e@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <660ccad1-e191-405c-8645-1cb2fb02f80c@zmail01.collab.prod.int.phx2.redhat.com> Today, we?re announcing the next generation of GlusterFS , version 3.3. The release has been a year in the making and marks several firsts: the first post-acquisition release under Red Hat, our first major act as an openly-governed project and our first foray beyond NAS. We?ve also taken our first steps towards merging big data and unstructured data storage, giving users and developers new ways of managing their data scalability challenges. GlusterFS is an open source, fully distributed storage solution for the world?s ever-increasing volume of unstructured data. It is a software-only, highly available, scale-out, centrally managed storage pool that can be backed by POSIX filesystems that support extended attributes, such as Ext3/4, XFS, BTRFS and many more. This release provides many of the most commonly requested features including proactive self-healing, quorum enforcement, and granular locking for self-healing, as well as many additional bug fixes and enhancements. Some of the more noteworthy features include: ? Unified File and Object storage ? Blending OpenStack?s Object Storage API with GlusterFS provides simultaneous read and write access to data as files or as objects. ? HDFS compatibility ? Gives Hadoop administrators the ability to run MapReduce jobs on unstructured data on GlusterFS and access the data with well-known tools and shell scripts. ? Proactive self-healing ? GlusterFS volumes will now automatically restore file integrity after a replica recovers from failure. ? Granular locking ? Allows large files to be accessed even during self-healing, a feature that is particularly important for VM images. ? Replication improvements ? With quorum enforcement you can be confident that your data has been written in at least the configured number of places before the file operation returns, allowing a user-configurable adjustment to fault tolerance vs performance. Visit http://www.gluster.org to download. Packages are available for most distributions, including Fedora, Debian, RHEL, Ubuntu and CentOS. Get involved! Join us on #gluster on freenode, join our mailing list , ?like? our Facebook page , follow us on Twitter , or check out our LinkedIn group . 
GlusterFS is an open source project sponsored by Red Hat ?, who uses it in its line of Red Hat Storage products. (this post published at http://www.gluster.org/2012/05/introducing-glusterfs-3-3/ ) -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.frediani at qubenet.net Thu May 31 16:36:36 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Thu, 31 May 2012 16:36:36 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F8DC08@QUBE-TR2-EXC01.qube.qubenet.net> What is happening with this ? Non one actually care to take ownership about this ? If this is a bug why nobody is interested to get it fixed ? If not someone speak up please. Two things are not working as they supposed, I am reporting back and nobody seems to give a dam about it. -----Original Message----- From: Fernando Frediani (Qube) Sent: 30 May 2012 17:20 To: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Does anyone have an idea of this problem of not being able to power up the virtual machines on that NFS mount ? Also what do those logs mean that Anand say that there is a problem with the Repstr model. Is it something isn't finished yet ? Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 21:32 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. 
However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > 
I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From johnmark at redhat.com Thu May 31 16:48:45 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 31 May 2012 12:48:45 -0400 (EDT) Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F8DC08@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <59507de0-4264-4e27-ac94-c9b34890a5f4@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > What is happening with this ? > Non one actually care to take ownership about this ? > If this is a bug why nobody is interested to get it fixed ? If not > someone speak up please. > Two things are not working as they supposed, I am reporting back and > nobody seems to give a dam about it. Hi Fernando, If nobody is replying, it's because they don't have experience with your particular setup, or they've never seen this problem before. If you feel it's a bug, then please file a bug at http://bugzilla.redhat.com/ You can also ask questions on the IRC channel: #gluster Or on http://community.gluster.org/ I know it can be frustrating, but please understand that you will get a response only if someone out there has experience with your problem. Thanks, John Mark Community guy From manu at netbsd.org Tue May 1 02:18:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 1 May 2012 04:18:53 +0200 Subject: [Gluster-devel] Fwd: Re: Rejected NetBSD patches In-Reply-To: <4F9EED0C.2080203@redhat.com> Message-ID: <1kjeekq.1nkt3n11wtalkgM%manu@netbsd.org> Kaleb S. KEITHLEY wrote: > I haven't seen anything so far that needs to discriminate between NetBSD > and FreeBSD, but if we come across one, we can use __NetBSD__ and > __FreeBSD__ inside GF_BSD_HOST_OS. If you look at the code, NetBSD makes is way using GF_BSD_HOST_OS or GF_LINUX_HOST_OS, depending of the situation. NetBSD and FreeBSD forked 19 years ago, they had time to diverge. 
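For readers who have not looked at the code, the convention being discussed here is plain C preprocessor guards. The function below is hypothetical (it is not taken from the tree); it only shows how most code can branch on GF_BSD_HOST_OS / GF_LINUX_HOST_OS and drop down to __NetBSD__ or __FreeBSD__ only where the two systems actually diverge.

#if defined(GF_BSD_HOST_OS)
#include <sys/types.h>
#include <sys/extattr.h>
#elif defined(GF_LINUX_HOST_OS)
#include <sys/xattr.h>
#endif

/* sketch_remove_xattr() is illustrative only. */
static int
sketch_remove_xattr (const char *path, const char *key)
{
#if defined(GF_BSD_HOST_OS)
        /* NetBSD and FreeBSD share the extattr API, so one branch covers
         * both; an inner #if defined(__NetBSD__) / defined(__FreeBSD__)
         * split would only appear here if the two ever diverge. */
        return extattr_delete_file (path, EXTATTR_NAMESPACE_USER, key);
#elif defined(GF_LINUX_HOST_OS)
        return removexattr (path, key);
#else
#error "unsupported platform"
#endif
}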
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 1 03:21:28 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 1 May 2012 05:21:28 +0200 Subject: [Gluster-devel] qa39 crash In-Reply-To: <1kjdvf9.1o294sj12c16nlM%manu@netbsd.org> Message-ID: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I got a crash client-side. It happens in pthread_spin_lock() and I > recall fixing that kind of issue for a uninitialized lock. I added printf, and inode is NULL in mdc_inode_pre() therefore this is not an uninitializd lock problem. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 1 05:31:57 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 1 May 2012 07:31:57 +0200 Subject: [Gluster-devel] qa39 crash In-Reply-To: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org> Message-ID: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> Emmanuel Dreyfus wrote: > I added printf, and inode is NULL in mdc_inode_pre() therefore this is > not an uninitializd lock problem. Indeed, this this the mdc_local_t structure that seems uninitialized: (gdb) frame 3 #3 0xbaa0ecb5 in mdc_setattr_cbk (frame=0xbb7e32a0, cookie=0xbb7a4380, this=0xba3e3000, op_ret=0, op_errno=0, prebuf=0xb940a57c, postbuf=0xb940a5e4, xdata=0x0) at md-cache.c:1365 1365 mdc_inode_iatt_set (this, local->loc.inode, postbuf); (gdb) print *(mdc_local_t *)frame->local $6 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, fd = 0xb8f9d230, linkname = 0x0, xattr = 0x0} And indeed local->loc it is not initialized in mdc_fsetattr(). I suspect there is away of obteining it from fd, but this is getting beyond by knowledge of glusterfs internals. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vbellur at redhat.com Wed May 2 04:21:08 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Wed, 02 May 2012 09:51:08 +0530 Subject: [Gluster-devel] qa39 crash In-Reply-To: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> References: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> Message-ID: <4FA0B634.5090605@redhat.com> On 05/01/2012 11:01 AM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > >> I added printf, and inode is NULL in mdc_inode_pre() therefore this is >> not an uninitializd lock problem. > > Indeed, this this the mdc_local_t structure that seems uninitialized: > > (gdb) frame 3 > #3 0xbaa0ecb5 in mdc_setattr_cbk (frame=0xbb7e32a0, cookie=0xbb7a4380, > this=0xba3e3000, op_ret=0, op_errno=0, prebuf=0xb940a57c, > postbuf=0xb940a5e4, xdata=0x0) at md-cache.c:1365 > 1365 mdc_inode_iatt_set (this, local->loc.inode, postbuf); > > (gdb) print *(mdc_local_t *)frame->local > $6 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, > gfid = '\000', pargfid = '\000' > }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, > parent = 0x0, gfid = '\000', > pargfid = '\000'}, fd = 0xb8f9d230, > linkname = 0x0, xattr = 0x0} > > And indeed local->loc it is not initialized in mdc_fsetattr(). I suspect > there is away of obteining it from fd, but this is getting beyond by > knowledge of glusterfs internals. > > Do you have a test case that causes this crash? 
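As the next message shows, the crash itself came from the wrong callback being wired up; still, the "obtain it from fd" idea is workable, because an fd_t carries a reference to its inode. A hypothetical fd-based callback could look roughly like the sketch below; the demo_ names and the local structure are assumptions (they mirror the fd member of mdc_local_t), and only the mdc_inode_iatt_set() call from the backtrace above is reused.

#include "xlator.h"

/* Hypothetical local, mirroring the fd member of mdc_local_t. */
typedef struct {
        fd_t *fd;
} demo_local_t;

int32_t
demo_fsetattr_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                   int32_t op_ret, int32_t op_errno,
                   struct iatt *prebuf, struct iatt *postbuf, dict_t *xdata)
{
        demo_local_t *local = frame->local;

        /* fd_t keeps a reference to its inode, so the fd stored at wind
         * time is enough to update the attribute cache here, without
         * needing a populated loc. */
        if (op_ret == 0 && local && local->fd)
                mdc_inode_iatt_set (this, local->fd->inode, postbuf);

        STACK_UNWIND_STRICT (fsetattr, frame, op_ret, op_errno,
                             prebuf, postbuf, xdata);
        return 0;
}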
Vijay From anand.avati at gmail.com Wed May 2 05:29:22 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 1 May 2012 22:29:22 -0700 Subject: [Gluster-devel] qa39 crash In-Reply-To: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> References: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org> <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> Message-ID: Can you confirm if this fixes (obvious bug) - diff --git a/xlators/performance/md-cache/src/md-cache.c b/xlators/performance/md-cache/src/md-cache.c index 9ef599a..66c0bf3 100644 --- a/xlators/performance/md-cache/src/md-cache.c +++ b/xlators/performance/md-cache/src/md-cache.c @@ -1423,7 +1423,7 @@ mdc_fsetattr (call_frame_t *frame, xlator_t *this, fd_t *fd, local->fd = fd_ref (fd); - STACK_WIND (frame, mdc_setattr_cbk, + STACK_WIND (frame, mdc_fsetattr_cbk, FIRST_CHILD(this), FIRST_CHILD(this)->fops->fsetattr, fd, stbuf, valid, xdata); On Mon, Apr 30, 2012 at 10:31 PM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > > > I added printf, and inode is NULL in mdc_inode_pre() therefore this is > > not an uninitializd lock problem. > > Indeed, this this the mdc_local_t structure that seems uninitialized: > > (gdb) frame 3 > #3 0xbaa0ecb5 in mdc_setattr_cbk (frame=0xbb7e32a0, cookie=0xbb7a4380, > this=0xba3e3000, op_ret=0, op_errno=0, prebuf=0xb940a57c, > postbuf=0xb940a5e4, xdata=0x0) at md-cache.c:1365 > 1365 mdc_inode_iatt_set (this, local->loc.inode, postbuf); > > (gdb) print *(mdc_local_t *)frame->local > $6 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, > gfid = '\000' , pargfid = '\000' > }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, > parent = 0x0, gfid = '\000' , > pargfid = '\000' }, fd = 0xb8f9d230, > linkname = 0x0, xattr = 0x0} > > And indeed local->loc it is not initialized in mdc_fsetattr(). I suspect > there is away of obteining it from fd, but this is getting beyond by > knowledge of glusterfs internals. > > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kshlmster at gmail.com Wed May 2 05:35:02 2012 From: kshlmster at gmail.com (Kaushal M) Date: Wed, 2 May 2012 11:05:02 +0530 Subject: [Gluster-devel] 3.3 and address family In-Reply-To: References: <1kj84l9.19kzk6dfdsrtsM%manu@netbsd.org> Message-ID: Didn't send the last message to list. Resending. On Wed, May 2, 2012 at 10:58 AM, Kaushal M wrote: > Hi Emmanuel, > > Took a look at your patch for fixing this problem. It solves the it for > the brick glusterfsd processes. But glusterd also spawns and communicates > with nfs server & self-heal daemon processes. The proper xlator-option is > not set for these. This might be the cause. These processes are started in > glusterd_nodesvc_start() in glusterd-utils, which is where you could look > into. > > Thanks, > Kaushal > > On Fri, Apr 27, 2012 at 10:31 PM, Emmanuel Dreyfus wrote: > >> Hi >> >> I am still trying on 3.3.0qa39, and now I have an address family issue: >> gluserfs defaults to inet6 transport while the machine is not configured >> for IPv6. 
>>
>> I added option transport.address-family inet in glusterfs/glusterd.vol
>> and now glusterd starts with an IPv4 address, but unfortunately,
>> communication with spawned glusterfsd does not stick to the same address
>> family: I can see packets going from ::1.1023 to ::1.24007 and they are
>> rejected since I used transport.address-family inet.
>>
>> I need to tell glusterfs to use the same address family. I already did a
>> patch for exactly the same problem some time ago, this is not very
>> difficult, but it would save me some time if someone could tell me where
>> I should look in the code.
>>
>> --
>> Emmanuel Dreyfus
>> http://hcpnet.free.fr/pubz
>> manu at netbsd.org
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at nongnu.org
>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From manu at netbsd.org Wed May 2 09:30:32 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Wed, 2 May 2012 09:30:32 +0000
Subject: [Gluster-devel] qa39 crash
In-Reply-To:
References: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org> <1kjen07.9rmdeg1akdng7M%manu@netbsd.org>
Message-ID: <20120502093032.GI3677@homeworld.netbsd.org>

On Tue, May 01, 2012 at 10:29:22PM -0700, Anand Avati wrote:
> Can you confirm if this fixes (obvious bug) -

I do not crash anymore, but I spotted another bug; I do not know if it is
related: removing owner write access from a non-empty file open with write
access fails with EPERM.

Here is my test case. It works fine with glusterfs-3.2.6, but fchmod()
fails with EPERM on 3.3.0qa39:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <err.h>
#include <sysexits.h>

int
main(void)
{
        int fd;
        char buf[16];

        if ((fd = open("test.tmp", O_RDWR|O_CREAT, 0644)) == -1)
                err(EX_OSERR, "open failed");

        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                err(EX_OSERR, "write failed");

        if (fchmod(fd, 0444) == -1)
                err(EX_OSERR, "fchmod failed");

        if (close(fd) == -1)
                err(EX_OSERR, "close failed");

        return EX_OK;
}

--
Emmanuel Dreyfus
manu at netbsd.org

From xhernandez at datalab.es Wed May 2 10:55:37 2012
From: xhernandez at datalab.es (Xavier Hernandez)
Date: Wed, 02 May 2012 12:55:37 +0200
Subject: [Gluster-devel] Some questions about requisites of translators
Message-ID: <4FA112A9.1080101@datalab.es>

Hello,

I'm wondering if there are any requisites that translators must satisfy
to work correctly inside glusterfs.

In particular I need to know two things:

1. Are translators required to respect the order in which they receive
the requests ?

This is especially important in translators such as performance/io-threads
or caching ones. It seems that these translators can reorder requests. If
this is the case, is there any way to force some order between requests ?
Can inodelk/entrylk be used to force the order ?

2. Are translators required to propagate callback arguments even if the
result of the operation is an error ? And if an internal translator error
occurs ?

When a translator has multiple subvolumes, I've seen that some arguments,
such as xdata, are replaced with NULL. This can be understood, but are
regular translators (those that only have one subvolume) allowed to do
that or must they preserve the value of xdata, even in the case of an
internal error ?

If this is not a requisite, xdata loses its function of delivering back
extra information.
Thank you very much, Xavi From anand.avati at gmail.com Sat May 5 06:02:30 2012 From: anand.avati at gmail.com (Anand Avati) Date: Fri, 4 May 2012 23:02:30 -0700 Subject: [Gluster-devel] Some questions about requisites of translators In-Reply-To: <4FA112A9.1080101@datalab.es> References: <4FA112A9.1080101@datalab.es> Message-ID: On Wed, May 2, 2012 at 3:55 AM, Xavier Hernandez wrote: > Hello, > > I'm wondering if there are any requisites that translators must satisfy to > work correctly inside glusterfs. > > In particular I need to know two things: > > 1. Are translators required to respect the order in which they receive the > requests ? > > This is specially important in translators such as performance/io-threads > or caching ones. It seems that these translators can reorder requests. If > this is the case, is there any way to force some order between requests ? > can inodelk/entrylk be used to force the order ? > > Translators are not expected to maintain ordering of requests. The only translator which takes care of ordering calls is write-behind. After acknowledging back write requests it has to make sure future requests see the true "effect" as though the previous write actually completed. To that end, it queues future "dependent" requests till the write acknowledgement is received from the server. inodelk/entrylk calls help achieve synchronization among clients (by getting into a critical section) - just like a mutex. It is an arbitrator. It does not help for ordering of two calls. If one call must strictly complete after another call from your translator's point of view (i.e, if it has such a requirement), then the latter call's STACK_WIND must happen in the callback of the former's STACK_UNWIND path. There are no guarantees maintained by the system to ensure that a second STACK_WIND issued right after a first STACK_WIND will complete and callback in the same order. Write-behind does all its ordering gimmicks only because it STACK_UNWINDs a write call prematurely and therefore must maintain the causal effects by means of queueing new requests behind the downcall towards the server. > 2. Are translators required to propagate callback arguments even if the > result of the operation is an error ? and if an internal translator error > occurs ? > > Usually no. If op_ret is -1, only op_errno is expected to be a usable value. Rest of the callback parameters are junk. > When a translator has multiple subvolumes, I've seen that some arguments, > such as xdata, are replaced with NULL. This can be understood, but are > regular translators (those that only have one subvolume) allowed to do that > or must they preserve the value of xdata, even in the case of an internal > error ? > > It is best to preserve the arguments unless you know specifically what you are doing. In case of error, all the non-op_{ret,errno} arguments are typically junk, including xdata. > If this is not a requisite, xdata loses it's function of delivering back > extra information. > > Can you explain? Are you seeing a use case for having a valid xdata in the callback even with op_ret == -1? Thanks, Avati -------------- next part -------------- An HTML attachment was scrubbed... 
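Avati's ordering rule is easier to see in code than in prose. The following is a minimal sketch of a translator that wants a stat to be processed strictly after a setattr: the second STACK_WIND is issued only from the first call's callback. The xlator name "demo", its local structure and the use of a local_pool prepared in init () are assumptions for illustration; the fop prototypes follow the 3.3 signatures, and error/cleanup handling is abbreviated.

#include "xlator.h"

typedef struct {
        loc_t loc;                     /* kept so the callback can reuse it */
} demo_local_t;

int32_t
demo_stat_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
               int32_t op_ret, int32_t op_errno, struct iatt *buf,
               dict_t *xdata)
{
        /* Both calls have now completed, in order; answer the original
         * setattr (prebuf and local cleanup elided for brevity). */
        STACK_UNWIND_STRICT (setattr, frame, op_ret, op_errno, NULL, buf,
                             xdata);
        return 0;
}

int32_t
demo_setattr_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                  int32_t op_ret, int32_t op_errno, struct iatt *prebuf,
                  struct iatt *postbuf, dict_t *xdata)
{
        demo_local_t *local = frame->local;

        if (op_ret != 0) {
                /* On failure only op_errno is meaningful downstream. */
                STACK_UNWIND_STRICT (setattr, frame, op_ret, op_errno,
                                     prebuf, postbuf, xdata);
                return 0;
        }

        /* The dependent call is wound only here, in the callback, so it is
         * guaranteed to go down after the setattr has completed below us. */
        STACK_WIND (frame, demo_stat_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->stat,
                    &local->loc, xdata);
        return 0;
}

int32_t
demo_setattr (call_frame_t *frame, xlator_t *this, loc_t *loc,
              struct iatt *stbuf, int32_t valid, dict_t *xdata)
{
        demo_local_t *local = mem_get0 (this->local_pool); /* pool from init () */

        frame->local = local;
        loc_copy (&local->loc, loc);

        STACK_WIND (frame, demo_setattr_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->setattr,
                    loc, stbuf, valid, xdata);
        return 0;
}

Nothing else in the stack guarantees this ordering; winding both calls back-to-back from demo_setattr would allow them to complete in either order.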
URL: From tsato at valinux.co.jp Mon May 7 04:17:45 2012 From: tsato at valinux.co.jp (Tomoaki Sato) Date: Mon, 07 May 2012 13:17:45 +0900 Subject: [Gluster-devel] showmount reports many entries (Re: glusterfs-3.3.0qa39 released) In-Reply-To: <4F9A98E8.80400@gluster.com> References: <20120427053612.E08671804F5@build.gluster.com> <4F9A6422.3010000@valinux.co.jp> <4F9A98E8.80400@gluster.com> Message-ID: <4FA74CE9.8010805@valinux.co.jp> (2012/04/27 22:02), Vijay Bellur wrote: > On 04/27/2012 02:47 PM, Tomoaki Sato wrote: >> Vijay, >> >> I have been testing gluster-3.3.0qa39 NFS with 4 CentOS 6.2 NFS clients. >> The test set is like following: >> 1) All 4 clients mount 64 directories. (total 192 directories) >> 2) 192 procs runs on the 4 clients. each proc create a new unique file and write 1GB data to the file. (total 192GB) >> 3) All 4 clients umount 64 directories. >> >> The test finished successfully but showmount command reported many entries in spite of there were no NFS clients remain. >> Then I have restarted gluster related daemons. >> After restarting, showmount command reports no entries. >> Any insight into this is much appreciated. > > > http://review.gluster.com/2973 should fix this. Can you please confirm? > > > Thanks, > Vijay Vijay, I have confirmed that following instructions with c3a16c32. # showmount one Hosts on one: # mkdir /tmp/mnt # mount one:/one /tmp/mnt # showmount one Hosts on one: 172.17.200.108 # umount /tmp/mnt # showmount one Hosts on one: # And the test set has started running. It will take a couple of days to finish. by the way, I did following instructions to build RPM packages on a CentOS 5.6 x86_64 host. # yum install python-ctypes ncureses-devel readline-devel libibverbs-devel # git clone -b c3a16c32 ssh://@git.gluster.com/glusterfs.git glusterfs-3git # tar zcf /usr/src/redhat/SOURCES/glusterfs-3bit.tar.gz glusterfs-3git # rpmbuild -bb /usr/src/redhat/SOURCES/glusterfs-3git.tar.gz Thanks, Tomo Sato From manu at netbsd.org Mon May 7 04:39:22 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 7 May 2012 04:39:22 +0000 Subject: [Gluster-devel] Fixing Address family mess Message-ID: <20120507043922.GA10874@homeworld.netbsd.org> Hi Quick summary of the problem: when using transport-type socket with transport.address-family unspecified, glusterfs binds sockets with AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the kernel prefers. At mine it uses AF_INET6, while the machine is not configured to use IPv6. As a result, glusterfs client cannot connect to glusterfs server. A workaround is to use option transport.address-family inet in glusterfsd/glusterd.vol but that option must also be specified in all volume files for all bricks and FUSE client, which is unfortunate because they are automatically generated. I proposed a patch so that glusterd transport.address-family setting is propagated to various places: http://review.gluster.com/3261 That did not meet consensus. Jeff Darcy notes that we should be able to listen both on AF_INET and AF_INET6 sockets at the same time. I had a look at the code, and indeed it could easily be done. The only trouble is how to specify the listeners. For now option transport defaults to socket,rdma. I suggest we add socket families in that specification. 
We would then have this default: option transport socket/inet,socket/inet6,rdma With the following semantics: socket -> AF_UNSPEC socket (backward comaptibility) socket/inet -> AF_INET socket socket/inet6 -> AF_INET6 socket socket/sdp -> AF_SDP socket rdma -> sameas before Any opinion on that plan? Please comment before I writa code, it will save me some time is the proposal is wrong. -- Emmanuel Dreyfus manu at netbsd.org From xhernandez at datalab.es Mon May 7 08:07:52 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 07 May 2012 10:07:52 +0200 Subject: [Gluster-devel] Some questions about requisites of translators In-Reply-To: References: <4FA112A9.1080101@datalab.es> Message-ID: <4FA782D8.2000100@datalab.es> On 05/05/2012 08:02 AM, Anand Avati wrote: > > > On Wed, May 2, 2012 at 3:55 AM, Xavier Hernandez > > wrote: > > Hello, > > I'm wondering if there are any requisites that translators must > satisfy to work correctly inside glusterfs. > > In particular I need to know two things: > > 1. Are translators required to respect the order in which they > receive the requests ? > > This is specially important in translators such as > performance/io-threads or caching ones. It seems that these > translators can reorder requests. If this is the case, is there > any way to force some order between requests ? can inodelk/entrylk > be used to force the order ? > > > Translators are not expected to maintain ordering of requests. The > only translator which takes care of ordering calls is write-behind. > After acknowledging back write requests it has to make sure future > requests see the true "effect" as though the previous write actually > completed. To that end, it queues future "dependent" requests till the > write acknowledgement is received from the server. > > inodelk/entrylk calls help achieve synchronization among clients (by > getting into a critical section) - just like a mutex. It is an > arbitrator. It does not help for ordering of two calls. If one call > must strictly complete after another call from your translator's point > of view (i.e, if it has such a requirement), then the latter call's > STACK_WIND must happen in the callback of the former's STACK_UNWIND > path. There are no guarantees maintained by the system to ensure that > a second STACK_WIND issued right after a first STACK_WIND will > complete and callback in the same order. Write-behind does all its > ordering gimmicks only because it STACK_UNWINDs a write call > prematurely and therefore must maintain the causal effects by means of > queueing new requests behind the downcall towards the server. Good to know > 2. Are translators required to propagate callback arguments even > if the result of the operation is an error ? and if an internal > translator error occurs ? > > > Usually no. If op_ret is -1, only op_errno is expected to be a usable > value. Rest of the callback parameters are junk. > > When a translator has multiple subvolumes, I've seen that some > arguments, such as xdata, are replaced with NULL. This can be > understood, but are regular translators (those that only have one > subvolume) allowed to do that or must they preserve the value of > xdata, even in the case of an internal error ? > > > It is best to preserve the arguments unless you know specifically what > you are doing. In case of error, all the non-op_{ret,errno} arguments > are typically junk, including xdata. > > If this is not a requisite, xdata loses it's function of > delivering back extra information. > > > Can you explain? 
Are you seeing a use case for having a valid xdata in > the callback even with op_ret == -1? > As a part of a translator that I'm developing that works with multiple subvolumes, I need to implement some healing support to mantain data coherency (similar to AFR). After some thought, I decided that it could be advantageous to use a dedicated healing translator located near the bottom of the translators stack on the servers. This translator won't work by itself, it only adds support to be used by a higher level translator, which have to manage the logic of the healing and decide when a node needs to be healed. To do this, sometimes I need to return an error because an operation cannot be completed due to some condition related with healing itself (not with the underlying storage). However I need to send some specific healing information to let the upper translator know how it has to handle the detected condition. I cannot send a success answer because intermediate translators could take the fake data as valid and they could begin to operate incorrectly or even create inconsistencies. The other alternative is to use op_errno to encode the extra data, but this will also be difficult, even impossible in some cases, due to the amount of data and the complexity to combine it with an error code without mislead intermediate translators with strange or invalid error codes. I talked with John Mark about this translator and he suggested me to discuss it over the list. Therefore I'll initiate another thread to expose in more detail how it works and I would appreciate very much your opinion, and that of the other developers, about it. Especially if it can really be faster/safer that other solutions or not, or if you find any problem or have any suggestion to improve it. I think it could also be used by AFR and any future translator that may need some healing capabilities. Thank you very much, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From vijay at build.gluster.com Mon May 7 08:15:50 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Mon, 7 May 2012 01:15:50 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa40 released Message-ID: <20120507081553.5AA00100C5@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa40/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz This release is made off v3.3.0qa40 From vijay at gluster.com Mon May 7 10:31:09 2012 From: vijay at gluster.com (Vijay Bellur) Date: Mon, 07 May 2012 16:01:09 +0530 Subject: [Gluster-devel] glusterfs-3.3.0qa40 released In-Reply-To: <20120507081553.5AA00100C5@build.gluster.com> References: <20120507081553.5AA00100C5@build.gluster.com> Message-ID: <4FA7A46D.2050506@gluster.com> This release is done by reverting commit 7d0397c2144810c8a396e00187a6617873c94002 as replace-brick and quota were not functioning with that commit. Hence the tag for this qa release would not be available in github. If you are interested in creating an equivalent of this qa release from git, it would be c4dadc74fd1d1188f123eae7f2b6d6f5232e2a0f - commit 7d0397c2144810c8a396e00187a6617873c94002. 
Thanks, Vijay On 05/07/2012 01:45 PM, Vijay Bellur wrote: > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa40/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz > > This release is made off v3.3.0qa40 > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From jdarcy at redhat.com Mon May 7 13:16:38 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 09:16:38 -0400 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: <20120507043922.GA10874@homeworld.netbsd.org> References: <20120507043922.GA10874@homeworld.netbsd.org> Message-ID: <4FA7CB36.6040701@redhat.com> On 05/07/2012 12:39 AM, Emmanuel Dreyfus wrote: > Quick summary of the problem: when using transport-type socket with > transport.address-family unspecified, glusterfs binds sockets with > AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the > kernel prefers. At mine it uses AF_INET6, while the machine is not > configured to use IPv6. As a result, glusterfs client cannot connect > to glusterfs server. > > A workaround is to use option transport.address-family inet in > glusterfsd/glusterd.vol but that option must also be specified in > all volume files for all bricks and FUSE client, which is > unfortunate because they are automatically generated. I proposed a > patch so that glusterd transport.address-family setting is propagated > to various places: http://review.gluster.com/3261 > > That did not meet consensus. Jeff Darcy notes that we should be able > to listen both on AF_INET and AF_INET6 sockets at the same time. I > had a look at the code, and indeed it could easily be done. The only > trouble is how to specify the listeners. For now option transport > defaults to socket,rdma. I suggest we add socket families in that > specification. We would then have this default: > option transport socket/inet,socket/inet6,rdma > > With the following semantics: > socket -> AF_UNSPEC socket (backward comaptibility) > socket/inet -> AF_INET socket > socket/inet6 -> AF_INET6 socket > socket/sdp -> AF_SDP socket > rdma -> sameas before > > Any opinion on that plan? Please comment before I writa code, it will > save me some time is the proposal is wrong. I think it looks like the right solution. I understand that keeping the address-family multiplexing entirely in the socket code would be more complex, since it changes the relationship between transport instances and file descriptors (and threads in the SSL/multi-thread case). That's unfortunate, but far from the most unfortunate thing about our transport code. I do wonder whether we should use '/' as the separator, since it kind of implies the same kind of relationships between names and paths that we use for translator names - e.g. cluster/dht is actually used as part of the actual path for dht.so - and in this case that relationship doesn't actually exist. Another idea, which I don't actually like any better but which I'll suggest for completeness, would be to express the list of address families via an option: option transport.socket.address-family inet6 Now that I think about it, another benefit is that it supports multiple instances of the same address family with different options, e.g. to support segregated networks. Obviously we lack higher-level support for that right now, but if that should ever change then it would be nice to have the right low-level infrastructure in place for it. 
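For what it is worth, the "listen on both families at the same time" behaviour discussed in this thread falls out of getaddrinfo() fairly naturally: with ai_family left at AF_UNSPEC and AI_PASSIVE set, one listener can be created per returned address. The program below is a stand-alone sketch, not glusterfs code; the port 24007 and the "inet"/"inet6" names are taken from the thread, everything else is illustrative.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

static int
listen_family (const char *family)           /* "inet", "inet6" or NULL */
{
        struct addrinfo hints, *res, *ai;
        int fd, one = 1, count = 0;

        memset (&hints, 0, sizeof (hints));
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_flags = AI_PASSIVE;
        if (family == NULL)
                hints.ai_family = AF_UNSPEC;         /* both families */
        else if (strcmp (family, "inet") == 0)
                hints.ai_family = AF_INET;
        else if (strcmp (family, "inet6") == 0)
                hints.ai_family = AF_INET6;
        else
                return -1;

        if (getaddrinfo (NULL, "24007", &hints, &res) != 0)
                return -1;

        for (ai = res; ai != NULL; ai = ai->ai_next) {
                fd = socket (ai->ai_family, ai->ai_socktype, ai->ai_protocol);
                if (fd < 0)
                        continue;
                setsockopt (fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof (one));
#ifdef IPV6_V6ONLY
                if (ai->ai_family == AF_INET6)       /* keep the v6 socket v6-only */
                        setsockopt (fd, IPPROTO_IPV6, IPV6_V6ONLY,
                                    &one, sizeof (one));
#endif
                if (bind (fd, ai->ai_addr, ai->ai_addrlen) == 0 &&
                    listen (fd, 10) == 0) {
                        printf ("listening, family %d\n", ai->ai_family);
                        count++;
                } else {
                        close (fd);
                }
        }
        freeaddrinfo (res);
        return count;
}

int
main (void)
{
        return (listen_family (NULL) > 0) ? 0 : 1;
}

Mapping a transport.address-family setting (or a transport list such as socket/inet,socket/inet6) onto hints.ai_family is then a small step.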
From jdarcy at redhat.com Mon May 7 14:43:47 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 10:43:47 -0400 Subject: [Gluster-devel] ZkFarmer Message-ID: <4FA7DFA3.1030300@redhat.com> I've long felt that our ways of dealing with cluster membership and staging of config changes is not quite as robust and scalable as we might want. Accordingly, I spent a bit of time a couple of weeks ago looking into the possibility of using ZooKeeper to do some of this stuff. Yeah, it brings in a heavy Java dependency, but when I looked at some lighter-weight alternatives they all seemed to be lacking in more important ways. Basically the idea was to do this: * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or point everyone at an existing ZooKeeper cluster. * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" merely updates ZK, and "peer status" merely reads from it). * Store config information in ZK *once* instead of regenerating volfiles etc. on every node (and dealing with the ugly cases where a node was down when the config change happened). * Set watches on ZK nodes to be notified when config changes happen, and respond appropriately. I eventually ran out of time and moved on to other things, but this or something like it (e.g. using Riak Core) still seems like a better approach than what we have. In that context, it looks like ZkFarmer[1] might be a big help. AFAICT someone else was trying to solve almost exactly the same kind of server/config problem that we have, and wrapped their solution into a library. Is this a direction other devs might be interested in pursuing some day, if/when time allows? [1] https://github.com/rs/zkfarmer From johnmark at redhat.com Mon May 7 19:35:54 2012 From: johnmark at redhat.com (John Mark Walker) Date: Mon, 07 May 2012 15:35:54 -0400 (EDT) Subject: [Gluster-devel] Server outage - review.gluster.com - please stand by In-Reply-To: <5299ff98-4714-4702-8f26-0d6f62441fe3@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: Greetings, Our iWeb server, which hosts review.gluster.com, is currently down. I have filed an urgent request to reboot the server in question. If you notice anything else working poorly, aside from review.gluster.com, please let me know ASAP. I'll send a note when services are back to normal. -JM From ian.latter at midnightcode.org Mon May 7 22:17:41 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 08:17:41 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205072217.q47MHfmr003867@singularity.tronunltd.com> Is there anything written up on why you/all want every node to be completely conscious of every other node? I could see a couple of architectures that might work better (be more scalable) if the config minutiae were either not necessary to be shared or shared in only cases where the config minutiae were a dependency. RE ZK, I have an issue with it not being a binary at the linux distribution level. This is the reason I don't currently have Gluster's geo replication module in place .. ----- Original Message ----- >From: "Jeff Darcy" >To: >Subject: [Gluster-devel] ZkFarmer >Date: Mon, 07 May 2012 10:43:47 -0400 > > I've long felt that our ways of dealing with cluster membership and staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. 
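To make the ZooKeeper plan quoted above a little more concrete: the ZooKeeper C client (zookeeper.h) exposes the ephemeral-node and watch primitives directly, so the Java dependency would sit on the server quorum rather than necessarily in every daemon. The sketch below is illustrative only; the znode paths, the peer_uuid argument and the ensemble address are made up, and it assumes the parent znodes already exist.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <zookeeper/zookeeper.h>

static void
watcher (zhandle_t *zh, int type, int state, const char *path, void *ctx)
{
        /* A watched znode (or the session state) changed: this is where a
         * daemon would re-read the volume definition. */
        printf ("event type=%d state=%d path=%s\n", type, state,
                path ? path : "");
}

/* "peer probe" becomes: create an ephemeral node that disappears
 * automatically when this server's session dies. */
static int
register_peer (zhandle_t *zh, const char *peer_uuid)
{
        char path[256], created[256];

        snprintf (path, sizeof (path), "/gluster/peers/%s", peer_uuid);
        return zoo_create (zh, path, "up", 2, &ZOO_OPEN_ACL_UNSAFE,
                           ZOO_EPHEMERAL, created, sizeof (created));
}

int
main (void)
{
        struct Stat stat;
        zhandle_t *zh = zookeeper_init ("localhost:2181", watcher,
                                        30000, NULL, NULL, 0);
        if (zh == NULL)
                return 1;

        register_peer (zh, "0c9d1e2f-example-uuid");

        /* "Set watches on ZK nodes": ask to be notified when the volume
         * config znode changes. */
        zoo_wexists (zh, "/gluster/volumes/myvol", watcher, NULL, &stat);

        pause ();                 /* real code would run its event loop */
        zookeeper_close (zh);
        return 0;
}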
Yeah, it brings in a > heavy Java dependency, but when I looked at some lighter-weight alternatives > they all seemed to be lacking in more important ways. Basically the idea was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles etc. > on every node (and dealing with the ugly cases where a node was down when the > config change happened). > > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. In that context, it looks like ZkFarmer[1] might be a big > help. AFAICT someone else was trying to solve almost exactly the same kind of > server/config problem that we have, and wrapped their solution into a library. > Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From ab at gluster.com Mon May 7 22:55:22 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 15:55:22 -0700 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: <4FA7CB36.6040701@redhat.com> References: <20120507043922.GA10874@homeworld.netbsd.org> <4FA7CB36.6040701@redhat.com> Message-ID: On Mon, May 7, 2012 at 6:16 AM, Jeff Darcy wrote: > On 05/07/2012 12:39 AM, Emmanuel Dreyfus wrote: >> Quick summary of the problem: when using transport-type socket with >> transport.address-family unspecified, glusterfs binds sockets with >> AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the >> kernel prefers. At mine it uses AF_INET6, while the machine is not >> configured to use IPv6. As a result, glusterfs client cannot connect >> to glusterfs server. >> >> A workaround is to use option transport.address-family inet in >> glusterfsd/glusterd.vol but that option must also be specified in >> all volume files for all bricks and FUSE client, which is >> unfortunate because they are automatically generated. I proposed a >> patch so that glusterd transport.address-family setting is propagated >> to various places: http://review.gluster.com/3261 >> >> That did not meet consensus. Jeff Darcy notes that we should be able >> to listen both on AF_INET and AF_INET6 sockets at the same time. I >> had a look at the code, and indeed it could easily be done. The only >> trouble is how to specify the listeners. For now option transport >> defaults to socket,rdma. I suggest we add socket families in that >> specification. We would then have this default: >> ? ?option transport socket/inet,socket/inet6,rdma >> >> With the following semantics: >> ? ?socket -> AF_UNSPEC socket (backward comaptibility) >> ? ?socket/inet -> AF_INET socket >> ? ?socket/inet6 -> AF_INET6 socket >> ? ?socket/sdp -> AF_SDP socket >> ? ?rdma -> sameas before >> >> Any opinion on that plan? Please comment before I writa code, it will >> save me some time is the proposal is wrong. 
> > I think it looks like the right solution. I understand that keeping the > address-family multiplexing entirely in the socket code would be more complex, > since it changes the relationship between transport instances and file > descriptors (and threads in the SSL/multi-thread case). ?That's unfortunate, > but far from the most unfortunate thing about our transport code. > > I do wonder whether we should use '/' as the separator, since it kind of > implies the same kind of relationships between names and paths that we use for > translator names - e.g. cluster/dht is actually used as part of the actual path > for dht.so - and in this case that relationship doesn't actually exist. Another > idea, which I don't actually like any better but which I'll suggest for > completeness, would be to express the list of address families via an option: > > ? ? ? ?option transport.socket.address-family inet6 > > Now that I think about it, another benefit is that it supports multiple > instances of the same address family with different options, e.g. to support > segregated networks. ?Obviously we lack higher-level support for that right > now, but if that should ever change then it would be nice to have the right > low-level infrastructure in place for it. > Yes this should be controlled through volume options. "transport.address-family" is the right place to set it. Possible values are "inet, inet6, unix, inet-sdp". I would have named those user facing options as "ipv4, ipv6, sdp, all". If transport.address-family is not set. then if remote-host is set default to AF_INET (ipv4) if if transport.socket.connect-path is set default to AF_UNIX (unix) AF_UNSPEC is should be be taken as IPv4/IPv6. It is named appropriately. Default should be ipv4. I have not tested the patch. It is simply to explain how the changes should look like. I ignored legacy translators. When we implement concurrent support for multiple address-family (likely via mult-process model) we can worry about combinations. I agree. Combinations should look like "inet | inet6 | .." and not "inet / inet6 /.." -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: glusterfs-af-default-ipv4.diff Type: application/octet-stream Size: 9194 bytes Desc: not available URL: From jdarcy at redhat.com Tue May 8 00:43:31 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 20:43:31 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205072217.q47MHfmr003867@singularity.tronunltd.com> References: <201205072217.q47MHfmr003867@singularity.tronunltd.com> Message-ID: <4FA86C33.6020901@redhat.com> On 05/07/2012 06:17 PM, Ian Latter wrote: > Is there anything written up on why you/all want every > node to be completely conscious of every other node? > > I could see a couple of architectures that might work > better (be more scalable) if the config minutiae were > either not necessary to be shared or shared in only > cases where the config minutiae were a dependency. Well, these aren't exactly minutiae. Everything at file or directory level is fully distributed and will remain so. We're talking only about stuff at the volume or server level, which is very little data but very broad in scope. Trying to segregate that only adds complexity and subtracts convenience, compared to having it equally accessible to (or through) any server. 
> RE ZK, I have an issue with it not being a binary at > the linux distribution level. This is the reason I don't > currently have Gluster's geo replication module in > place .. What exactly is your objection to interpreted or JIT compiled languages? Performance? Security? It's an unusual position, to say the least. From glusterdevel at louiszuckerman.com Tue May 8 03:52:02 2012 From: glusterdevel at louiszuckerman.com (Louis Zuckerman) Date: Mon, 7 May 2012 23:52:02 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <4FA7DFA3.1030300@redhat.com> References: <4FA7DFA3.1030300@redhat.com> Message-ID: Here's another ZooKeeper management framework that may be useful. It's called Curator, developed by Netflix, and recently released as open source. It probably has a bit more inertia than ZkFarmer too. http://techblog.netflix.com/2011/11/introducing-curator-netflix-zookeeper.html https://github.com/Netflix/curator HTH -louis On Mon, May 7, 2012 at 10:43 AM, Jeff Darcy wrote: > I've long felt that our ways of dealing with cluster membership and > staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. Yeah, it brings > in a > heavy Java dependency, but when I looked at some lighter-weight > alternatives > they all seemed to be lacking in more important ways. Basically the idea > was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, > or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles > etc. > on every node (and dealing with the ugly cases where a node was down when > the > config change happened). > > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. In that context, it looks like ZkFarmer[1] might be a > big > help. AFAICT someone else was trying to solve almost exactly the same > kind of > server/config problem that we have, and wrapped their solution into a > library. > Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Tue May 8 04:27:24 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 14:27:24 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205080427.q484RO09004857@singularity.tronunltd.com> > > Is there anything written up on why you/all want every > > node to be completely conscious of every other node? > > > > I could see a couple of architectures that might work > > better (be more scalable) if the config minutiae were > > either not necessary to be shared or shared in only > > cases where the config minutiae were a dependency. > > Well, these aren't exactly minutiae. Everything at file or directory level is > fully distributed and will remain so. 
We're talking only about stuff at the > volume or server level, which is very little data but very broad in scope. > Trying to segregate that only adds complexity and subtracts convenience, > compared to having it equally accessible to (or through) any server. Sorry, I didn't have time this morning to add more detail. Note that my concern isn't bandwidth, its flexibility; the less knowledge needed the more I can do crazy things in user land, like running boxes in different data centres and randomly power things up and down, randomly re- address, randomly replace in-box hardware, load balance, NAT, etc. It makes a dynamic environment difficult to construct, for example, when Gluster rejects the same volume-id being presented to an existing cluster from a new GFID. But there's no need to go even that complicated, let me pull out an example of where shared knowledge may be unnecessary; The work that I was doing in Gluster (pre glusterd) drove out one primary "server" which fronted a Replicate volume of both its own Distribute volume and that of another server or two - themselves serving a single Distribute volume. So the client connected to one server for one volume and the rest was black box / magic (from the client's perspective - big fast storage in many locations); in that case it could be said that servers needed some shared knowledge, while the clients didn't. The equivalent configuration in a glusterd world (from my experiments) pushed all of the distribute knowledge out to the client and I haven't had a response as to how to add a replicate on distributed volumes in this model, so I've lost replicate. But in this world, the client must know about everything and the server is simply a set of served/presented disks (as volumes). In this glusterd world, then, why does any server need to know of any other server, if the clients are doing all of the heavy lifting? The additional consideration is where the server both consumes and presents, but this would be captured in the client side view. i.e. given where glusterd seems to be driving, this knowledge seems to be needed on the client side (within glusterfs, not glusterfsd). To my mind this breaks the gluster architecture that I read about 2009, but I need to stress that I didn't get a reply to the glusterd architecture question that I posted about a month ago; so I don't know if glusterd is currently limiting deployment options because; - there is an intention to drive the heavy lifting to the client (for example for performance reasons in big deployments), or; - there are known limitations in the existing bricks/ modules (for example moving files thru distribute), or; - there is ultimately (long term) more flexibility seen in this model (and we're at a midway point between pre glusterd and post so it doesn't feel that way yet), or; - there is an intent to drive out a particular market outcome or match an existing storage model (the gluster presentation was driving towards cloud, and maybe those vendors don't use server side implementations), etc. As I don't have a clear/big picture in my mind; if I'm not considering all of the impacts, then my apologies. > > RE ZK, I have an issue with it not being a binary at > > the linux distribution level. This is the reason I don't > > currently have Gluster's geo replication module in > > place .. > > What exactly is your objection to interpreted or JIT compiled languages? > Performance? Security? It's an unusual position, to say the least. > Specifically, primarily, space. 
Saturn builds GlusterFS capacity from a 48 Megabyte Linux distribution and adding many Megabytes of Perl and/or Python and/or PHP and/or Java for a single script is impractical. My secondary concern is licensing (specifically in the Java run-time environment case). Hadoop forced my hand; GNU's JRE/compiler wasn't up to the task of running Hadoop when I last looked at it (about 2 or 3 years ago now) - well, it could run a 2007 or so version but not current ones at that time - so now I work with Gluster .. Going back to ZkFarmer; Considering other architectures; it depends on how you slice and dice the problem as to how much external support you need; > I've long felt that our ways of dealing with cluster > membership and staging of config changes is not > quite as robust and scalable as we might want. By way of example; The openMosix kernel extensions maintained their own information exchange between cluster nodes; if a node (ip) was added via the /proc interface, it was "in" the cluster. Therefore cluster membership was the hand-off/interface. It could be as simple as a text list on each node, or it could be left to a user space daemon which could then gate cluster membership - this suited everyone with a small cluster. The native daemon (omdiscd) used multicast packets to find nodes and then stuff those IP's into the /proc interface - this suited everyone with a private/dedicated cluster. A colleague and I wrote a TCP variation to allow multi-site discovery with SSH public key exchanges and IPSEC tunnel establishment as part of the gating process - this suited those with a distributed/ part-time cluster. To ZooKeeper's point (http://zookeeper.apache.org/), the discovery protocol that we created was weak and I've since found a model/algorithm that allows for far more robust discovery. The point being that, depending on the final cluster architecture for gluster (i.e. all are nodes are peers and thus all are cluster members, nodes are client or server and both are cluster members, nodes are client or server and only clients [or servers] are cluster members, etc) there may be simpler cluster management options .. Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From ab at gluster.com Tue May 8 04:33:50 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 21:33:50 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <4FA7DFA3.1030300@redhat.com> References: <4FA7DFA3.1030300@redhat.com> Message-ID: On Mon, May 7, 2012 at 7:43 AM, Jeff Darcy wrote: > I've long felt that our ways of dealing with cluster membership and staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. ?Yeah, it brings in a > heavy Java dependency, but when I looked at some lighter-weight alternatives > they all seemed to be lacking in more important ways. ?Basically the idea was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles etc. > on every node (and dealing with the ugly cases where a node was down when the > config change happened). 
> > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. ?In that context, it looks like ZkFarmer[1] might be a big > help. ?AFAICT someone else was trying to solve almost exactly the same kind of > server/config problem that we have, and wrapped their solution into a library. > ?Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer Real issue is here is: GlusterFS is a fully distributed system. It is OK for config files to be in one place (centralized). It is easier to manage and backup. Avati still claims that making distributed copies are not a problem (volume operations are fast, versioned and checksumed). Also the code base for replicating 3 way or all-node is same. We all need to come to agreement on the demerits of replicating the volume spec on every node. If we are convinced to keep the config info in one place, ZK is certainly one a good idea. I personally hate Java dependency. I still struggle with Java dependencies for browser and clojure. I can digest that if we are going to adopt Java over Python for future external modules. Alternatively we can also look at creating a replicated meta system volume. What ever we adopt, we should keep dependencies and installation steps to the bare minimum and simple. -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein From ab at gluster.com Tue May 8 04:56:10 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 21:56:10 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205080427.q484RO09004857@singularity.tronunltd.com> References: <201205080427.q484RO09004857@singularity.tronunltd.com> Message-ID: On Mon, May 7, 2012 at 9:27 PM, Ian Latter wrote: > >> > Is there anything written up on why you/all want every >> > node to be completely conscious of every other node? >> > >> > I could see a couple of architectures that might work >> > better (be more scalable) if the config minutiae were >> > either not necessary to be shared or shared in only >> > cases where the config minutiae were a dependency. >> >> Well, these aren't exactly minutiae. ?Everything at file > or directory level is >> fully distributed and will remain so. ?We're talking only > about stuff at the >> volume or server level, which is very little data but very > broad in scope. >> Trying to segregate that only adds complexity and > subtracts convenience, >> compared to having it equally accessible to (or through) > any server. > > Sorry, I didn't have time this morning to add more detail. > > Note that my concern isn't bandwidth, its flexibility; the > less knowledge needed the more I can do crazy things > in user land, like running boxes in different data centres > and randomly power things up and down, randomly re- > address, randomly replace in-box hardware, load > balance, NAT, etc. ?It makes a dynamic environment > difficult to construct, for example, when Gluster rejects > the same volume-id being presented to an existing > cluster from a new GFID. 
> > But there's no need to go even that complicated, let > me pull out an example of where shared knowledge > may be unnecessary; > > The work that I was doing in Gluster (pre glusterd) drove > out one primary "server" which fronted a Replicate > volume of both its own Distribute volume and that of > another server or two - themselves serving a single > Distribute volume. ?So the client connected to one > server for one volume and the rest was black box / > magic (from the client's perspective - big fast storage > in many locations); in that case it could be said that > servers needed some shared knowledge, while the > clients didn't. > > The equivalent configuration in a glusterd world (from > my experiments) pushed all of the distribute knowledge > out to the client and I haven't had a response as to how > to add a replicate on distributed volumes in this model, > so I've lost replicate. ?But in this world, the client must > know about everything and the server is simply a set > of served/presented disks (as volumes). ?In this > glusterd world, then, why does any server need to > know of any other server, if the clients are doing all of > the heavy lifting? > > The additional consideration is where the server both > consumes and presents, but this would be captured in > the client side view. ?i.e. given where glusterd seems > to be driving, this knowledge seems to be needed on > the client side (within glusterfs, not glusterfsd). > > To my mind this breaks the gluster architecture that I > read about 2009, but I need to stress that I didn't get > a reply to the glusterd architecture question that I > posted about a month ago; ?so I don't know if glusterd > is currently limiting deployment options because; > ?- there is an intention to drive the heavy lifting to the > ? ?client (for example for performance reasons in big > ? ?deployments), or; > ?- there are known limitations in the existing bricks/ > ? ?modules (for example moving files thru distribute), > ? ?or; > ?- there is ultimately (long term) more flexibility seen > ? ?in this model (and we're at a midway point between > ? ?pre glusterd and post so it doesn't feel that way > ? ?yet), or; > ?- there is an intent to drive out a particular market > ? ?outcome or match an existing storage model (the > ? ?gluster presentation was driving towards cloud, > ? ?and maybe those vendors don't use server side > ? ?implementations), etc. > > As I don't have a clear/big picture in my mind; if I'm > not considering all of the impacts, then my apologies. > > >> > RE ZK, I have an issue with it not being a binary at >> > the linux distribution level. ?This is the reason I don't >> > currently have Gluster's geo replication module in >> > place .. >> >> What exactly is your objection to interpreted or JIT > compiled languages? >> Performance? ?Security? ?It's an unusual position, to say > the least. >> > > Specifically, primarily, space. ?Saturn builds GlusterFS > capacity from a 48 Megabyte Linux distribution and > adding many Megabytes of Perl and/or Python and/or > PHP and/or Java for a single script is impractical. > > My secondary concern is licensing (specifically in the > Java run-time environment case). ?Hadoop forced my > hand; GNU's JRE/compiler wasn't up to the task of > running Hadoop when I last looked at it (about 2 or 3 > years ago now) - well, it could run a 2007 or so > version but not current ones at that time - so now I > work with Gluster .. 
> > > > Going back to ZkFarmer; > > Considering other architectures; it depends on how > you slice and dice the problem as to how much > external support you need; > ?> I've long felt that our ways of dealing with cluster > ?> membership and staging of config changes is not > ?> quite as robust and scalable as we might want. > > By way of example; > ?The openMosix kernel extensions maintained their > own information exchange between cluster nodes; if > a node (ip) was added via the /proc interface, it was > "in" the cluster. ?Therefore cluster membership was > the hand-off/interface. > ?It could be as simple as a text list on each node, or > it could be left to a user space daemon which could > then gate cluster membership - this suited everyone > with a small cluster. > ?The native daemon (omdiscd) used multicast > packets to find nodes and then stuff those IP's into > the /proc interface - this suited everyone with a > private/dedicated cluster. > ?A colleague and I wrote a TCP variation to allow > multi-site discovery with SSH public key exchanges > and IPSEC tunnel establishment as part of the > gating process - this suited those with a distributed/ > part-time cluster. ?To ZooKeeper's point > (http://zookeeper.apache.org/), the discovery > protocol that we created was weak and I've since > found a model/algorithm that allows for far more > robust discovery. > > ?The point being that, depending on the final cluster > architecture for gluster (i.e. all are nodes are peers > and thus all are cluster members, nodes are client > or server and both are cluster members, nodes are > client or server and only clients [or servers] are > cluster members, etc) there may be simpler cluster > management options .. > > > Cheers, > Reason to keep the volume spec files on all servers is simply to be fully distributed. No one node or set of nodes should hold the cluster hostage. Code to keep them in sync over 2 nodes or 20 nodes is essentially the same. We are revisiting this situation now because we want to scale to 1000s of nodes potentially. Gluster CLI operations should not time out or slow down. If ZK requires proprietary JRE for stability, Java will be NO NO!. We may not need ZK at all. If we simply decide to centralize the config, GlusterFS has enough code to handle them. Again Avati will argue that it is exactly the same code as now. My point is to keep things simple as we scale. Even if the code base is same, we should still restrict it to N selected nodes. It is matter of adding config option. -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein From ian.latter at midnightcode.org Tue May 8 05:21:37 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 15:21:37 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205080521.q485Lb9d005117@singularity.tronunltd.com> > No one node or set of nodes should hold the > cluster hostage. Agreed - this is fundamental. > We are revisiting this situation now because we > want to scale to 1000s of nodes potentially. Good, I hate upper bounds on architectures :) Though I haven't tested my own implementation, I understand that one implementation of the discovery protocol that I've used, scaled to 20,000 hosts across three sites in two countries; this is the the type of robust outcome that can be manipulated at the macro scale - i.e. without manipulating per-node details. 
> Gluster CLI operations should not time out or > slow down. This is critical - not just the CLI but also the storage interface (in a redundant environment); infrastructure wears and fails, thus failing infrastructure should be regarded as the norm/ default. > If ZK requires proprietary JRE for stability, > Java will be NO NO!. *Fantastic* > My point is to keep things simple as we scale. I couldn't agree more. In that principle I ask that each dependency on cluster knowledge be considered carefully with a minimalist approach. -- Ian Latter Late night coder .. http://midnightcode.org/ From vbellur at redhat.com Tue May 8 09:15:13 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 08 May 2012 14:45:13 +0530 Subject: [Gluster-devel] Server outage - review.gluster.com - please stand by In-Reply-To: References: Message-ID: <4FA8E421.3090108@redhat.com> On 05/08/2012 01:05 AM, John Mark Walker wrote: > Greetings, > > Our iWeb server, which hosts review.gluster.com, is currently down. I have filed an urgent request to reboot the server in question. > > If you notice anything else working poorly, aside from review.gluster.com, please let me know ASAP. > > I'll send a note when services are back to normal. All services are back to normal. Please let us know if you notice any issue. Thanks, Vijay From xhernandez at datalab.es Tue May 8 09:34:35 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 08 May 2012 11:34:35 +0200 Subject: [Gluster-devel] A healing translator Message-ID: <4FA8E8AB.2040604@datalab.es> Hello developers, I would like to expose some ideas we are working on to create a new kind of translator that should be able to unify and simplify to some extent the healing procedures of complex translators. Currently, the only translator with complex healing capabilities that we are aware of is AFR. We are developing another translator that will also need healing capabilities, so we thought that it would be interesting to create a new translator able to handle the common part of the healing process and hence to simplify and avoid duplicated code in other translators. The basic idea of the new translator is to handle healing tasks nearer the storage translator on the server nodes instead to control everything from a translator on the client nodes. Of course the heal translator is not able to handle healing entirely by itself, it needs a client translator which will coordinate all tasks. The heal translator is intended to be used by translators that work with multiple subvolumes. I will try to explain how it works without entering into too much details. There is an important requisite for all client translators that use healing: they must have exactly the same list of subvolumes and in the same order. Currently, I think this is not a problem. The heal translator treats each file as an independent entity, and each one can be in 3 modes: 1. Normal mode This is the normal mode for a copy or fragment of a file when it is synchronized and consistent with the same file on other nodes (for example with other replicas. It is the client translator who decides if it is synchronized or not). 2. Healing mode This is the mode used when a client detects an inconsistency in the copy or fragment of the file stored on this node and initiates the healing procedures. 3. 
Provider mode (I don't like very much this name, though) This is the mode used by client translators when an inconsistency is detected in this file, but the copy or fragment stored in this node is considered good and it will be used as a source to repair the contents of this file on other nodes. Initially, when a file is created, it is set in normal mode. Client translators that make changes must guarantee that they send the modification requests in the same order to all the servers. This should be done using inodelk/entrylk. When a change is sent to a server, the client must include a bitmap mask of the clients to which the request is being sent. Normally this is a bitmap containing all the clients, however, when a server fails for some reason some bits will be cleared. The heal translator uses this bitmap to early detect failures on other nodes from the point of view of each client. When this condition is detected, the request is aborted with an error and the client is notified with the remaining list of valid nodes. If the client considers the request can be successfully server with the remaining list of nodes, it can resend the request with the updated bitmap. The heal translator also updates two file attributes for each change request to mantain the "version" of the data and metadata contents of the file. A similar task is currently made by AFR using xattrop. This would not be needed anymore, speeding write requests. The version of data and metadata is returned to the client for each read request, allowing it to detect inconsistent data. When a client detects an inconsistency, it initiates healing. First of all, it must lock the entry and inode (when necessary). Then, from the data collected from each node, it must decide which nodes have good data and which ones have bad data and hence need to be healed. There are two possible cases: 1. File is not a regular file In this case the reconstruction is very fast and requires few requests, so it is done while the file is locked. In this case, the heal translator does nothing relevant. 2. File is a regular file For regular files, the first step is to synchronize the metadata to the bad nodes, including the version information. Once this is done, the file is set in healing mode on bad nodes, and provider mode on good nodes. Then the entry and inode are unlocked. When a file is in provider mode, it works as in normal mode, but refuses to start another healing. Only one client can be healing a file. When a file is in healing mode, each normal write request from any client are handled as if the file were in normal mode, updating the version information and detecting possible inconsistencies with the bitmap. Additionally, the healing translator marks the written region of the file as "good". Each write request from the healing client intended to repair the file must be marked with a special flag. In this case, the area that wants to be written is filtered by the list of "good" ranges (if there are any intersection with a good range, it is removed from the request). The resulting set of ranges are propagated to the lower translator and added to the list of "good" ranges but the version information is not updated. Read requests are only served if the range requested is entirely contained into the "good" regions list. There are some additional details, but I think this is enough to have a general idea of its purpose and how it works. The main advantages of this translator are: 1. Avoid duplicated code in client translators 2. 
Simplify and unify healing methods in client translators 3. xattrop is not needed anymore in client translators to keep track of changes 4. Full file contents are repaired without locking the file 5. Better detection and prevention of some split brain situations as soon as possible I think it would be very useful. It seems to me that it works correctly in all situations, however I don't have all the experience that other developers have with the healing functions of AFR, so I will be happy to answer any question or suggestion to solve problems it may have or to improve it. What do you think about it ? Thank you, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From jdarcy at redhat.com Tue May 8 12:57:31 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Tue, 08 May 2012 08:57:31 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: <4FA9183B.5080708@redhat.com> On 05/08/2012 12:33 AM, Anand Babu Periasamy wrote: > Real issue is here is: GlusterFS is a fully distributed system. It is > OK for config files to be in one place (centralized). It is easier to > manage and backup. Avati still claims that making distributed copies > are not a problem (volume operations are fast, versioned and > checksumed). It's also grossly inefficient at 100-node scale. I'll also need some convincing before I believe that nodes which are down during a config change will catch up automatically and reliably in all cases. I think this is even more of an issue with membership than with config data. All-to-all pings are just not acceptable at 100-node or greater scale. We need something better, and more importantly designing cluster membership protocols is just not a business we should even be in. We shouldn't be devoting our own time to that when we can just use something designed by people who have that as their focus. > Also the code base for replicating 3 way or all-node is > same. We all need to come to agreement on the demerits of replicating > the volume spec on every node. It's somewhat similar to how we replicate data - we need enough copies to survive a certain number of anticipated failures. > If we are convinced to keep the config info in one place, ZK is > certainly one a good idea. I personally hate Java dependency. I still > struggle with Java dependencies for browser and clojure. I can digest > that if we are going to adopt Java over Python for future external > modules. Alternatively we can also look at creating a replicated meta > system volume. What ever we adopt, we should keep dependencies and > installation steps to the bare minimum and simple. I personally hate the Java dependency too. I'd much rather have something in C/Go/Python/Erlang but couldn't find anything that had the same (useful) feature set. I also considered the idea of storing config in a hand-crafted GlusterFS volume, using our own mechanisms for distributing/finding and replicating data. That's at least an area where we can claim some expertise. Such layering does create a few interesting issues, but nothing intractable. The big drawback is that it only solves the config-data problem; a solution which combines that with cluster membership is IMO preferable. The development drag of having to maintain that functionality ourselves, and hook every new feature into the not-very-convenient APIs that have predictably resulted, is considerable. 
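As a rough illustration of the membership/watch mechanics being discussed (a sketch only, not working glusterd code): the fragment below uses ZooKeeper's C client and assumes an ensemble reachable at localhost:2181 plus invented znode paths /gluster/members and /gluster/config/volumes. A "peer probe" becomes an ephemeral znode that disappears when the owning session dies, "peer status" is a child listing with a watch left armed, and a config change arrives as a watch event instead of being pushed to every server.

/* membership-sketch.c -- illustrative only; link with -lzookeeper_mt */
#include <stdio.h>
#include <unistd.h>
#include <zookeeper/zookeeper.h>

/* Called for session events and for events on watched znodes. */
static void notify (zhandle_t *zh, int type, int state,
                    const char *path, void *ctx)
{
        if (type == ZOO_CHILD_EVENT)
                printf ("membership changed under %s\n", path);
        else if (type == ZOO_CHANGED_EVENT)
                printf ("config changed: %s\n", path);
        /* a real daemon would re-read the data and re-arm the watch here */
}

int main (int argc, char **argv)
{
        char                  self[256];
        char                  cfg[8192];
        int                   cfglen = sizeof (cfg);
        int                   rc, i;
        struct String_vector  peers = {0, };
        zhandle_t            *zh;

        if (argc < 2)
                return 1;

        zh = zookeeper_init ("localhost:2181", notify, 30000, NULL, NULL, 0);
        if (!zh)
                return 1;

        /* "peer probe": register this node as an ephemeral znode. */
        snprintf (self, sizeof (self), "/gluster/members/%s", argv[1]);
        rc = zoo_create (zh, self, "up", 2, &ZOO_OPEN_ACL_UNSAFE,
                         ZOO_EPHEMERAL, NULL, 0);
        if (rc != ZOK && rc != ZNODEEXISTS)
                fprintf (stderr, "zoo_create: %d\n", rc);

        /* "peer status": list members, leaving a child watch armed. */
        if (zoo_wget_children (zh, "/gluster/members", notify, NULL,
                               &peers) == ZOK) {
                for (i = 0; i < peers.count; i++)
                        printf ("peer: %s\n", peers.data[i]);
                deallocate_String_vector (&peers);
        }

        /* Read the single stored config and watch it for changes. */
        zoo_wget (zh, "/gluster/config/volumes", notify, NULL,
                  cfg, &cfglen, NULL);

        pause ();               /* wait for watch callbacks */
        zookeeper_close (zh);
        return 0;
}

Whether the ensemble lives on the first N gluster servers or somewhere else is orthogonal to code like the above; the ephemeral-node semantics are what would replace the all-to-all pings.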
From jdarcy at redhat.com Tue May 8 12:42:19 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Tue, 08 May 2012 08:42:19 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205080427.q484RO09004857@singularity.tronunltd.com> References: <201205080427.q484RO09004857@singularity.tronunltd.com> Message-ID: <4FA914AB.8030209@redhat.com> On 05/08/2012 12:27 AM, Ian Latter wrote: > The equivalent configuration in a glusterd world (from > my experiments) pushed all of the distribute knowledge > out to the client and I haven't had a response as to how > to add a replicate on distributed volumes in this model, > so I've lost replicate. This doesn't seem to be a problem with replicate-first vs. distribute-first, but with client-side vs. server-side deployment of those translators. You *can* construct your own volfiles that do these things on the servers. It will work, but you won't get a lot of support for it. The issue here is that we have only a finite number of developers, and a near-infinite number of configurations. We can't properly qualify everything. One way we've tried to limit that space is by preferring distribute over replicate, because replicate does a better job of shielding distribute from brick failures than vice versa. Another is to deploy both on the clients, following the scalability rule of pushing effort to the most numerous components. The code can support other arrangements, but the people might not. BTW, a similar concern exists with respect to replication (i.e. AFR) across data centers. Performance is going to be bad, and there's not going to be much we can do about it. > But in this world, the client must > know about everything and the server is simply a set > of served/presented disks (as volumes). In this > glusterd world, then, why does any server need to > know of any other server, if the clients are doing all of > the heavy lifting? First, because config changes have to apply across servers. Second, because server machines often spin up client processes for things like repair or rebalance. From ian.latter at midnightcode.org Tue May 8 23:08:32 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 09:08:32 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205082308.q48N8WQg008425@singularity.tronunltd.com> > On 05/08/2012 12:27 AM, Ian Latter wrote: > > The equivalent configuration in a glusterd world (from > > my experiments) pushed all of the distribute knowledge > > out to the client and I haven't had a response as to how > > to add a replicate on distributed volumes in this model, > > so I've lost replicate. > > This doesn't seem to be a problem with replicate-first vs. distribute-first, > but with client-side vs. server-side deployment of those translators. You > *can* construct your own volfiles that do these things on the servers. It will > work, but you won't get a lot of support for it. The issue here is that we > have only a finite number of developers, and a near-infinite number of > configurations. We can't properly qualify everything. One way we've tried to > limit that space is by preferring distribute over replicate, because replicate > does a better job of shielding distribute from brick failures than vice versa. > Another is to deploy both on the clients, following the scalability rule of > pushing effort to the most numerous components. The code can support other > arrangements, but the people might not. 
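Concretely, the kind of hand-written server-side volfile being referred to would look roughly like the sketch below. Host names, brick names and the exported volume name are all made up, the remote servers are assumed to export plain storage/posix bricks under the names shown, and this is exactly the "it will work, but unsupported" arrangement described above, not a recommended configuration:

# replicate over two distribute sets, assembled by hand on one server (sketch)
volume brick-a1
    type protocol/client
    option transport-type tcp
    option remote-host server-a1
    option remote-subvolume export-a1
end-volume

volume brick-a2
    type protocol/client
    option transport-type tcp
    option remote-host server-a2
    option remote-subvolume export-a2
end-volume

volume brick-b1
    type protocol/client
    option transport-type tcp
    option remote-host server-b1
    option remote-subvolume export-b1
end-volume

volume brick-b2
    type protocol/client
    option transport-type tcp
    option remote-host server-b2
    option remote-subvolume export-b2
end-volume

volume dist-a
    type cluster/distribute
    subvolumes brick-a1 brick-a2
end-volume

volume dist-b
    type cluster/distribute
    subvolumes brick-b1 brick-b2
end-volume

volume mirror
    type cluster/replicate
    subvolumes dist-a dist-b
end-volume

volume export-mirror
    type protocol/server
    option transport-type tcp
    option auth.addr.mirror.allow *
    subvolumes mirror
end-volume

The caveat above still applies: the qualified layouts keep replicate and distribute on the clients, so a stack like this is outside what gets tested and supported.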
Sure, I have my own vol files that do (did) what I wanted and I was supporting myself (and users); the question (and the point) is what is the GlusterFS *intent*? I'll write an rsyncd wrapper myself, to run on top of Gluster, if the intent is not allow the configuration I'm after (arbitrary number of disks in one multi-host environment replicated to an arbitrary number of disks in another multi-host environment, where ideally each environment need not sum to the same data capacity, presented in a single contiguous consumable storage layer to an arbitrary number of unintelligent clients, that is as fault tolerant as I choose it to be including the ability to add and offline/online and remove storage as I so choose) .. or switch out the whole solution if Gluster is heading away from my needs. I just need to know what the direction is .. I may even be able to help get you there if you tell me :) > BTW, a similar concern exists with respect to replication (i.e. AFR) across > data centers. Performance is going to be bad, and there's not going to be much > we can do about it. Hmm .. that depends .. these sorts of statements need context/qualification (in bandwidth and latency terms). For example the last multi-site environment that I did architecture for was two DCs set 32kms apart with a redundant 20Gbps layer-2 (ethernet) stretch between them - latency was 1ms average, 2ms max (the fiber actually took a 70km path). Didn't run Gluster on it, but we did stretch a number things that "couldn't" be stretched. > > But in this world, the client must > > know about everything and the server is simply a set > > of served/presented disks (as volumes). In this > > glusterd world, then, why does any server need to > > know of any other server, if the clients are doing all of > > the heavy lifting? > > First, because config changes have to apply across servers. Second, because > server machines often spin up client processes for things like repair or > rebalance. Yep, but my reading is that the config's that the servers need are local - to make a disk a share (volume), and that as you've described the rest are "client processes" (even when on something built as a "server"), so if you catered for all clients then you'd be set? I.e. AFR now runs in the client? And I am sick of the word-wrap on this client .. I think you've finally convinced me to fix it ... what's normal these days - still 80 chars? -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Wed May 9 00:57:49 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 17:57:49 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205082308.q48N8WQg008425@singularity.tronunltd.com> References: <201205082308.q48N8WQg008425@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 4:08 PM, Ian Latter wrote: > > On 05/08/2012 12:27 AM, Ian Latter wrote: > > > The equivalent configuration in a glusterd world (from > > > my experiments) pushed all of the distribute knowledge > > > out to the client and I haven't had a response as to how > > > to add a replicate on distributed volumes in this model, > > > so I've lost replicate. > > > > This doesn't seem to be a problem with replicate-first vs. > distribute-first, > > but with client-side vs. server-side deployment of those > translators. You > > *can* construct your own volfiles that do these things on > the servers. It will > > work, but you won't get a lot of support for it. 
The > issue here is that we > > have only a finite number of developers, and a > near-infinite number of > > configurations. We can't properly qualify everything. > One way we've tried to > > limit that space is by preferring distribute over > replicate, because replicate > > does a better job of shielding distribute from brick > failures than vice versa. > > Another is to deploy both on the clients, following the > scalability rule of > > pushing effort to the most numerous components. The code > can support other > > arrangements, but the people might not. > > Sure, I have my own vol files that do (did) what I wanted > and I was supporting myself (and users); the question > (and the point) is what is the GlusterFS *intent*? The "intent" (more or less - I hate to use the word as it can imply a commitment to what I am about to say, but there isn't one) is to keep the bricks (server process) dumb and have the intelligence on the client side. This is a "rough goal". There are cases where replication on the server side is inevitable (in the case of NFS access) but we keep the software architecture undisturbed by running a client process on the server machine to achieve it. We do plan to support "replication on the server" in the future while still retaining the existing software architecture as much as possible. This is particularly useful in Hadoop environment where the jobs expect write performance of a single copy and expect copy to happen in the background. We have the proactive self-heal daemon running on the server machines now (which again is a client process which happens to be physically placed on the server) which gives us many interesting possibilities - i.e, with simple changes where we fool the client side replicate translator at the time of transaction initiation that only the closest server is up at that point of time and write to it alone, and have the proactive self-heal daemon perform the extra copies in the background. This would be consistent with other readers as they get directed to the "right" version of the file by inspecting the changelogs while the background replication is in progress. The intention of the above example is to give a general sense of how we want to evolve the architecture (i.e, the "intention" you were referring to) - keep the clients intelligent and servers dumb. If some intelligence needs to be built on the physical server, tackle it by loading a client process there (there are also "pathinfo xattr" kind of internal techniques to figure out locality of the clients in a generic way without bringing "server sidedness" into them in a harsh way) I'll > write an rsyncd wrapper myself, to run on top of Gluster, > if the intent is not allow the configuration I'm after > (arbitrary number of disks in one multi-host environment > replicated to an arbitrary number of disks in another > multi-host environment, where ideally each environment > need not sum to the same data capacity, presented in a > single contiguous consumable storage layer to an > arbitrary number of unintelligent clients, that is as fault > tolerant as I choose it to be including the ability to add > and offline/online and remove storage as I so choose) .. > or switch out the whole solution if Gluster is heading > away from my needs. I just need to know what the > direction is .. I may even be able to help get you there if > you tell me :) > > There are good and bad in both styles (distribute on top v/s replicate on top). Replicate on top gives you much better flexibility of configuration. 
Distribute on top is easier for us developers. As a user I would like replicate on top as well. But the problem today is that replicate (and self-heal) does not understand "partial failure" of its subvolumes. If one of the subvolume of replicate is a distribute, then today's replicate only understands complete failure of the distribute set or it assumes everything is completely fine. An example is self-healing of directory entries. If a file is "missing" in one subvolume because a distribute node is temporarily down, replicate has no clue why it is missing (or that it should keep away from attempting to self-heal). Along the same lines, it does not know that once a server is taken off from its distribute subvolume for good that it needs to start recreating missing files. The effort to fix this seems to be big enough to disturb the inertia of status quo. If this is fixed, we can definitely adopt a replicate-on-top mode in glusterd. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From abperiasamy at gmail.com Wed May 9 01:05:37 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Tue, 8 May 2012 18:05:37 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205082308.q48N8WQg008425@singularity.tronunltd.com> References: <201205082308.q48N8WQg008425@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 4:08 PM, Ian Latter wrote: >> On 05/08/2012 12:27 AM, Ian Latter wrote: >> > The equivalent configuration in a glusterd world (from >> > my experiments) pushed all of the distribute knowledge >> > out to the client and I haven't had a response as to how >> > to add a replicate on distributed volumes in this model, >> > so I've lost replicate. >> >> This doesn't seem to be a problem with replicate-first vs. > distribute-first, >> but with client-side vs. server-side deployment of those > translators. ?You >> *can* construct your own volfiles that do these things on > the servers. ?It will >> work, but you won't get a lot of support for it. ?The > issue here is that we >> have only a finite number of developers, and a > near-infinite number of >> configurations. ?We can't properly qualify everything. > One way we've tried to >> limit that space is by preferring distribute over > replicate, because replicate >> does a better job of shielding distribute from brick > failures than vice versa. >> Another is to deploy both on the clients, following the > scalability rule of >> pushing effort to the most numerous components. ?The code > can support other >> arrangements, but the people might not. > > Sure, I have my own vol files that do (did) what I wanted > and I was supporting myself (and users); the question > (and the point) is what is the GlusterFS *intent*? ?I'll > write an rsyncd wrapper myself, to run on top of Gluster, > if the intent is not allow the configuration I'm after > (arbitrary number of disks in one multi-host environment > replicated to an arbitrary number of disks in another > multi-host environment, where ideally each environment > need not sum to the same data capacity, presented in a > single contiguous consumable storage layer to an > arbitrary number of unintelligent clients, that is as fault > tolerant as I choose it to be including the ability to add > and offline/online and remove storage as I so choose) .. > or switch out the whole solution if Gluster is heading > away from my ?needs. ?I just need to know what the > direction is .. 
I may even be able to help get you there if > you tell me :) Rsync'ing the vol spec files is the simplest and elegant approach. It is how glusterfs originally handled config files. How ever elastic volume management (online volume management operations) requires synchronized online changes to volume spec files. This requires GlusterFS to manage volume specification files internally. That is why we brought glusterd in 3.1. Real question is: do we want to keep the volume spec files on all nodes (fully distributed) or few selected nodes. > >> BTW, a similar concern exists with respect to replication > (i.e. AFR) across >> data centers. ?Performance is going to be bad, and there's > not going to be much >> we can do about it. > > Hmm .. that depends .. these sorts of statements need > context/qualification (in bandwidth and latency terms). ?For > example the last multi-site environment that I did > architecture for was two DCs set 32kms apart with a > redundant 20Gbps layer-2 (ethernet) stretch between > them - latency was 1ms average, 2ms max (the fiber > actually took a 70km path). ?Didn't run Gluster on it, but > we did stretch a number things that "couldn't" be stretched. > > >> > But in this world, the client must >> > know about everything and the server is simply a set >> > of served/presented disks (as volumes). ?In this >> > glusterd world, then, why does any server need to >> > know of any other server, if the clients are doing all of >> > the heavy lifting? >> >> First, because config changes have to apply across > servers. ?Second, because >> server machines often spin up client processes for things > like repair or >> rebalance. > > Yep, but my reading is that the config's that the servers > need are local - to make a disk a share (volume), and > that as you've described the rest are "client processes" > (even when on something built as a "server"), so if you > catered for all clients then you'd be set? ?I.e. AFR now > runs in the client? > > > And I am sick of the word-wrap on this client .. I think > you've finally convinced me to fix it ... what's normal > these days - still 80 chars? I used to line-wrap (gnus and cool emacs extensions). It doesn't make sense to line wrap any more. Let the email client handle it depending on the screen size of the device (mobile / tablet / desktop). -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Wed May 9 01:33:50 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 18:33:50 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: On Mon, May 7, 2012 at 9:33 PM, Anand Babu Periasamy wrote: > On Mon, May 7, 2012 at 7:43 AM, Jeff Darcy wrote: > > I've long felt that our ways of dealing with cluster membership and > staging of > > config changes is not quite as robust and scalable as we might want. > > Accordingly, I spent a bit of time a couple of weeks ago looking into the > > possibility of using ZooKeeper to do some of this stuff. Yeah, it > brings in a > > heavy Java dependency, but when I looked at some lighter-weight > alternatives > > they all seemed to be lacking in more important ways. Basically the > idea was > > to do this: > > > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper > servers, or > > point everyone at an existing ZooKeeper cluster. 
> > > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer > probe" > > merely updates ZK, and "peer status" merely reads from it). > > > > * Store config information in ZK *once* instead of regenerating volfiles > etc. > > on every node (and dealing with the ugly cases where a node was down > when the > > config change happened). > > > > * Set watches on ZK nodes to be notified when config changes happen, and > > respond appropriately. > > > > I eventually ran out of time and moved on to other things, but this or > > something like it (e.g. using Riak Core) still seems like a better > approach > > than what we have. In that context, it looks like ZkFarmer[1] might be > a big > > help. AFAICT someone else was trying to solve almost exactly the same > kind of > > server/config problem that we have, and wrapped their solution into a > library. > > Is this a direction other devs might be interested in pursuing some day, > > if/when time allows? > > > > > > [1] https://github.com/rs/zkfarmer > > Real issue is here is: GlusterFS is a fully distributed system. It is > OK for config files to be in one place (centralized). It is easier to > manage and backup. Avati still claims that making distributed copies > are not a problem (volume operations are fast, versioned and > checksumed). Also the code base for replicating 3 way or all-node is > same. We all need to come to agreement on the demerits of replicating > the volume spec on every node. > My claim is somewhat similar to what you said literally, but slightly different in meaning. What I mean is, while it is true keeping multiple copies of the volfile is more expensive/resource consuming in theory, what is the breaking point in terms of number of servers where it begins to matter? There are trivial (low lying) enhancements which are possible (for e.g, store volfiles of a volume only on participating servers instead of all servers) which could address a class of concerns. There are clear advantages in having volfiles in all the participating nodes at least - it takes away dependency on order of booting of servers in your data centre. If volfiles are available locally you dont have to wait/retry for the "central servers" to come up first. Whether this is volfiles managed by glusterd, or "storage servers" of ZK, it is a big advantage to have the startup of a given server decoupled from the others (of course the coupling comes in at an operational level at the time of volume modifications, but that is much more acceptable). If the storage of volfiles on all servers really seems unnecessary, we should first come up with real hard numbers - number of servers v/s latency of volume operations and then figure out at what point it starts becoming unacceptably slow. Maybe a good solution is to just propagate the volfiles in the background while still retaining version info than introducing a more intrusive change? But we really need the numbers first. > > If we are convinced to keep the config info in one place, ZK is > certainly one a good idea. I personally hate Java dependency. I still > struggle with Java dependencies for browser and clojure. I can digest > that if we are going to adopt Java over Python for future external > modules. Alternatively we can also look at creating a replicated meta > system volume. What ever we adopt, we should keep dependencies and > installation steps to the bare minimum and simple. 
> > It is true other projects have figured out the problem of membership and configuration management and specialize at doing that. That is very good for the entire computing community as a whole. If there are components we can incorporate and build upon their work, that is very desirable. At the same time we also need to check what other baggage we inherit along with the specialized expertise we take on. One of the biggest strengths of Gluster has been its "lightweight"edness and lack of dependencies - which in turn has driven our adoption significantly which in turn results in higher feedback and bug reports etc. (i.e, it is not an isolated strength in itself). Enforcing a Java dependency down the throat of users who want a simple distributed filesystem (yes, the moment we stop thinking of gluster as a "simple" distributed filesystem - even though it may be an oxymoron technically, but I guess you know what I mean :) it's a slippery slope towards it becoming "yet another" distributed filesystem.) The simplicity is what "makes" gluster to a large extent what it is. This makes the developer's life miserable to a fair degree, but it anyways always is, one way or another ;) I am not against adopting external projects. There are good reasons many times to do so. If there are external projects which are "compatible in personality" with gluster and helps us avoid reinventing the wheel, we must definitely do so. If they are not compatible, I'm sure there are lessons and ideas we can adopt, if not code. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Wed May 9 04:18:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 9 May 2012 04:18:46 +0000 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: <20120509041846.GB18684@homeworld.netbsd.org> On Mon, May 07, 2012 at 09:33:50PM -0700, Anand Babu Periasamy wrote: > I personally hate Java dependency. Me too. I know Java programs are supposed to have decent performances, but my experiences had always been terrible. Please do not add a dependency on Java. -- Emmanuel Dreyfus manu at netbsd.org From manu at netbsd.org Wed May 9 04:41:47 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 9 May 2012 04:41:47 +0000 Subject: [Gluster-devel] glusterfs-3.3.0qa40 released In-Reply-To: <20120507081553.5AA00100C5@build.gluster.com> References: <20120507081553.5AA00100C5@build.gluster.com> Message-ID: <20120509044147.GC18684@homeworld.netbsd.org> On Mon, May 07, 2012 at 01:15:50AM -0700, Vijay Bellur wrote: > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz Hi There is a small issue with python: the machine that runs autoconf only has python 2.5 installed, and as a result, the generated configure script fails to detect an installed python 2.6 or higher. Here is an example at mine, where python 2.7 is installed: checking for a Python interpreter with version >= 2.4... none configure: error: no suitable Python interpreter found That can be fixed by patching configure, but it would be nice if gluster builds could contain the check with latest python. -- Emmanuel Dreyfus manu at netbsd.org From renqiang at 360buy.com Wed May 9 04:46:08 2012 From: renqiang at 360buy.com (=?gb2312?B?yM7Hvw==?=) Date: Wed, 9 May 2012 12:46:08 +0800 Subject: [Gluster-devel] How to repair a 1TB disk in 30 mins Message-ID: <000301cd2d9e$a6b07fc0$f4117f40$@com> Dear All: I have a question. 
When I have a large cluster, maybe more than 10PB data, if a file have 3 copies and each disk have 1TB capacity, So we need about 30,000 disks. All disks are very cheap and are easily damaged. We must repair a 1TB disk in 30 mins?As far as I know?in gluster architecture?all data in the damaged disk will be repaired to the new disk which is used to replace the damaged disk. As a result of the writing speed of disk, when we repair 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 mins? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Wed May 9 05:35:40 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 15:35:40 +1000 Subject: [Gluster-devel] automake Message-ID: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Hello, I have built a new module and I can't seem to get the changed makefiles to be built. I have not used "configure" in any of my projects and I'm not seeing an answer from my google searches. The error that I get is during the "make" where glusterfs-3.2.6/missing errors at line 52 "automake-1.9: command not found". This is a newer RedHat environment and it has automake 1.11 .. if I cp 1.11 to 1.9 I get other errors ... libtool is reporting that the automake version is 1.11.1. I believe that it is getting the 1.9 version from Gluster ... How do I get a new Makefile.am and Makefile.in to work in this structure? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From harsha at gluster.com Wed May 9 06:03:00 2012 From: harsha at gluster.com (Harshavardhana) Date: Tue, 8 May 2012 23:03:00 -0700 Subject: [Gluster-devel] automake In-Reply-To: <201205090535.q495Ze5E009996@singularity.tronunltd.com> References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: Ian, Please re-run the ./autogen.sh and use again. Make sure you have added entries in 'configure.ac' and 'Makefile.am' for the respective module name and directory. -Harsha On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > Hello, > > > ?I have built a new module and I can't seem to > get the changed makefiles to be built. ?I have not > used "configure" in any of my projects and I'm > not seeing an answer from my google searches. > > ?The error that I get is during the "make" where > glusterfs-3.2.6/missing errors at line 52 > "automake-1.9: command not found". > > ?This is a newer RedHat environment and it has > automake 1.11 .. if I cp 1.11 to 1.9 I get other > errors ... libtool is reporting that the automake > version is 1.11.1. ?I believe that it is getting the > 1.9 version from Gluster ... > > ?How do I get a new Makefile.am and Makefile.in > to work in this structure? > > > > Cheers, > > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Wed May 9 06:05:54 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 16:05:54 +1000 Subject: [Gluster-devel] automake Message-ID: <201205090605.q4965sPn010223@singularity.tronunltd.com> You're a champion. Thanks Harsha. ----- Original Message ----- >From: "Harshavardhana" >To: "Ian Latter" >Subject: Re: [Gluster-devel] automake >Date: Tue, 08 May 2012 23:03:00 -0700 > > Ian, > > Please re-run the ./autogen.sh and use again. 
> > Make sure you have added entries in 'configure.ac' and 'Makefile.am' > for the respective module name and directory. > > -Harsha > > On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > > Hello, > > > > > > ?I have built a new module and I can't seem to > > get the changed makefiles to be built. ?I have not > > used "configure" in any of my projects and I'm > > not seeing an answer from my google searches. > > > > ?The error that I get is during the "make" where > > glusterfs-3.2.6/missing errors at line 52 > > "automake-1.9: command not found". > > > > ?This is a newer RedHat environment and it has > > automake 1.11 .. if I cp 1.11 to 1.9 I get other > > errors ... libtool is reporting that the automake > > version is 1.11.1. ?I believe that it is getting the > > 1.9 version from Gluster ... > > > > ?How do I get a new Makefile.am and Makefile.in > > to work in this structure? > > > > > > > > Cheers, > > > > > > > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Wed May 9 06:08:41 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 23:08:41 -0700 Subject: [Gluster-devel] automake In-Reply-To: <201205090535.q495Ze5E009996@singularity.tronunltd.com> References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: You might want to read autobook for the general theory behind autotools. Here's a quick summary - aclocal prepares the running of autotools. autoheader prepares autotools to generate a config.h to be consumed by C code configure.ac is the "source" to discover the build system and accept user parameters autoconf converts configure.ac to configure Makefile.am is the "source" to define what is to be built and how. automake converts Makefile.am to Makefile.in till here everything is scripted in ./autogen.sh running configure creates Makefile out of Makefile.in now run make :) Avati On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > Hello, > > > I have built a new module and I can't seem to > get the changed makefiles to be built. I have not > used "configure" in any of my projects and I'm > not seeing an answer from my google searches. > > The error that I get is during the "make" where > glusterfs-3.2.6/missing errors at line 52 > "automake-1.9: command not found". > > This is a newer RedHat environment and it has > automake 1.11 .. if I cp 1.11 to 1.9 I get other > errors ... libtool is reporting that the automake > version is 1.11.1. I believe that it is getting the > 1.9 version from Gluster ... > > How do I get a new Makefile.am and Makefile.in > to work in this structure? > > > > Cheers, > > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From abperiasamy at gmail.com Wed May 9 07:21:35 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Wed, 9 May 2012 00:21:35 -0700 Subject: [Gluster-devel] automake In-Reply-To: References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 11:08 PM, Anand Avati wrote: > You might want to read autobook for the general theory behind autotools. > Here's a quick summary - > > aclocal prepares the running of autotools. > autoheader prepares autotools to generate a config.h to be consumed by C > code > configure.ac is the "source" to discover the build system and accept user > parameters > autoconf converts configure.ac to configure > Makefile.am is the "source" to define what is to be built and how. > automake converts Makefile.am to Makefile.in > > till here everything is scripted in ./autogen.sh > > running configure creates Makefile out of Makefile.in > > now run make :) > Best way to learn autotools is copy-paste-customize. In general, if you are starting a new project, Debian has a nice little tool called "autoproject". It will auto generate autoconf and automake files. Then you start customizing it. GNU project should really merge all these tools in to one simple coherent system. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From abperiasamy at gmail.com Wed May 9 07:54:43 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Wed, 9 May 2012 00:54:43 -0700 Subject: [Gluster-devel] How to repair a 1TB disk in 30 mins In-Reply-To: <000301cd2d9e$a6b07fc0$f4117f40$@com> References: <000301cd2d9e$a6b07fc0$f4117f40$@com> Message-ID: On Tue, May 8, 2012 at 9:46 PM, ?? wrote: > Dear All: > > ? I have a question. When I have a large cluster, maybe more than 10PB data, > if a file have 3 copies and each disk have 1TB capacity, So we need about > 30,000 disks. All disks are very cheap and are easily damaged. We must > ?repair a 1TB disk in 30 mins?As far as I know?in gluster architecture?all > data in the damaged disk will be repaired to the new disk which is used to > replace the damaged disk. As a result of the writing speed of disk, when we > repair 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 > mins? 5 hours is based on SATA 1TB disk copying at ~50MB/s across small and large files + folders. This means, you literally attached the disk to the system and manually transferring the data. I can't think of any other faster way to transfer data on 1TB 7200RPM SATA/SAS disks without bending space-time ;). Larger disks and RAID arrays only makes this worse. This is exactly why we implemented passive self-heal in the first place. GlusterFS heals files on demand (as they are accessed), so applications have least down time or disruption. There is plenty of time to heal the cold data in background. All we should care is minimal down time. Self-heal in 3.3 has some major improvements. It got significantly faster, because healing is performed on the server side entirely (server to server). It can perform granular healing on large files (previously checksum operations used to pause or timeout the VMs). Active-healing (Replicate now remembers pending files and heals them when the failed node comes back. Previously you have to perform name-space wide recursive directory listing). Most importantly self-healing is no longer a blackbox. heal-info can show pending and currently-healing files. 
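For reference, the arithmetic behind those figures, assuming the ~50 MB/s sustained rate quoted above and decimal units:

    1 TB / 50 MB/s  =  10^12 B / (5 x 10^7 B/s)  =  20,000 s  ~  5.6 hours
    1 TB / 30 min   =  10^12 B / 1,800 s         ~  560 MB/s sustained

So a full re-copy of a 1 TB drive in 30 minutes is simply beyond what a single 7200 RPM spindle can absorb, which is the point being made here: minimise disruption while the rebuild runs, rather than try to shrink the rebuild itself.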
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From renqiang at 360buy.com Wed May 9 09:29:34 2012 From: renqiang at 360buy.com (=?utf-8?B?5Lu75by6?=) Date: Wed, 9 May 2012 17:29:34 +0800 Subject: [Gluster-devel] =?utf-8?b?562U5aSNOiAgSG93IHRvIHJlcGFpciBhIDFU?= =?utf-8?q?B_disk_in_30_mins?= In-Reply-To: References: <000301cd2d9e$a6b07fc0$f4117f40$@com> Message-ID: <002601cd2dc6$3f68f4f0$be3aded0$@com> Thank you very much? And I have some questions? 1?What's the capacity of the largest cluster online ?And how many nodes in it? And What is it used for? 2?When we excute 'ls' in a directory,it's very slow,if the cluster has too many bricks and too many nodes.Can we do it well? -----????----- ???: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] ????: 2012?5?9? 15:55 ???: renqiang ??: gluster-devel at nongnu.org ??: Re: [Gluster-devel] How to repair a 1TB disk in 30 mins On Tue, May 8, 2012 at 9:46 PM, ?? wrote: > Dear All: > > I have a question. When I have a large cluster, maybe more than 10PB data, > if a file have 3 copies and each disk have 1TB capacity, So we need about > 30,000 disks. All disks are very cheap and are easily damaged. We must > repair a 1TB disk in 30 mins?As far as I know?in gluster architecture?all > data in the damaged disk will be repaired to the new disk which is used to > replace the damaged disk. As a result of the writing speed of disk, when we > repair 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 > mins? 5 hours is based on SATA 1TB disk copying at ~50MB/s across small and large files + folders. This means, you literally attached the disk to the system and manually transferring the data. I can't think of any other faster way to transfer data on 1TB 7200RPM SATA/SAS disks without bending space-time ;). Larger disks and RAID arrays only makes this worse. This is exactly why we implemented passive self-heal in the first place. GlusterFS heals files on demand (as they are accessed), so applications have least down time or disruption. There is plenty of time to heal the cold data in background. All we should care is minimal down time. Self-heal in 3.3 has some major improvements. It got significantly faster, because healing is performed on the server side entirely (server to server). It can perform granular healing on large files (previously checksum operations used to pause or timeout the VMs). Active-healing (Replicate now remembers pending files and heals them when the failed node comes back. Previously you have to perform name-space wide recursive directory listing). Most importantly self-healing is no longer a blackbox. heal-info can show pending and currently-healing files. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From ian.latter at midnightcode.org Thu May 10 05:47:06 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Thu, 10 May 2012 15:47:06 +1000 Subject: [Gluster-devel] Hide Feature Message-ID: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> Hello, I have published an untested "hide" module (compiled against glusterfs-3.2.6); A simple method for hiding an underlying directory structure from parent/up-stream bricks within GlusterFS. In 2012 this code was spawned from my incomplete 2009 dedupe brick code which used this method to protect its internal hash database from the user, above. 
http://midnightcode.org/projects/saturn/code/hide-0.5.tgz I am serious when I mean untested - I've not even loaded the module under Gluster, it simply compiles. Let me know if there are tweaks that should be made or considered. Enjoy. -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 05:55:55 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Thu, 10 May 2012 15:55:55 +1000 Subject: [Gluster-devel] Fuse operations Message-ID: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> Hello, I published the Hide module in order to open a discussion around Fuse operations; http://fuse.sourceforge.net/doxygen/structfuse__operations.html In the dedupe module I want to secure the hash database from direct parent/use manipulation. The approach that I took was to find every GlusterFS file operation (fop) that took a loc_t parameter (as discovered via every xlator that is included in the tarball), in order to do path matching and then pass-through the call or return an error. The problem is that I can't find GlusterFS examples for all of the Fuse operators and, when I stray from the examples (like getattr and utiments), gluster tells me that there are no such xlator fops (at compile time - from the wind and unwind macros). So, I guess; 1) Are all Fuse/FS ops handled by Gluster? 2) Where can I find a complete list of the Gluster fops, and not just those that have been used in existing modules? 3) Is it safe to path match on loc_t? (i.e. is it fully resolved such that I won't find /etc/././././passwd)? This I could test .. Thanks, -- Ian Latter Late night coder .. http://midnightcode.org/ From jdarcy at redhat.com Thu May 10 13:39:21 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 10 May 2012 09:39:21 -0400 Subject: [Gluster-devel] Hide Feature In-Reply-To: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> References: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> Message-ID: <20120510093921.4a9f581a@jdarcy-dt.usersys.redhat.com> On Thu, 10 May 2012 15:47:06 +1000 "Ian Latter" wrote: > I have published an untested "hide" module (compiled > against glusterfs-3.2.6); > > A simple method for hiding an underlying directory > structure from parent/up-stream bricks within > GlusterFS. In 2012 this code was spawned from > my incomplete 2009 dedupe brick code which used > this method to protect its internal hash database > from the user, above. > > http://midnightcode.org/projects/saturn/code/hide-0.5.tgz > > > I am serious when I mean untested - I've not even > loaded the module under Gluster, it simply compiles. > > > Let me know if there are tweaks that should be made > or considered. A couple of comments: * It should be sufficient to fail lookup for paths that match your pattern. If that fails, the caller will never get to any others. You can use the quota translator as an example for something like this. * If you want to continue supporting this yourself, then you can just leave the code as it is, though in that case you'll want to consider building it "out of tree" as I describe in my "Translator 101" post[1] or do for some of my own translators[2]. Otherwise you'll need to submit it as a patch through Gerrit according to our standard workflow[3]. You'll also need to fix some of the idiosyncratic indentation. I don't remember the current policy wrt copyright assignment, but that might be required too. 
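Jeff's first point - failing lookup for paths that match the pattern - would look something like the sketch below. This is untested and written from memory against the 3.2-era translator API; the reserved path and function name are invented, so treat it as an outline rather than the actual hide/dedupe code:

#include <string.h>
#include <errno.h>

#include "xlator.h"
#include "defaults.h"

/* Refuse lookups on anything under a reserved directory so upper
 * layers never see it; everything else passes straight through
 * using the stock callback from defaults.c. */
int32_t
hide_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc,
             dict_t *xattr_req)
{
        if (loc && loc->path && strstr (loc->path, "/.hide-db")) {
                STACK_UNWIND_STRICT (lookup, frame, -1, ENOENT,
                                     NULL, NULL, NULL, NULL);
                return 0;
        }

        STACK_WIND (frame, default_lookup_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->lookup,
                    loc, xattr_req);
        return 0;
}

If lookup on the reserved path never succeeds, the kernel never gets an inode for it, so open/stat/unlink and friends on that path fail before they reach any of the translator's other fops - which is the point Jeff is making.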
[1] http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/ [2] https://github.com/jdarcy/negative-lookup [3] http://www.gluster.org/community/documentation/index.php/Development_Work_Flow From jdarcy at redhat.com Thu May 10 13:58:51 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 10 May 2012 09:58:51 -0400 Subject: [Gluster-devel] Fuse operations In-Reply-To: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> References: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> Message-ID: <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> On Thu, 10 May 2012 15:55:55 +1000 "Ian Latter" wrote: > So, I guess; > 1) Are all Fuse/FS ops handled by Gluster? > 2) Where can I find a complete list of the > Gluster fops, and not just those that have > been used in existing modules? GlusterFS operations for a translator are all defined in an xlator_fops structure. When building translators, it can also be convenient to look at the default_xxx and default_xxx_cbk functions for each fop you implement. Also, I forgot to mention in my comments on your "hide" translator that you can often use the default_xxx_cbk callback when you call STACK_WIND, instead of having to define your own trivial one. FUSE operations are listed by the fuse_opcode enum. You can check for yourself how closely this matches our list. They do have a few ops of their own, we have a few of their own, and a few of theirs actually map to our xlator_cbks instead of xlator_fops. The points of non-correspondence seem to be interrupt, bmap, poll and ioctl. Maybe Csaba can elaborate on what we do (or plan to do) about these. > 3) Is it safe to path match on loc_t? (i.e. is > it fully resolved such that I won't find > /etc/././././passwd)? This I could test .. Name/path resolution is an area that has changed pretty recently, so I'll let Avati or Amar field that one. From anand.avati at gmail.com Thu May 10 19:36:26 2012 From: anand.avati at gmail.com (Anand Avati) Date: Thu, 10 May 2012 12:36:26 -0700 Subject: [Gluster-devel] Fuse operations In-Reply-To: <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> References: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> Message-ID: On Thu, May 10, 2012 at 6:58 AM, Jeff Darcy wrote: > On Thu, 10 May 2012 15:55:55 +1000 > "Ian Latter" wrote: > > > So, I guess; > > 1) Are all Fuse/FS ops handled by Gluster? > > 2) Where can I find a complete list of the > > Gluster fops, and not just those that have > > been used in existing modules? > > GlusterFS operations for a translator are all defined in an xlator_fops > structure. When building translators, it can also be convenient to > look at the default_xxx and default_xxx_cbk functions for each fop you > implement. Also, I forgot to mention in my comments on your "hide" > translator that you can often use the default_xxx_cbk callback when you > call STACK_WIND, instead of having to define your own trivial one. > > FUSE operations are listed by the fuse_opcode enum. You can check for > yourself how closely this matches our list. They do have a few ops of > their own, we have a few of their own, and a few of theirs actually map > to our xlator_cbks instead of xlator_fops. The points of > non-correspondence seem to be interrupt, bmap, poll and ioctl. Maybe > Csaba can elaborate on what we do (or plan to do) about these. > > We might support interrupt sometime. Bmap - probably never. Poll, maybe. Ioctl - depeneds on what type of ioctl and requirement. 
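For completeness, the fop table Jeff mentions is just a struct at the bottom of the translator source file; any entry left NULL is filled in with the default_* pass-through when the xlator is loaded. A hypothetical table for the lookup sketch above:

struct xlator_fops fops = {
        .lookup = hide_lookup,   /* from the sketch above */
};

struct xlator_cbks cbks = {
};

struct volume_options options[] = {
        { .key = {NULL} },
};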
> > 3) Is it safe to path match on loc_t? (i.e. is > > it fully resolved such that I won't find > > /etc/././././passwd)? This I could test .. > > Name/path resolution is an area that has changed pretty recently, so > I'll let Avati or Amar field that one. > The ".." interpretation is done by the client side VFS. Internal path construction does not use ".." and are always normalized. There are new situations where we now support non-absolute paths, but those are for GFID based addressing and ".." does not come into picture there. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From johnmark at redhat.com Thu May 10 21:41:08 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 10 May 2012 17:41:08 -0400 (EDT) Subject: [Gluster-devel] Bugzilla upgrade & planned outage - May 22 In-Reply-To: Message-ID: Pasting an email from bugzilla-announce: Red Hat Bugzilla (bugzilla.redhat.com) will be unavailable on May 22nd starting at 6 p.m. EDT [2200 UTC] to perform an upgrade from Bugzilla 3.6 to Bugzilla 4.2. We are hoping to be complete in no more than 3 hours barring any problems. Any services relying on bugzilla.redhat.com may not work properly during this time. Please be aware in case you need use of those services during the outage. Also *PLEASE* make sure any scripts or other external applications that rely on bugzilla.redhat.com are tested against our test server before the upgrade if you have not done so already. Let the Bugzilla Team know immediately of any issues found by reporting the bug in bugzilla.redhat.com against the Bugzilla product, version 4.2. A summary of the RPC changes is also included below. RPC changes from upstream Bugzilla 4.2: - Bug.* returns arrays for components, versions and aliases - Bug.* returns target_release array - Bug.* returns flag information (from Bugzilla 4.4) - Bug.search supports searching on keywords, dependancies, blocks - Bug.search supports quick searches, saved searches and advanced searches - Group.get has been added - Component.* and Flag.* have been added - Product.get has a component_names option to return just the component names. RPC changes from Red Hat Bugzilla 3.6: - This list may be incomplete. - This list excludes upstream changes from 3.6 that we inherited - Bug.update calls may use different column names. For example, in 3.6 you updated the 'short_desc' key if you wanted to change the summary. Now you must use the 'summary' key. This may be an inconeniance, but will make it much more maintainable in the long run. - Bug.search_new new becomes Bug.search. The 3.6 version of Bug.search is no longer available. - Product.* has been changed to match upstream code - Group.create has been added - RedHat.* and bugzilla.* calls that mirror official RPC calls are officially depreciated, and will be removed approximately two months after Red Hat Bugzilla 4.2 is released. To test against the new beta Bugzilla server, go to https://partner-bugzilla.redhat.com/ Thanks, JM From ian.latter at midnightcode.org Thu May 10 22:25:02 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:25:02 +1000 Subject: [Gluster-devel] automake Message-ID: <201205102225.q4AMP2X2018428@singularity.tronunltd.com> Thanks Avati, Yes, when I said that I hadn't use "configure" I meant "autotools" (though I didn't know it :) I think almost every project I download and build from scratch uses configure .. the last time I looked at the autotools was a few years ago now, maybe its time for a re-look .. 
my libraries are getting big enough to warrant it I suppose. Hadn't seen autogen before .. thanks for your help. Cheers, ----- Original Message ----- >From: "Anand Avati" >To: "Ian Latter" >Subject: Re: [Gluster-devel] automake >Date: Tue, 08 May 2012 23:08:41 -0700 > > You might want to read autobook for the general theory behind autotools. > Here's a quick summary - > > aclocal prepares the running of autotools. > autoheader prepares autotools to generate a config.h to be consumed by C > code > configure.ac is the "source" to discover the build system and accept user > parameters > autoconf converts configure.ac to configure > Makefile.am is the "source" to define what is to be built and how. > automake converts Makefile.am to Makefile.in > > till here everything is scripted in ./autogen.sh > > running configure creates Makefile out of Makefile.in > > now run make :) > > Avati > > On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > > > Hello, > > > > > > I have built a new module and I can't seem to > > get the changed makefiles to be built. I have not > > used "configure" in any of my projects and I'm > > not seeing an answer from my google searches. > > > > The error that I get is during the "make" where > > glusterfs-3.2.6/missing errors at line 52 > > "automake-1.9: command not found". > > > > This is a newer RedHat environment and it has > > automake 1.11 .. if I cp 1.11 to 1.9 I get other > > errors ... libtool is reporting that the automake > > version is 1.11.1. I believe that it is getting the > > 1.9 version from Gluster ... > > > > How do I get a new Makefile.am and Makefile.in > > to work in this structure? > > > > > > > > Cheers, > > > > > > > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > > -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 22:26:22 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:26:22 +1000 Subject: [Gluster-devel] automake Message-ID: <201205102226.q4AMQMEC018461@singularity.tronunltd.com> > > You might want to read autobook for the general theory behind autotools. > > Here's a quick summary - > > > > aclocal prepares the running of autotools. > > autoheader prepares autotools to generate a config.h to be consumed by C > > code > > configure.ac is the "source" to discover the build system and accept user > > parameters > > autoconf converts configure.ac to configure > > Makefile.am is the "source" to define what is to be built and how. > > automake converts Makefile.am to Makefile.in > > > > till here everything is scripted in ./autogen.sh > > > > running configure creates Makefile out of Makefile.in > > > > now run make :) > > > > Best way to learn autotools is copy-paste-customize. In general, if > you are starting a new project, Debian has a nice little tool called > "autoproject". It will auto generate autoconf and automake files. Then > you start customizing it. > > GNU project should really merge all these tools in to one simple > coherent system. My build environment is Fedora but I'm assuming its there too .. if I get some time I'll have a poke around .. Thanks for the info, appreciate it. -- Ian Latter Late night coder .. 
http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 22:44:32 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:44:32 +1000 Subject: [Gluster-devel] Hide Feature Message-ID: <201205102244.q4AMiW2Z018543@singularity.tronunltd.com> Sorry for the re-send Jeff, I managed to screw up the CC so the list didn't get it; > > Let me know if there are tweaks that should be made > > or considered. > > A couple of comments: > > * It should be sufficient to fail lookup for paths that > match your pattern. If that fails, the caller will > never get to any others. You can use the quota > translator as an example for something like this. Ok, this is interesting. So if someone calls another fop .. say "open" ... against my brick/module, something (Fuse?) will make another, dependent, call to lookup first? If that's true then I can cut this all down to size. > * If you want to continue supporting this yourself, > then you can just leave the code as it is, though in > that case you'll want to consider building it "out of > tree" as I describe in my "Translator 101" post[1] > or do for some of my own translators[2]. > Otherwise you'll need to submit it as a patch > through Gerrit according to our standard workflow[3]. Thanks for the Translator articles/posts, I hadn't seen those. Per my previous patches, I'll publish code on my site under the GPL and you guys (Gluster/RedHat) can run them through whatever processes you choose. If it gets included in the GlusterFS package, then that's fine. If it gets ignored by the GlusterFS package, then that's fine also. > You'll also need to fix some of the idiosyncratic > indentation. I don't remember the current policy wrt > copyright assignment, but that might be required too. The weird indentation style used is not mine .. its what I gathered from the Gluster code that I read through. > [1] > http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/ > > [2] https://github.com/jdarcy/negative-lookup > > [3] > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 23:39:58 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 09:39:58 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205102339.q4ANdwg8018739@singularity.tronunltd.com> > > Sure, I have my own vol files that do (did) what I wanted > > and I was supporting myself (and users); the question > > (and the point) is what is the GlusterFS *intent*? > > > The "intent" (more or less - I hate to use the word as it can imply a > commitment to what I am about to say, but there isn't one) is to keep the > bricks (server process) dumb and have the intelligence on the client side. > This is a "rough goal". There are cases where replication on the server > side is inevitable (in the case of NFS access) but we keep the software > architecture undisturbed by running a client process on the server machine > to achieve it. [There's a difference between intent and plan/roadmap] Okay. Unfortunately I am unable to leverage this - I tried to serve a Fuse->GlusterFS client mount point (of a Distribute volume) as a GlusterFS posix brick (for a Replicate volume) and it wouldn't play ball .. > We do plan to support "replication on the server" in the future while still > retaining the existing software architecture as much as possible. 
This is > particularly useful in Hadoop environment where the jobs expect write > performance of a single copy and expect copy to happen in the background. > We have the proactive self-heal daemon running on the server machines now > (which again is a client process which happens to be physically placed on > the server) which gives us many interesting possibilities - i.e, with > simple changes where we fool the client side replicate translator at the > time of transaction initiation that only the closest server is up at that > point of time and write to it alone, and have the proactive self-heal > daemon perform the extra copies in the background. This would be consistent > with other readers as they get directed to the "right" version of the file > by inspecting the changelogs while the background replication is in > progress. > > The intention of the above example is to give a general sense of how we > want to evolve the architecture (i.e, the "intention" you were referring > to) - keep the clients intelligent and servers dumb. If some intelligence > needs to be built on the physical server, tackle it by loading a client > process there (there are also "pathinfo xattr" kind of internal techniques > to figure out locality of the clients in a generic way without bringing > "server sidedness" into them in a harsh way) Okay .. But what happened to the "brick" architecture of stacking anything on anything? I think you point that out here ... > I'll > > write an rsyncd wrapper myself, to run on top of Gluster, > > if the intent is not allow the configuration I'm after > > (arbitrary number of disks in one multi-host environment > > replicated to an arbitrary number of disks in another > > multi-host environment, where ideally each environment > > need not sum to the same data capacity, presented in a > > single contiguous consumable storage layer to an > > arbitrary number of unintelligent clients, that is as fault > > tolerant as I choose it to be including the ability to add > > and offline/online and remove storage as I so choose) .. > > or switch out the whole solution if Gluster is heading > > away from my needs. I just need to know what the > > direction is .. I may even be able to help get you there if > > you tell me :) > > > > > There are good and bad in both styles (distribute on top v/s replicate on > top). Replicate on top gives you much better flexibility of configuration. > Distribute on top is easier for us developers. As a user I would like > replicate on top as well. But the problem today is that replicate (and > self-heal) does not understand "partial failure" of its subvolumes. If one > of the subvolume of replicate is a distribute, then today's replicate only > understands complete failure of the distribute set or it assumes everything > is completely fine. An example is self-healing of directory entries. If a > file is "missing" in one subvolume because a distribute node is temporarily > down, replicate has no clue why it is missing (or that it should keep away > from attempting to self-heal). Along the same lines, it does not know that > once a server is taken off from its distribute subvolume for good that it > needs to start recreating missing files. Hmm. I loved the brick idea. I don't like perverting it by trying to "see through" layers. 
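To make the two layouts being contrasted here concrete, a hand-written client volfile for the "replicate on top" arrangement would look roughly like this. Host names, brick names and the 2x2 layout are invented for illustration, and glusterd does not generate this graph today:

volume c1
  type protocol/client
  option remote-host serverA
  option remote-subvolume posix1
end-volume

volume c2
  type protocol/client
  option remote-host serverB
  option remote-subvolume posix1
end-volume

volume c3
  type protocol/client
  option remote-host serverC
  option remote-subvolume posix1
end-volume

volume c4
  type protocol/client
  option remote-host serverD
  option remote-subvolume posix1
end-volume

volume dist-a
  type cluster/distribute
  subvolumes c1 c2
end-volume

volume dist-b
  type cluster/distribute
  subvolumes c3 c4
end-volume

volume mirror
  type cluster/replicate
  subvolumes dist-a dist-b
end-volume

The "distribute on top" graph that glusterd produces today is the same set of stanzas inverted: c1/c2 and c3/c4 each sit under a cluster/replicate pair, and the two pairs become subvolumes of a single cluster/distribute.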
In that context I can see two or three expected outcomes from someone building this type of stack (heh: a quick trick brick stack) - when a distribute child disappears; At the Distribute layer; 1) The distribute name space / stat space remains in tact, though the content is obviously not avail. 2) The distribute presentation is pure and true of its constituents, showing only the names / stats that are online/avail. In its standalone case, 2 is probably preferable as it allows clean add/start/stop/ remove capacity. At the Replicate layer; 3) replication occurs only where the name / stat space shows a gap 4) the replication occurs at any delta I don't think there's a real choice here, even if 3 were sensible, what would replicate do if there was a local name and even just a remote file size change, when there's no local content to update; it must be 4. In which case, I would expect that a replicate on top of a distribute with a missing child would suddenly see a delta that it would immediately set about repairing. > The effort to fix this seems to be big enough to disturb the inertia of > status quo. If this is fixed, we can definitely adopt a replicate-on-top > mode in glusterd. I'm not sure why there needs to be a "fix" .. wasn't the previous behaviour sensible? Or, if there is something to "change", then bolstering the distribute module might be enough - a combination of 1 and 2 above. Try this out: what if the Distribute layer maintained a full name space on each child, and didn't allow "recreation"? Say 3 children, one is broken/offline, so that /path/to/child/3/file is missing but is known to be missing (internally to Distribute). Then the Distribute brick can both not show the name space to the parent layers, but can also actively prevent manipulation of those files (the parent can neither stat /path/to/child/3/file nor unlink, nor create/write to it). If this change is meant to be permanent, then the administrative act of removing the child from distribute will then truncate the locked name space, allowing parents (be they users or other bricks, like Replicate) to act as they please (such as recreating the missing files). If you adhere to the principles that I thought I understood from 2009 or so then you should be able to let the users create unforeseen Gluster architectures without fear or impact. I.e. i) each brick is fully self contained * ii) physical bricks are the bread of a brick stack sandwich ** iii) any logical brick can appear above/below any other logical brick in a brick stack * Not mandating a 1:1 file mapping from layer to layer ** Eg: the Posix (bottom), Client (bottom), Server (top) and NFS (top) are all regarded as physical bricks. Thus it was my expectation that a dedupe brick (being logical) could either go above or below a distribute brick (also logical), for example. Or that an encryption brick could go on top of replicate which was on top of encryption which was on top of distribute which was on top of encryption on top of posix, for example. Or .. am I over simplifying the problem space? -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 23:52:43 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 09:52:43 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205102352.q4ANqhc6018790@singularity.tronunltd.com> Actually, I want to clarify this point; > But the problem today is that replicate (and > self-heal) does not understand "partial failure" > of its subvolumes. 
If one of the subvolume of > replicate is a distribute, then today's replicate > only understands complete failure of the > distribute set or it assumes everything is > completely fine. I haven't seen this in practice .. I have seen replicate attempt to repair anything that was "missing" and that both the replicate and the underlying bricks were still viable storage layers in that process ... ----- Original Message ----- >From: "Ian Latter" >To: "Anand Avati" >Subject: Re: [Gluster-devel] ZkFarmer >Date: Fri, 11 May 2012 09:39:58 +1000 > > > > Sure, I have my own vol files that do (did) what I wanted > > > and I was supporting myself (and users); the question > > > (and the point) is what is the GlusterFS *intent*? > > > > > > The "intent" (more or less - I hate to use the word as it > can imply a > > commitment to what I am about to say, but there isn't one) > is to keep the > > bricks (server process) dumb and have the intelligence on > the client side. > > This is a "rough goal". There are cases where replication > on the server > > side is inevitable (in the case of NFS access) but we keep > the software > > architecture undisturbed by running a client process on > the server machine > > to achieve it. > > [There's a difference between intent and plan/roadmap] > > Okay. Unfortunately I am unable to leverage this - I tried > to serve a Fuse->GlusterFS client mount point (of a > Distribute volume) as a GlusterFS posix brick (for a > Replicate volume) and it wouldn't play ball .. > > > We do plan to support "replication on the server" in the > future while still > > retaining the existing software architecture as much as > possible. This is > > particularly useful in Hadoop environment where the jobs > expect write > > performance of a single copy and expect copy to happen in > the background. > > We have the proactive self-heal daemon running on the > server machines now > > (which again is a client process which happens to be > physically placed on > > the server) which gives us many interesting possibilities > - i.e, with > > simple changes where we fool the client side replicate > translator at the > > time of transaction initiation that only the closest > server is up at that > > point of time and write to it alone, and have the > proactive self-heal > > daemon perform the extra copies in the background. This > would be consistent > > with other readers as they get directed to the "right" > version of the file > > by inspecting the changelogs while the background > replication is in > > progress. > > > > The intention of the above example is to give a general > sense of how we > > want to evolve the architecture (i.e, the "intention" you > were referring > > to) - keep the clients intelligent and servers dumb. If > some intelligence > > needs to be built on the physical server, tackle it by > loading a client > > process there (there are also "pathinfo xattr" kind of > internal techniques > > to figure out locality of the clients in a generic way > without bringing > > "server sidedness" into them in a harsh way) > > Okay .. But what happened to the "brick" architecture > of stacking anything on anything? I think you point > that out here ... 
> > > > I'll > > > write an rsyncd wrapper myself, to run on top of Gluster, > > > if the intent is not allow the configuration I'm after > > > (arbitrary number of disks in one multi-host environment > > > replicated to an arbitrary number of disks in another > > > multi-host environment, where ideally each environment > > > need not sum to the same data capacity, presented in a > > > single contiguous consumable storage layer to an > > > arbitrary number of unintelligent clients, that is as fault > > > tolerant as I choose it to be including the ability to add > > > and offline/online and remove storage as I so choose) .. > > > or switch out the whole solution if Gluster is heading > > > away from my needs. I just need to know what the > > > direction is .. I may even be able to help get you there if > > > you tell me :) > > > > > > > > There are good and bad in both styles (distribute on top > v/s replicate on > > top). Replicate on top gives you much better flexibility > of configuration. > > Distribute on top is easier for us developers. As a user I > would like > > replicate on top as well. But the problem today is that > replicate (and > > self-heal) does not understand "partial failure" of its > subvolumes. If one > > of the subvolume of replicate is a distribute, then > today's replicate only > > understands complete failure of the distribute set or it > assumes everything > > is completely fine. An example is self-healing of > directory entries. If a > > file is "missing" in one subvolume because a distribute > node is temporarily > > down, replicate has no clue why it is missing (or that it > should keep away > > from attempting to self-heal). Along the same lines, it > does not know that > > once a server is taken off from its distribute subvolume > for good that it > > needs to start recreating missing files. > > Hmm. I loved the brick idea. I don't like perverting it by > trying to "see through" layers. In that context I can see > two or three expected outcomes from someone building > this type of stack (heh: a quick trick brick stack) - when > a distribute child disappears; > > At the Distribute layer; > 1) The distribute name space / stat space > remains in tact, though the content is > obviously not avail. > 2) The distribute presentation is pure and true > of its constituents, showing only the names > / stats that are online/avail. > > In its standalone case, 2 is probably > preferable as it allows clean add/start/stop/ > remove capacity. > > At the Replicate layer; > 3) replication occurs only where the name / > stat space shows a gap > 4) the replication occurs at any delta > > I don't think there's a real choice here, even > if 3 were sensible, what would replicate do if > there was a local name and even just a remote > file size change, when there's no local content > to update; it must be 4. > > In which case, I would expect that a replicate > on top of a distribute with a missing child would > suddenly see a delta that it would immediately > set about repairing. > > > > The effort to fix this seems to be big enough to disturb > the inertia of > > status quo. If this is fixed, we can definitely adopt a > replicate-on-top > > mode in glusterd. > > I'm not sure why there needs to be a "fix" .. wasn't > the previous behaviour sensible? > > Or, if there is something to "change", then > bolstering the distribute module might be enough - > a combination of 1 and 2 above. 
> > Try this out: what if the Distribute layer maintained > a full name space on each child, and didn't allow > "recreation"? Say 3 children, one is broken/offline, > so that /path/to/child/3/file is missing but is known > to be missing (internally to Distribute). Then the > Distribute brick can both not show the name > space to the parent layers, but can also actively > prevent manipulation of those files (the parent > can neither stat /path/to/child/3/file nor unlink, nor > create/write to it). If this change is meant to be > permanent, then the administrative act of > removing the child from distribute will then > truncate the locked name space, allowing parents > (be they users or other bricks, like Replicate) to > act as they please (such as recreating the > missing files). > > If you adhere to the principles that I thought I > understood from 2009 or so then you should be > able to let the users create unforeseen Gluster > architectures without fear or impact. I.e. > > i) each brick is fully self contained * > ii) physical bricks are the bread of a brick > stack sandwich ** > iii) any logical brick can appear above/below > any other logical brick in a brick stack > > * Not mandating a 1:1 file mapping from layer > to layer > > ** Eg: the Posix (bottom), Client (bottom), > Server (top) and NFS (top) are all > regarded as physical bricks. > > Thus it was my expectation that a dedupe brick > (being logical) could either go above or below > a distribute brick (also logical), for example. > > Or that an encryption brick could go on top > of replicate which was on top of encryption > which was on top of distribute which was on > top of encryption on top of posix, for example. > > > Or .. am I over simplifying the problem space? > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From vbellur at redhat.com Fri May 11 07:06:38 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Fri, 11 May 2012 12:36:38 +0530 Subject: [Gluster-devel] release-3.3 branched out Message-ID: <4FACBA7E.6090801@redhat.com> A new branch release-3.3 has been created. You can checkout the branch via: $git checkout -b release-3.3 origin/release-3.3 rfc.sh has been updated to send patches to the appropriate branch. The plan is to have all 3.3.x releases happen off this branch. If you need any fix to be part of a 3.3.x release, please send out a backport of the same from master to release-3.3 after it has been accepted in master. Thanks, Vijay From manu at netbsd.org Fri May 11 07:29:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 11 May 2012 07:29:20 +0000 Subject: [Gluster-devel] release-3.3 branched out In-Reply-To: <4FACBA7E.6090801@redhat.com> References: <4FACBA7E.6090801@redhat.com> Message-ID: <20120511072920.GG18684@homeworld.netbsd.org> On Fri, May 11, 2012 at 12:36:38PM +0530, Vijay Bellur wrote: > A new branch release-3.3 has been created. You can checkout the branch via: Any chance someone merge my build fixes so that I can pullup to the new branch? 
http://review.gluster.com/3238 -- Emmanuel Dreyfus manu at netbsd.org From vbellur at redhat.com Fri May 11 07:43:13 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Fri, 11 May 2012 13:13:13 +0530 Subject: [Gluster-devel] release-3.3 branched out In-Reply-To: <20120511072920.GG18684@homeworld.netbsd.org> References: <4FACBA7E.6090801@redhat.com> <20120511072920.GG18684@homeworld.netbsd.org> Message-ID: <4FACC311.5020708@redhat.com> On 05/11/2012 12:59 PM, Emmanuel Dreyfus wrote: > On Fri, May 11, 2012 at 12:36:38PM +0530, Vijay Bellur wrote: >> A new branch release-3.3 has been created. You can checkout the branch via: > Any chance someone merge my build fixes so that I can pullup to the > new branch? > http://review.gluster.com/3238 Merged to master. Vijay From vijay at build.gluster.com Fri May 11 10:35:24 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Fri, 11 May 2012 03:35:24 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa41 released Message-ID: <20120511103527.5809B18009D@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa41/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa41.tar.gz This release is made off v3.3.0qa41 From 7220022 at gmail.com Sat May 12 15:22:57 2012 From: 7220022 at gmail.com (7220022) Date: Sat, 12 May 2012 19:22:57 +0400 Subject: [Gluster-devel] Gluster VSA for VMware ESX Message-ID: <012701cd3053$1d2e6110$578b2330$@gmail.com> Would love to test performance of Gluster Virtual Storage Appliance for VMware, but cannot get the demo. Emails and calls to Red Hat went unanswered. We've built a nice test system for the cluster at our lab, 8 modern servers running ESX4.1 and connected via 40gb InfiniBand fabric. Each server has 24 2.5" drives, SLC SSD and 10K SAS HDD-s connected to 6 LSI controllers with CacheCade (Pro 2.0 with write cache enabled,) 4 drives per controller. The plan is to test performance using bricks made of HDD-s cached with SSD-s, as well as HDD-s and SSD-s separately. Can anyone help getting the demo version of VSA? It's fine if it's a beta version, we just wanted to check the performance and scalability. -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Sun May 13 08:27:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 13 May 2012 10:27:20 +0200 Subject: [Gluster-devel] buffer corruption in io-stats Message-ID: <1kk12tm.1awqq7kf1joseM%manu@netbsd.org> I get a reproductible SIGSEGV with sources from latest git. iosfd is overwritten by the file path, it seems there is a confusion somewhere between iosfd->filename pointer value and pointed buffer (gdb) bt #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb37a7 in __gf_free (free_ptr=0x74656e2f) at mem-pool.c:258 #3 0xb9a85378 in io_stats_release (this=0xba3e3000, fd=0xb8f9d098) at io-stats.c:2420 #4 0xbbbafcc0 in fd_destroy (fd=0xb8f9d098) at fd.c:507 #5 0xbbbafdf8 in fd_unref (fd=0xb8f9d098) at fd.c:543 #6 0xbbbaf7cf in gf_fdptr_put (fdtable=0xbb77d070, fd=0xb8f9d098) at fd.c:393 #7 0xbb821147 in fuse_release () from /usr/local/lib/glusterfs/3git/xlator/mount/fuse.so #8 0xbb82a2e1 in fuse_thread_proc () from /usr/local/lib/glusterfs/3git/xlator/mount/fuse.so (gdb) frame 3 #3 0xb9a85378 in io_stats_release (this=0xba3e3000, fd=0xb8f9d098) at io-stats.c:2420 2420 GF_FREE (iosfd->filename); (gdb) print *iosfd $2 = {filename = 0x74656e2f
, data_written = 3418922014271107938, data_read = 7813586423313035891, block_count_write = {4788563690262784356, 3330756270057407571, 7074933154630937908, 28265, 0 }, block_count_read = { 0 }, opened_at = {tv_sec = 1336897011, tv_usec = 145734}} (gdb) x/10s iosfd 0xbb70f800: "/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin" -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 13 14:42:45 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 13 May 2012 16:42:45 +0200 Subject: [Gluster-devel] python version Message-ID: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> Hi There is a problem with python version detection in the configure script. The machine on which autotools is ran prior releasing glusterfs expands AM_PATH_PYTHON into a script that fails to accept python > 2.4. As I understand, a solution is to concatenate latest automake-1.12/m4/python.m4 into glusterfs' aclocal.m4. That way python up to 3.1 shoul be accepted. Opinions? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From renqiang at 360buy.com Mon May 14 01:20:32 2012 From: renqiang at 360buy.com (=?gb2312?B?yM7Hvw==?=) Date: Mon, 14 May 2012 09:20:32 +0800 Subject: [Gluster-devel] balance stoped Message-ID: <018001cd316f$c25a6f90$470f4eb0$@com> Hi,All! May I ask you a question? When we do balance on a volume, it stopped when moving the 505th?s file 0f 1006 files. Now we cannot restart it and also cannot cancel it. How can I do, please? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Mon May 14 01:22:43 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 11:22:43 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> Hello, I'm looking for a seek (lseek) implementation in one of the modules and I can't see one. Do I need to care about seeking if my module changes the file size (i.e. compresses) in Gluster? I would have thought that I did except that I believe that what I'm reading is that Gluster returns a NONSEEKABLE flag on file open (fuse_kernel.h at line 149). Does this mitigate the need to correct the user seeks? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 07:48:17 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 09:48:17 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> References: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> Message-ID: <4FB0B8C1.4020908@datalab.es> Hello Ian, there is no such thing as an explicit seek in glusterfs. Each readv, writev, (f)truncate and rchecksum have an offset parameter that tells you the position where the operation must be performed. If you make something that changes the size of the file you must make it in a way that it is transparent to upper translators. This means that all offsets you will receive are "real" (in your case, offsets in the uncompressed version of the file). You should calculate in some way the equivalent offset in the compressed version of the file and send it to the correspoding fop of the lower translators. In the same way, you must return in all iatt structures the real size of the file (not the compressed size). I'm not sure what is the intended use of NONSEEKABLE, but I think it is for special file types, like devices or similar that are sequential in nature. 
Anyway, this is a fuse flag that you can't return from a regular translator open fop. Xavi On 05/14/2012 03:22 AM, Ian Latter wrote: > Hello, > > > I'm looking for a seek (lseek) implementation in > one of the modules and I can't see one. > > Do I need to care about seeking if my module > changes the file size (i.e. compresses) in Gluster? > I would have thought that I did except that I believe > that what I'm reading is that Gluster returns a > NONSEEKABLE flag on file open (fuse_kernel.h at > line 149). Does this mitigate the need to correct > the user seeks? > > > Cheers, > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Mon May 14 09:51:59 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 19:51:59 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205140951.q4E9px5H001754@singularity.tronunltd.com> Hello Xavi, Ok - thanks. I was hoping that this was how read and write were working (i.e. with absolute offsets and not just getting relative offsets from the current seek point), however what of the raw seek command? len = lseek(fd, 0, SEEK_END); Upon successful completion, lseek() returns the resulting offset location as measured in bytes from the beginning of the file. Any idea on where the return value comes from? I will need to fake up a file size for this command .. ----- Original Message ----- >From: "Xavier Hernandez" >To: >Subject: Re: [Gluster-devel] lseek >Date: Mon, 14 May 2012 09:48:17 +0200 > > Hello Ian, > > there is no such thing as an explicit seek in glusterfs. Each readv, > writev, (f)truncate and rchecksum have an offset parameter that tells > you the position where the operation must be performed. > > If you make something that changes the size of the file you must make it > in a way that it is transparent to upper translators. This means that > all offsets you will receive are "real" (in your case, offsets in the > uncompressed version of the file). You should calculate in some way the > equivalent offset in the compressed version of the file and send it to > the correspoding fop of the lower translators. > > In the same way, you must return in all iatt structures the real size of > the file (not the compressed size). > > I'm not sure what is the intended use of NONSEEKABLE, but I think it is > for special file types, like devices or similar that are sequential in > nature. Anyway, this is a fuse flag that you can't return from a regular > translator open fop. > > Xavi > > On 05/14/2012 03:22 AM, Ian Latter wrote: > > Hello, > > > > > > I'm looking for a seek (lseek) implementation in > > one of the modules and I can't see one. > > > > Do I need to care about seeking if my module > > changes the file size (i.e. compresses) in Gluster? > > I would have thought that I did except that I believe > > that what I'm reading is that Gluster returns a > > NONSEEKABLE flag on file open (fuse_kernel.h at > > line 149). Does this mitigate the need to correct > > the user seeks? > > > > > > Cheers, > > > > > > > > -- > > Ian Latter > > Late night coder .. 
> > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 10:29:54 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 12:29:54 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205140951.q4E9px5H001754@singularity.tronunltd.com> References: <201205140951.q4E9px5H001754@singularity.tronunltd.com> Message-ID: <4FB0DEA2.3030805@datalab.es> Hello Ian, lseek calls are handled internally by the kernel and they never reach the user land for fuse calls. lseek only updates the current file offset that is stored inside the kernel file's structure. This value is what is passed to read/write fuse calls as an absolute offset. There isn't any problem in this behavior as long as you hide all size manipulations from fuse. If you write a translator that compresses a file, you should do so in a transparent manner. This means, basically, that: 1. Whenever you are asked to return the file size, you must return the size of the uncompressed file 2. Whenever you receive an offset, you must translate that offset to the corresponding offset in the compressed file and work with that 3. Whenever you are asked to read or write data, you must return the number of uncompressed bytes read or written (even if you have compressed the chunk of data to a smaller size and you have physically written less bytes). 4. All read requests must return uncompressed data (this seems obvious though) This guarantees that your manipulations are not seen in any way by any upper translator or even fuse, thus everything should work smoothly. If you respect these rules, lseek (and your translator) will work as expected. In particular, when a user calls lseek with SEEK_END, the kernel takes the size of the file from the internal kernel inode's structure. This size is obtained through a previous call to lookup or updated using the result of write operations. If you respect points 1 and 3, this value will be correct. In gluster there are a lot of fops that return a iatt structure. You must guarantee that all these functions return the correct size of the file in the field ia_size to be sure that everything works as expected. Xavi On 05/14/2012 11:51 AM, Ian Latter wrote: > Hello Xavi, > > > Ok - thanks. I was hoping that this was how read > and write were working (i.e. with absolute offsets > and not just getting relative offsets from the current > seek point), however what of the raw seek > command? > > len = lseek(fd, 0, SEEK_END); > > Upon successful completion, lseek() returns > the resulting offset location as measured in > bytes from the beginning of the file. > > Any idea on where the return value comes from? > I will need to fake up a file size for this command .. > > > > ----- Original Message ----- >> From: "Xavier Hernandez" >> To: >> Subject: Re: [Gluster-devel] lseek >> Date: Mon, 14 May 2012 09:48:17 +0200 >> >> Hello Ian, >> >> there is no such thing as an explicit seek in glusterfs. > Each readv, >> writev, (f)truncate and rchecksum have an offset parameter > that tells >> you the position where the operation must be performed. 
>> >> If you make something that changes the size of the file > you must make it >> in a way that it is transparent to upper translators. This > means that >> all offsets you will receive are "real" (in your case, > offsets in the >> uncompressed version of the file). You should calculate in > some way the >> equivalent offset in the compressed version of the file > and send it to >> the correspoding fop of the lower translators. >> >> In the same way, you must return in all iatt structures > the real size of >> the file (not the compressed size). >> >> I'm not sure what is the intended use of NONSEEKABLE, but > I think it is >> for special file types, like devices or similar that are > sequential in >> nature. Anyway, this is a fuse flag that you can't return > from a regular >> translator open fop. >> >> Xavi >> >> On 05/14/2012 03:22 AM, Ian Latter wrote: >>> Hello, >>> >>> >>> I'm looking for a seek (lseek) implementation in >>> one of the modules and I can't see one. >>> >>> Do I need to care about seeking if my module >>> changes the file size (i.e. compresses) in Gluster? >>> I would have thought that I did except that I believe >>> that what I'm reading is that Gluster returns a >>> NONSEEKABLE flag on file open (fuse_kernel.h at >>> line 149). Does this mitigate the need to correct >>> the user seeks? >>> >>> >>> Cheers, >>> >>> >>> >>> -- >>> Ian Latter >>> Late night coder .. >>> http://midnightcode.org/ >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at nongnu.org >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at nongnu.org >> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Mon May 14 11:18:22 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 21:18:22 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205141118.q4EBIMku002113@singularity.tronunltd.com> Hello Xavier, I don't have a problem with the principles, these were effectively how I was traveling (the notable difference is statfs which I want to pass-through unaffected, reporting the true file system capacity such that a du [stat] may sum to a greater value than a df [statfs]). In 2009 I had a mostly- functional hashing write function and a dubious read function (I stumbled when I had to open a file from within a fop). But I think what you're telling/showing me is that I have no deep understanding of the mapping of the system calls to their Fuse->Gluster fops - which is expected :) And, this is a better outcome than learning that Gluster has gaps in its framework with regard to my objective. I.e. I didn't know that lseek mapped to lookup. And the examples aren't comprehensive enough (rot-13 is the only one that really manipulates content, and it only plays with read and write, obviously because it has a 1:1 relationship with the data). This is the key, and not something that I was expecting; > In gluster there are a lot of fops that return a iatt > structure. You must guarantee that all these > functions return the correct size of the file in > the field ia_size to be sure that everything works > as expected. 
I'll do my best to build a comprehensive list of iatt returning fops from the examples ... but I'd say it'll take a solid peer review to get this hammered out properly. Thanks for steering me straight Xavi, appreciate it. ----- Original Message ----- >From: "Xavier Hernandez" >To: "Ian Latter" >Subject: Re: [Gluster-devel] lseek >Date: Mon, 14 May 2012 12:29:54 +0200 > > Hello Ian, > > lseek calls are handled internally by the kernel and they never reach > the user land for fuse calls. lseek only updates the current file offset > that is stored inside the kernel file's structure. This value is what is > passed to read/write fuse calls as an absolute offset. > > There isn't any problem in this behavior as long as you hide all size > manipulations from fuse. If you write a translator that compresses a > file, you should do so in a transparent manner. This means, basically, that: > > 1. Whenever you are asked to return the file size, you must return the > size of the uncompressed file > 2. Whenever you receive an offset, you must translate that offset to the > corresponding offset in the compressed file and work with that > 3. Whenever you are asked to read or write data, you must return the > number of uncompressed bytes read or written (even if you have > compressed the chunk of data to a smaller size and you have physically > written less bytes). > 4. All read requests must return uncompressed data (this seems obvious > though) > > This guarantees that your manipulations are not seen in any way by any > upper translator or even fuse, thus everything should work smoothly. > > If you respect these rules, lseek (and your translator) will work as > expected. > > In particular, when a user calls lseek with SEEK_END, the kernel takes > the size of the file from the internal kernel inode's structure. This > size is obtained through a previous call to lookup or updated using the > result of write operations. If you respect points 1 and 3, this value > will be correct. > > In gluster there are a lot of fops that return a iatt structure. You > must guarantee that all these functions return the correct size of the > file in the field ia_size to be sure that everything works as expected. > > Xavi > > On 05/14/2012 11:51 AM, Ian Latter wrote: > > Hello Xavi, > > > > > > Ok - thanks. I was hoping that this was how read > > and write were working (i.e. with absolute offsets > > and not just getting relative offsets from the current > > seek point), however what of the raw seek > > command? > > > > len = lseek(fd, 0, SEEK_END); > > > > Upon successful completion, lseek() returns > > the resulting offset location as measured in > > bytes from the beginning of the file. > > > > Any idea on where the return value comes from? > > I will need to fake up a file size for this command .. > > > > > > > > ----- Original Message ----- > >> From: "Xavier Hernandez" > >> To: > >> Subject: Re: [Gluster-devel] lseek > >> Date: Mon, 14 May 2012 09:48:17 +0200 > >> > >> Hello Ian, > >> > >> there is no such thing as an explicit seek in glusterfs. > > Each readv, > >> writev, (f)truncate and rchecksum have an offset parameter > > that tells > >> you the position where the operation must be performed. > >> > >> If you make something that changes the size of the file > > you must make it > >> in a way that it is transparent to upper translators. This > > means that > >> all offsets you will receive are "real" (in your case, > > offsets in the > >> uncompressed version of the file). 
You should calculate in > > some way the > >> equivalent offset in the compressed version of the file > > and send it to > >> the correspoding fop of the lower translators. > >> > >> In the same way, you must return in all iatt structures > > the real size of > >> the file (not the compressed size). > >> > >> I'm not sure what is the intended use of NONSEEKABLE, but > > I think it is > >> for special file types, like devices or similar that are > > sequential in > >> nature. Anyway, this is a fuse flag that you can't return > > from a regular > >> translator open fop. > >> > >> Xavi > >> > >> On 05/14/2012 03:22 AM, Ian Latter wrote: > >>> Hello, > >>> > >>> > >>> I'm looking for a seek (lseek) implementation in > >>> one of the modules and I can't see one. > >>> > >>> Do I need to care about seeking if my module > >>> changes the file size (i.e. compresses) in Gluster? > >>> I would have thought that I did except that I believe > >>> that what I'm reading is that Gluster returns a > >>> NONSEEKABLE flag on file open (fuse_kernel.h at > >>> line 149). Does this mitigate the need to correct > >>> the user seeks? > >>> > >>> > >>> Cheers, > >>> > >>> > >>> > >>> -- > >>> Ian Latter > >>> Late night coder .. > >>> http://midnightcode.org/ > >>> > >>> _______________________________________________ > >>> Gluster-devel mailing list > >>> Gluster-devel at nongnu.org > >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel > >> _______________________________________________ > >> Gluster-devel mailing list > >> Gluster-devel at nongnu.org > >> https://lists.nongnu.org/mailman/listinfo/gluster-devel > >> > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 11:47:10 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 13:47:10 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205141118.q4EBIMku002113@singularity.tronunltd.com> References: <201205141118.q4EBIMku002113@singularity.tronunltd.com> Message-ID: <4FB0F0BE.9030009@datalab.es> Hello Ian, I didn't thought in statfs. In this special case things are a bit harder for a compression translator. I think it's impossible to return accurate data without a considerable amount of work. Maybe some estimation of the available space based on the current achieved mean compression ratio would be sufficient, but never accurate. With more work you could even be able to say exactly how much space have been used, but the best you can do with the remaining space is an estimation. Regarding lseek, there isn't a map with lookup. Probably I haven't explained it as well as I wanted. There are basically two kinds of user mode calls. Those that use a string containing a filename to operate with (stat, unlink, open, creat, ...), and those that use a file descriptor (fstat, read, write, ...). The kernel does not work with names to handle files, so it has to translate the names to inodes to work with them. This means that any call that uses a string will need to make a "lookup" to get the associated inode (the only exception is creat, that creates a new inode without using lookup). This means that every filename based operation can generate a lookup request (although some caching mechanism may reduce the number of calls). 
All operations that work with a file descriptor do not generate a lookup request, because the file descriptor is already bound to an inode. In your particular case, to do an lseek you must have made a previous call to open (that would have generated a lookup request) or creat. Hope this better explains how kernel and gluster are bound... Xavi On 05/14/2012 01:18 PM, Ian Latter wrote: > Hello Xavier, > > > I don't have a problem with the principles, these > were effectively how I was traveling (the notable > difference is statfs which I want to pass-through > unaffected, reporting the true file system capacity > such that a du [stat] may sum to a greater value > than a df [statfs]). In 2009 I had a mostly- > functional hashing write function and a dubious > read function (I stumbled when I had to open a > file from within a fop). > > But I think what you're telling/showing me is that > I have no deep understanding of the mapping of > the system calls to their Fuse->Gluster fops - > which is expected :) And, this is a better outcome > than learning that Gluster has gaps in its > framework with regard to my objective. I.e. I > didn't know that lseek mapped to lookup. And > the examples aren't comprehensive enough > (rot-13 is the only one that really manipulates > content, and it only plays with read and write, > obviously because it has a 1:1 relationship with > the data). > > This is the key, and not something that I was > expecting; > >> In gluster there are a lot of fops that return a iatt >> structure. You must guarantee that all these >> functions return the correct size of the file in >> the field ia_size to be sure that everything works >> as expected. > I'll do my best to build a comprehensive list of iatt > returning fops from the examples ... but I'd say it'll > take a solid peer review to get this hammered out > properly. > > Thanks for steering me straight Xavi, appreciate > it. > > > > ----- Original Message ----- >> From: "Xavier Hernandez" >> To: "Ian Latter" >> Subject: Re: [Gluster-devel] lseek >> Date: Mon, 14 May 2012 12:29:54 +0200 >> >> Hello Ian, >> >> lseek calls are handled internally by the kernel and they > never reach >> the user land for fuse calls. lseek only updates the > current file offset >> that is stored inside the kernel file's structure. This > value is what is >> passed to read/write fuse calls as an absolute offset. >> >> There isn't any problem in this behavior as long as you > hide all size >> manipulations from fuse. If you write a translator that > compresses a >> file, you should do so in a transparent manner. This > means, basically, that: >> 1. Whenever you are asked to return the file size, you > must return the >> size of the uncompressed file >> 2. Whenever you receive an offset, you must translate that > offset to the >> corresponding offset in the compressed file and work with that >> 3. Whenever you are asked to read or write data, you must > return the >> number of uncompressed bytes read or written (even if you > have >> compressed the chunk of data to a smaller size and you > have physically >> written less bytes). >> 4. All read requests must return uncompressed data (this > seems obvious >> though) >> >> This guarantees that your manipulations are not seen in > any way by any >> upper translator or even fuse, thus everything should work > smoothly. >> If you respect these rules, lseek (and your translator) > will work as >> expected. 
>> >> In particular, when a user calls lseek with SEEK_END, the > kernel takes >> the size of the file from the internal kernel inode's > structure. This >> size is obtained through a previous call to lookup or > updated using the >> result of write operations. If you respect points 1 and 3, > this value >> will be correct. >> >> In gluster there are a lot of fops that return a iatt > structure. You >> must guarantee that all these functions return the correct > size of the >> file in the field ia_size to be sure that everything works > as expected. >> Xavi >> >> On 05/14/2012 11:51 AM, Ian Latter wrote: >>> Hello Xavi, >>> >>> >>> Ok - thanks. I was hoping that this was how read >>> and write were working (i.e. with absolute offsets >>> and not just getting relative offsets from the current >>> seek point), however what of the raw seek >>> command? >>> >>> len = lseek(fd, 0, SEEK_END); >>> >>> Upon successful completion, lseek() returns >>> the resulting offset location as measured in >>> bytes from the beginning of the file. >>> >>> Any idea on where the return value comes from? >>> I will need to fake up a file size for this command .. >>> >>> >>> >>> ----- Original Message ----- >>>> From: "Xavier Hernandez" >>>> To: >>>> Subject: Re: [Gluster-devel] lseek >>>> Date: Mon, 14 May 2012 09:48:17 +0200 >>>> >>>> Hello Ian, >>>> >>>> there is no such thing as an explicit seek in glusterfs. >>> Each readv, >>>> writev, (f)truncate and rchecksum have an offset parameter >>> that tells >>>> you the position where the operation must be performed. >>>> >>>> If you make something that changes the size of the file >>> you must make it >>>> in a way that it is transparent to upper translators. This >>> means that >>>> all offsets you will receive are "real" (in your case, >>> offsets in the >>>> uncompressed version of the file). You should calculate in >>> some way the >>>> equivalent offset in the compressed version of the file >>> and send it to >>>> the correspoding fop of the lower translators. >>>> >>>> In the same way, you must return in all iatt structures >>> the real size of >>>> the file (not the compressed size). >>>> >>>> I'm not sure what is the intended use of NONSEEKABLE, but >>> I think it is >>>> for special file types, like devices or similar that are >>> sequential in >>>> nature. Anyway, this is a fuse flag that you can't return >>> from a regular >>>> translator open fop. >>>> >>>> Xavi >>>> >>>> On 05/14/2012 03:22 AM, Ian Latter wrote: >>>>> Hello, >>>>> >>>>> >>>>> I'm looking for a seek (lseek) implementation in >>>>> one of the modules and I can't see one. >>>>> >>>>> Do I need to care about seeking if my module >>>>> changes the file size (i.e. compresses) in Gluster? >>>>> I would have thought that I did except that I believe >>>>> that what I'm reading is that Gluster returns a >>>>> NONSEEKABLE flag on file open (fuse_kernel.h at >>>>> line 149). Does this mitigate the need to correct >>>>> the user seeks? >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> >>>>> >>>>> -- >>>>> Ian Latter >>>>> Late night coder .. >>>>> http://midnightcode.org/ >>>>> >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at nongnu.org >>>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at nongnu.org >>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >>>> >>> -- >>> Ian Latter >>> Late night coder .. 
>>> http://midnightcode.org/ >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at nongnu.org >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From kkeithle at redhat.com Mon May 14 14:17:12 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 10:17:12 -0400 Subject: [Gluster-devel] python version In-Reply-To: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> Message-ID: <4FB113E8.0@redhat.com> On 05/13/2012 10:42 AM, Emmanuel Dreyfus wrote: > Hi > > There is a problem with python version detection in the configure > script. The machine on which autotools is ran prior releasing glusterfs > expands AM_PATH_PYTHON into a script that fails to accept python> 2.4. > > As I understand, a solution is to concatenate latest > automake-1.12/m4/python.m4 into glusterfs' aclocal.m4. That way python > up to 3.1 should be accepted. Opinions? The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked by ./autogen.sh file in preparation for building gluster. (You have to run autogen.sh to produce the ./configure file.) aclocal uses whatever python.m4 file you have on your system, e.g. /usr/share/aclocal-1.11/python.m4, which is also from the automake package. I presume whoever packages automake for a particular system is taking into consideration what other packages and versions are standard for the system and picks right version of automake. IOW picks the version of automake that has all the (hard-coded) versions of python to match the python they have on their system. If someone has installed a later version of python and not also updated to a compatible version of automake, that's not a problem that gluster should have to solve, or even try to solve. I don't believe we want to require our build process to download the latest-and-greatest version of automake. As a side note, I sampled a few currently shipping systems and see that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the appearances of supporting python 2.5 (and 3.0). Finally, after all that, note that the configure.ac file appears to be hard-coded to require python 2.x, so if anyone is trying to use python 3.x, that's doomed to fail until configure.ac is "fixed." Do we even know why python 2.x is required and why python 3.x can't be used? -- Kaleb From manu at netbsd.org Mon May 14 14:23:47 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 14:23:47 +0000 Subject: [Gluster-devel] python version In-Reply-To: <4FB113E8.0@redhat.com> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: <20120514142347.GA3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 10:17:12AM -0400, Kaleb S. KEITHLEY wrote: > The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked > by ./autogen.sh file in preparation for building gluster. (You have > to run autogen.sh to produce the ./configure file.) Right, then my plan will not work, and the only way to fix the problem is to upgrade automake on the machine that produces the gluterfs releases. 
> As a side note, I sampled a few currently shipping systems and see > that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and > 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the > appearances of supporting python 2.5 (and 3.0). You seem to take for granted that people building a glusterfs release will run autotools before running configure. This is not the way it should work: a released tarball should contain a configure script that works anywhere. The tarballs released up to at least 3.3.0qa40 have a configure script that cannot detect python > 2.4 -- Emmanuel Dreyfus manu at netbsd.org From kkeithle at redhat.com Mon May 14 14:31:32 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 10:31:32 -0400 Subject: [Gluster-devel] python version In-Reply-To: <20120514142347.GA3985@homeworld.netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> <20120514142347.GA3985@homeworld.netbsd.org> Message-ID: <4FB11744.1040907@redhat.com> On 05/14/2012 10:23 AM, Emmanuel Dreyfus wrote: > On Mon, May 14, 2012 at 10:17:12AM -0400, Kaleb S. KEITHLEY wrote: >> The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked >> by ./autogen.sh file in preparation for building gluster. (You have >> to run autogen.sh to produce the ./configure file.) > > Right, then my plan will not work, and the only way to fix the problem > is to upgrade automake on the machine that produces the glusterfs > releases. > >> As a side note, I sampled a few currently shipping systems and see >> that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and >> 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the >> appearances of supporting python 2.5 (and 3.0). > > You seem to take for granted that people building a glusterfs > release will run autotools before running configure. This is not > the way it should work: a released tarball should contain a > configure script that works anywhere. The tarballs released up to > at least 3.3.0qa40 have a configure script that cannot detect python> 2.4 > I looked at what I get when I checkout the source from the git repo and what I have to do to build from a freshly checked out source tree. And yes, we need to upgrade the build machines were we package the release tarballs. Right now is not a good time to do that. -- Kaleb From yknev.shankar at gmail.com Mon May 14 15:31:56 2012 From: yknev.shankar at gmail.com (Venky Shankar) Date: Mon, 14 May 2012 21:01:56 +0530 Subject: [Gluster-devel] python version In-Reply-To: <4FB113E8.0@redhat.com> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: [snip] > Finally, after all that, note that the configure.ac file appears to be > hard-coded to require python 2.x, so if anyone is trying to use python 3.x, > that's doomed to fail until configure.ac is "fixed." Do we even know why > python 2.x is required and why python 3.x can't be used? > python 2.x is required by geo-replication. Although geo-replication is code ready for python 3.x, it's not functionally tested with it. That's the reason configure.ac has 2.x hard-coded. > > -- > > Kaleb > > > ______________________________**_________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/**mailman/listinfo/gluster-devel > Thanks, -Venky -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From manu at netbsd.org Mon May 14 15:45:48 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 15:45:48 +0000 Subject: [Gluster-devel] python version In-Reply-To: References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: <20120514154548.GB3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 09:01:56PM +0530, Venky Shankar wrote: > python 2.x is required by geo-replication. Although geo-replication is code > ready for python 3.x, it's not functionally tested with it. That's the > reason configure.ac has 2.x hard-coded. Well, my problem is that python 2.5, python 2.6 and python 2.7 are not detected by configure. One need to patch configure in order to build with python 2.x (x > 4) installed. -- Emmanuel Dreyfus manu at netbsd.org From kkeithle at redhat.com Mon May 14 16:30:12 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 12:30:12 -0400 Subject: [Gluster-devel] python version In-Reply-To: <20120514154548.GB3985@homeworld.netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> <20120514154548.GB3985@homeworld.netbsd.org> Message-ID: <4FB13314.3060708@redhat.com> On 05/14/2012 11:45 AM, Emmanuel Dreyfus wrote: > On Mon, May 14, 2012 at 09:01:56PM +0530, Venky Shankar wrote: >> python 2.x is required by geo-replication. Although geo-replication is code >> ready for python 3.x, it's not functionally tested with it. That's the >> reason configure.ac has 2.x hard-coded. > > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > detected by configure. One need to patch configure in order to build > with python 2.x (x> 4) installed. > Seems like it would be easier to get autoconf and automake from the NetBSD packages and just run `./autogen.sh && ./configure` (Which, FWIW, is how glusterfs RPMs are built for the Fedora distributions. I'd wager for much the same reason.) -- Kaleb From manu at netbsd.org Mon May 14 18:46:07 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 20:46:07 +0200 Subject: [Gluster-devel] python version In-Reply-To: <4FB13314.3060708@redhat.com> Message-ID: <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Kaleb S. KEITHLEY wrote: > > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > > detected by configure. One need to patch configure in order to build > > with python 2.x (x> 4) installed. > > Seems like it would be easier to get autoconf and automake from the > NetBSD packages and just run `./autogen.sh && ./configure` I prefer patching the configure script. Running autogen introduce build dependencies on perl just to substitute a string on a single line: that's overkill. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From abperiasamy at gmail.com Mon May 14 19:25:20 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Mon, 14 May 2012 12:25:20 -0700 Subject: [Gluster-devel] python version In-Reply-To: <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: On Mon, May 14, 2012 at 11:46 AM, Emmanuel Dreyfus wrote: > Kaleb S. KEITHLEY wrote: > >> > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not >> > detected by configure. One need to patch configure in order to build >> > with python 2.x (x> ?4) installed. 
>> >> Seems like it would be easier to get autoconf and automake from the >> NetBSD packages and just run `./autogen.sh && ./configure` > > I prefer patching the configure script. Running autogen introduce build > dependencies on perl just to substitute a string on a single line: > that's overkill. > Who ever builds from source is required to run autogen.sh to produce env specific configure and build files. "configure" script should not be checked into git repository. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Mon May 14 23:58:18 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 14 May 2012 16:58:18 -0700 Subject: [Gluster-devel] python version In-Reply-To: References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: On Mon, May 14, 2012 at 12:25 PM, Anand Babu Periasamy < abperiasamy at gmail.com> wrote: > On Mon, May 14, 2012 at 11:46 AM, Emmanuel Dreyfus > wrote: > > Kaleb S. KEITHLEY wrote: > > > >> > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > >> > detected by configure. One need to patch configure in order to build > >> > with python 2.x (x> 4) installed. > >> > >> Seems like it would be easier to get autoconf and automake from the > >> NetBSD packages and just run `./autogen.sh && ./configure` > > > > I prefer patching the configure script. Running autogen introduce build > > dependencies on perl just to substitute a string on a single line: > > that's overkill. > > > > Who ever builds from source is required to run autogen.sh to produce > env specific configure and build files. Not quite. That's the whole point of having a configure script in the first place - to detect the environment at build time. One who builds from source should not require to run autogen.sh, just configure should be sufficient. Since configure itself is a generated script, and can possibly have mistakes and requirements change (like the one being discussed), that's when autogen.sh must be used to re-generate configure script. In this case however, the simplest approach would actually be to run autogen.sh till either: a) we upgrade the release build machine to use newer aclocal macros b) qualify geo-replication to work on python 3 and remove the check. Emmanuel, since the problem is not going to be a long lasting one (either of the two should fix your problem), I suggest you find a solution local to you in the interim. Even better, if someone can actually test and qualify geo-replication to work on python 3 it would ease solution "b" sooner. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 15 01:30:21 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 03:30:21 +0200 Subject: [Gluster-devel] python version In-Reply-To: Message-ID: <1kk4971.wh86xo1gypeoiM%manu@netbsd.org> Anand Avati wrote: > a) we upgrade the release build machine to use newer aclocal macros > > b) qualify geo-replication to work on python 3 and remove the check. Solution b is not enough: even if the configure script does not claim a specific version of python, it will still be unable to detect an installed python > 2.4 because it contains that: for am_cv_pathless_PYTHON in python python2 python2.4 python2.3 python2.2 python2.1 python2.0 none; do What about solution c? 
c) Tweak autogen.sh so that it patches generated configure and add the checks for python > 2.4 if they are missing: --- autogen.sh.orig 2012-05-15 03:22:48.000000000 +0200 +++ autogen.sh 2012-05-15 03:24:28.000000000 +0200 @@ -5,4 +5,6 @@ (libtoolize --automake --copy --force || glibtoolize --automake --copy --force) autoconf automake --add-missing --copy --foreign cd argp-standalone;./autogen.sh + +sed 's/for am_cv_pathless_PYTHON in python python2 python2.4/for am_cv_pathless_PYTHON in python python2 python3 python3.2 python3.1 python3.0 python2.7 2.6 python2.5 python2.4/' configure > configure.new && mv configure.new configure -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 15 04:20:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 06:20:29 +0200 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: Message-ID: <1kk4hl3.1qjswd01knbbvqM%manu@netbsd.org> Anand Babu Periasamy wrote: > AF_UNSPEC is should be be taken as IPv4/IPv6. It is named > appropriately. Default should be ipv4. > > I have not tested the patch. I did test it and it fixed the problem at mine. Here it is in gerrit: http://review.gluster.com/#change,3319 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 15 04:27:26 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 06:27:26 +0200 Subject: [Gluster-devel] NetBSD support in 3.3? Message-ID: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> Hi I still have a few pending submissions for NetBSD support in latest sources: http://review.gluster.com/3319 Use inet as default transport http://review.gluster.com/3320 Add missing (base|dir)name_r http://review.gluster.com/3321 NetBSD build fixes I would like to have 3.3 building without too many unintegrated patches on NetBSD. Is it worth working on pushing the changes above or is release-3.3 too close to release to expect such changes to get into it now? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From amarts at redhat.com Tue May 15 05:51:55 2012 From: amarts at redhat.com (Amar Tumballi) Date: Tue, 15 May 2012 11:21:55 +0530 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> Message-ID: <4FB1EEFB.2020509@redhat.com> On 05/15/2012 09:57 AM, Emmanuel Dreyfus wrote: > Hi > > I still have a few pending submissions for NetBSD support in latest > sources: > http://review.gluster.com/3319 Use inet as default transport > http://review.gluster.com/3320 Add missing (base|dir)name_r > http://review.gluster.com/3321 NetBSD build fixes > > I would like to have 3.3 building without too many unintegrated patches > on NetBSD. Is it worth working on pushing the changes above or is > release-3.3 too close to release to expect such changes to get into it > now? > Emmanuel, I understand your concerns, but I suspect we are very close to 3.3.0 release at this point of time, and hence it may be tight for taking these patches in. What we are planing is for a quicker 3.3.1 depending on the community feedback of 3.3.0 release, which should surely have your patches included. Hope that makes sense. Regards, Amar From manu at netbsd.org Tue May 15 10:13:07 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 10:13:07 +0000 Subject: [Gluster-devel] NetBSD support in 3.3? 
In-Reply-To: <4FB1EEFB.2020509@redhat.com> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> <4FB1EEFB.2020509@redhat.com> Message-ID: <20120515101307.GD3985@homeworld.netbsd.org> On Tue, May 15, 2012 at 11:21:55AM +0530, Amar Tumballi wrote: > I understand your concerns, but I suspect we are very close to 3.3.0 > release at this point of time, and hence it may be tight for taking > these patches in. Riht, I will therefore not request pullups to release-3.3 for theses changes, but I would appreciate if people could review them so that they have a chance to go in master. Will 3.3.1 be based on release-3.3, or will a new branch be forked? -- Emmanuel Dreyfus manu at netbsd.org From vbellur at redhat.com Tue May 15 10:14:38 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 15 May 2012 15:44:38 +0530 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <20120515101307.GD3985@homeworld.netbsd.org> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> <4FB1EEFB.2020509@redhat.com> <20120515101307.GD3985@homeworld.netbsd.org> Message-ID: <4FB22C8E.1@redhat.com> On 05/15/2012 03:43 PM, Emmanuel Dreyfus wrote: > Riht, I will therefore not request pullups to release-3.3 for theses > changes, but I would appreciate if people could review them so that they > have a chance to go in master. > > Will 3.3.1 be based on release-3.3, or will a new branch be forked? All 3.3.x releases will be based on release-3.3. It might be a good idea to rebase these changes to release-3.3 after they have been accepted in master. Vijay From manu at netbsd.org Tue May 15 11:51:36 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 13:51:36 +0200 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <4FB22C8E.1@redhat.com> Message-ID: <1kk51xf.8p0t3l1viyp1mM%manu@netbsd.org> Vijay Bellur wrote: > All 3.3.x releases will be based on release-3.3. It might be a good idea > to rebase these changes to release-3.3 after they have been accepted in > master. But after 3.3 release, as I understand. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ej1515.park at samsung.com Wed May 16 12:23:12 2012 From: ej1515.park at samsung.com (=?euc-kr?B?udrAusHY?=) Date: Wed, 16 May 2012 12:23:12 +0000 (GMT) Subject: [Gluster-devel] Asking about Gluster Performance Factors Message-ID: <0M44007MX7QO1Z40@mailout1.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201205162123598_1LI1H0JV.jpg Type: image/jpeg Size: 72722 bytes Desc: not available URL: From johnmark at redhat.com Wed May 16 14:38:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 10:38:50 -0400 (EDT) Subject: [Gluster-devel] Asking about Gluster Performance Factors In-Reply-To: <0M44007MX7QO1Z40@mailout1.samsung.com> Message-ID: <931185f2-f1b7-431f-96a0-1e7cb476b7d7@zmail01.collab.prod.int.phx2.redhat.com> Hi Ethan, ----- Original Message ----- > Dear Gluster Dev Team : > I'm Ethan, Assistant engineer in Samsung electronics. Reviewing your > paper, I have some questions of performance factors in gluster. Which paper? Can you provide a link? Also, please note that this is a community mailing list, and we cannot guarantee quick response times here - if you need a fast response, I'm happy to put you through to the right people. Thanks, John Mark Walker Gluster Community Guy > First, what does it mean the option "performance.cache-*"? Does it > mean read cache? 
If does, what's difference between the options > "prformance.cache-max-file-size" and "performance.cache-size" ? > I read your another paper("performance in a gluster system, versions > 3.1.x") and it says as below on Page 12, > (Gluster Native protocol does not implement write caching, as we > believe that the modest performance improvements from rite caching > do not justify the risk of cache coherency issues.) > Second, how much is the read throughput improved as configuring 2-way > replication? we need any statistics or something like that. > ("performance in a gluster system, versions 3.1.x") and it says as > below on Page 12, > (However, read throughput is generally improved by replication, as > reads can be delivered from either storage node) > I would ask you to return ASAP. From johnmark at redhat.com Wed May 16 15:56:32 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 11:56:32 -0400 (EDT) Subject: [Gluster-devel] Reminder: community.gluster.org In-Reply-To: <4b117086-34aa-4d8b-aede-ffae2e3abfbd@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1bb98699-b028-4f92-b8fd-603056aef57c@zmail01.collab.prod.int.phx2.redhat.com> Greetings all, Just a friendly reminder that we could use your help on community.gluster.org (hereafter 'c.g.o'). Someday in the near future, we will have 2-way synchronization between our mailing lists and c.g.o, but as of now, there are 2 places to ask and answer questions. I ask that for things with definite answers, even if they start out here on the mailing lists, please provide the question and answer on c.g.o. For lengthy conversations about using or developing GlusterFS, including ideas for new ideas, roadmaps, etc., the mailing lists are ideal for that. Why do we prefer c.g.o? Because it's Google-friendly :) So, if you see any existing questions over there that you are qualified to answer, please do weigh in with an answer. And as always, for quick "real-time" help, you're best served by visiting #gluster on the freenode IRC network. This has been a public service announcement from your friendly community guy. -JM From ndevos at redhat.com Wed May 16 19:56:04 2012 From: ndevos at redhat.com (Niels de Vos) Date: Wed, 16 May 2012 21:56:04 +0200 Subject: [Gluster-devel] Updated Wireshark packages for RHEL-6 and Fedora-17 available for testing Message-ID: <4FB40654.60703@redhat.com> Hi all, today I have merged support for GlusterFS 3.2 and 3.3 into one Wireshark 'dissector'. The packages with date 20120516 in the version support both the current stable 3.2.x version, and the latest 3.3.0qa41. Older 3.3.0 versions will likely have issues due to some changes in the RPC-AUTH protocol used. Updating to the latest qa41 release (or newer) is recommended anyway. I do not expect that we'll add support for earlier 3.3.0 releases. My repository with packages for RHEL-6 and Fedora-17 contains a .repo file for yum (save it in /etc/yum.repos.d): - http://repos.fedorapeople.org/repos/devos/wireshark-gluster/ RPMs for other Fedora or RHEL versions can be provided on request. Let me know if you need an other version (or architecture). Single patches for some different Wireshark versions are available from https://github.com/nixpanic/gluster-wireshark. A full history of commits can be found here: - https://github.com/nixpanic/gluster-wireshark-1.4/commits/master/ (Support for GlusterFS 3.3 was added by Akhila and Shree, thanks!) 
Please test and report success and problems, file a issues on github: https://github.com/nixpanic/gluster-wireshark-1.4/issues Some functionality is still missing, but with the current status, it should be good for most analysing already. With more issues filed, it makes it easier to track what items are important. Of course, you can also respond to this email and give feedback :-) After some more cleanup of the code, this dissector will be passed on for review and inclusion in the upstream Wireshark project. Some more testing results is therefore much appreciated. Thanks, Niels From johnmark at redhat.com Wed May 16 21:12:41 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 17:12:41 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! For GlusterFS 3.3 Beta 4 In-Reply-To: Message-ID: <5456de9c-6c8b-4995-ad1e-720c9c52c74f@zmail01.collab.prod.int.phx2.redhat.com> Greetings, We are planning to have one more beta release tomorrow. If all goes as planned, this will be the release candidate. In conjunction with the beta, I thought we should have a 24-hour GlusterFest, starting tomorrow at 8pm - http://www.gluster.org/community/documentation/index.php/GlusterFest 'What's a GlusterFest?' you may be asking. Well, it's all of the below: - Testing the software. Install the new beta (when it's released tomorrow) and put it through its paces. We will put some basic testing procedures on the GlusterFest page here - http://www.gluster.org/community/documentation/index.php/GlusterFest - Feel free to create your own testing procedures and link to it from the GlusterFest page - Finding bugs. See the current list of bugs targeted for this release: http://bit.ly/beta4bugs - Fixing bugs. If you're the kind of person who wants to submit patches, see our development workflow doc: http://www.gluster.org/community/documentation/index.php/Development_Work_Flow - and then get to know Gerritt: http://review.gluster.com/ The GlusterFest page will be updated with some basic testing procedures tomorrow, and GlusterFest will officially begin at 8pm PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. If you need assistance, see #gluster on Freenode for "real-time" questions, gluster-users and community.gluster.org for general usage questions, and gluster-devel for anything related to building, patching, and bug-fixing. To keep up with GlusterFest activity, I'll be sending updates from the @glusterorg account on Twitter, and I'm sure there will be traffic on the mailing lists, as well. Happy testing and bug-hunting! -JM From ej1515.park at samsung.com Thu May 17 01:08:50 2012 From: ej1515.park at samsung.com (=?euc-kr?B?udrAusHY?=) Date: Thu, 17 May 2012 01:08:50 +0000 (GMT) Subject: [Gluster-devel] Asking about Gluster Performance Factors Message-ID: <0M4500FX676Q1150@mailout4.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201205171008201_QKNMBDIF.jpg Type: image/jpeg Size: 72722 bytes Desc: not available URL: From johnmark at redhat.com Thu May 17 04:28:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 00:28:50 -0400 (EDT) Subject: [Gluster-devel] Fwd: Asking about Gluster Performance Factors In-Reply-To: Message-ID: <153525d7-fe8c-4f5c-aa06-097fcb4b0980@zmail01.collab.prod.int.phx2.redhat.com> See response below from Ben England. Also, note that this question should probably go in gluster-users. 
-JM ----- Forwarded Message ----- From: "Ben England" To: "John Mark Walker" Sent: Wednesday, May 16, 2012 8:23:30 AM Subject: Re: [Gluster-devel] Asking about Gluster Performance Factors JM, see comments marked with ben>>> below. ----- Original Message ----- From: "???" To: gluster-devel at nongnu.org Sent: Wednesday, May 16, 2012 5:23:12 AM Subject: [Gluster-devel] Asking about Gluster Performance Factors Samsung Enterprise Portal mySingle May 16, 2012 Dear Gluster Dev Team : I'm Ethan, Assistant engineer in Samsung electronics. Reviewing your paper, I have some questions of performance factors in gluster. First, what does it mean the option "performance.cache-*"? Does it mean read cache? If does, what's difference between the options "prformance.cache-max-file-size" and "performance.cache-size" ? I read your another paper("performance in a gluster system, versions 3.1.x") and it says as below on Page 12, (Gluster Native protocol does not implement write caching, as we believe that the modest performance improvements from rite caching do not justify the risk of cache coherency issues.) ben>>> While gluster processes do not implement write caching internally, there are at least 3 ways to improve write performance in a Gluster system. - If you use a RAID controller with a non-volatile writeback cache, the RAID controller can buffer writes on behalf of the Gluster server and thereby reduce latency. - XFS or any other local filesystem used within the server "bricks" can do "write-thru" caching, meaning that the writes can be aggregated and can be kept in the Linux buffer cache so that subsequent read requests can be satisfied from this cache, transparent to Gluster processes. - there is a "write-behind" translator in the native client that will aggregate small sequential write requests at the FUSE layer into larger network-level write requests. If the smallest possible application I/O size is a requirement, sequential writes can also be efficiently aggregated by an NFS client. Second, how much is the read throughput improved as configuring 2-way replication? we need any statistics or something like that. ("performance in a gluster system, versions 3.1.x") and it says as below on Page 12, (However, read throughput is generally improved by replication, as reads can be delivered from either storage node) ben>>> Yes, reads can be satisfied by either server in a replication pair. Since the gluster native client only reads one of the two replicas, read performance should be approximately the same for 2-replica file system as it would be for a 1-replica file system. The difference in performance is with writes, as you would expect. Sincerely yours, Ethan Eunjun Park Assistant Engineer, Solution Development Team, Media Solution Center 416, Maetan 3-dong, Yeongtong-gu, Suwon-si, Gyeonggi-do 443-742, Korea Mobile : 010-8609-9532 E-mail : ej1515.park at samsung.com http://www.samsung.com/sec _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From johnmark at redhat.com Thu May 17 06:35:10 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 02:35:10 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: Message-ID: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? -JM ----- Forwarded Message ----- From: "Kaushal M (Code Review)" Sent: Wednesday, May 16, 2012 11:32:26 PM Subject: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() Kaushal M has uploaded a new change for review. Change subject: client/protocol : Changes in client3_1_getxattr() ...................................................................... client/protocol : Changes in client3_1_getxattr() Backporting change 1d02db63ae from master. Copy args->loc to local->loc in client3_1_getxattr(). This prevents logs with "(null) (--)" in client3_1_getxattr_cbk(). Also save args->name in local->name and print it in the log as well. BUG: 812199 Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Signed-off-by: Kaushal M --- M xlators/protocol/client/src/client-helpers.c M xlators/protocol/client/src/client.h M xlators/protocol/client/src/client3_1-fops.c 3 files changed, 11 insertions(+), 2 deletions(-) git pull ssh://*/glusterfs refs/changes/50/3350/1 -- To view, visit http://review.gluster.com/3350 To unsubscribe, visit http://review.gluster.com/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Gerrit-PatchSet: 1 Gerrit-Project: glusterfs Gerrit-Branch: release-3.3 Gerrit-Owner: Kaushal M From rajesh at redhat.com Thu May 17 06:42:56 2012 From: rajesh at redhat.com (Rajesh Amaravathi) Date: Thu, 17 May 2012 02:42:56 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: +1 Regards, Rajesh Amaravathi, Software Engineer, GlusterFS RedHat Inc. ----- Original Message ----- From: "John Mark Walker" To: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 12:05:10 PM Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? -JM ----- Forwarded Message ----- From: "Kaushal M (Code Review)" Sent: Wednesday, May 16, 2012 11:32:26 PM Subject: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() Kaushal M has uploaded a new change for review. Change subject: client/protocol : Changes in client3_1_getxattr() ...................................................................... client/protocol : Changes in client3_1_getxattr() Backporting change 1d02db63ae from master. Copy args->loc to local->loc in client3_1_getxattr(). This prevents logs with "(null) (--)" in client3_1_getxattr_cbk(). Also save args->name in local->name and print it in the log as well. 
BUG: 812199 Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Signed-off-by: Kaushal M --- M xlators/protocol/client/src/client-helpers.c M xlators/protocol/client/src/client.h M xlators/protocol/client/src/client3_1-fops.c 3 files changed, 11 insertions(+), 2 deletions(-) git pull ssh://*/glusterfs refs/changes/50/3350/1 -- To view, visit http://review.gluster.com/3350 To unsubscribe, visit http://review.gluster.com/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Gerrit-PatchSet: 1 Gerrit-Project: glusterfs Gerrit-Branch: release-3.3 Gerrit-Owner: Kaushal M _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From vijay at gluster.com Thu May 17 06:55:42 2012 From: vijay at gluster.com (Vijay Bellur) Date: Thu, 17 May 2012 12:25:42 +0530 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> References: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FB4A0EE.40102@gluster.com> On 05/17/2012 12:05 PM, John Mark Walker wrote: > I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? Gerrit automatically sends out a notification to all registered users who are watching the project. Do we need an additional notification to gluster-devel if there's a considerable overlap between registered users of gluster-devel and gerrit? -Vijay From johnmark at redhat.com Thu May 17 07:26:23 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 03:26:23 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4A0EE.40102@gluster.com> Message-ID: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> There are close to 600 people now subscribed to gluster-devel - how many of them actually have an account on Gerritt? I honestly have no idea. Another thing this would do is send a subtle message to subscribers that this is not the place to discuss user issues, but perhaps there are better ways to do that. I've seen many projects do this - as well as send all bugzilla and github notifications, but I could also see some people getting annoyed. -JM ----- Original Message ----- > On 05/17/2012 12:05 PM, John Mark Walker wrote: > > I was thinking about sending these gerritt notifications to > > gluster-devel by default - what do y'all think? > > Gerrit automatically sends out a notification to all registered users > who are watching the project. Do we need an additional notification > to > gluster-devel if there's a considerable overlap between registered > users > of gluster-devel and gerrit? 
> > > -Vijay > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From ashetty at redhat.com Thu May 17 07:35:27 2012 From: ashetty at redhat.com (Anush Shetty) Date: Thu, 17 May 2012 13:05:27 +0530 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> References: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FB4AA3F.1090700@redhat.com> On 05/17/2012 12:56 PM, John Mark Walker wrote: > There are close to 600 people now subscribed to gluster-devel - how many of them actually have an account on Gerritt? I honestly have no idea. Another thing this would do is send a subtle message to subscribers that this is not the place to discuss user issues, but perhaps there are better ways to do that. > > I've seen many projects do this - as well as send all bugzilla and github notifications, but I could also see some people getting annoyed. > How about a weekly digest of the same. - Anush From manu at netbsd.org Thu May 17 09:02:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 17 May 2012 11:02:32 +0200 Subject: [Gluster-devel] Crashes with latest git code Message-ID: <1kk8js0.b6kp732ejixeM%manu@netbsd.org> Hi I get a lot of crashes on NetBSD with latest git code. looking at core dumps, it is obvious I get memory corruption, as I find various structure overwritten by texts (file path or content). Linking with electric fence produces a much earlier crash, always at the same place. Here is how it looks: Program terminated with signal 11, Segmentation fault. #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 (gdb) bt #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 #1 0xbaa5ec1e in gf_strdup (src=0x0) at ../../../../libglusterfs/src/mem-pool.h:119 #2 0xbaa76dbf in client3_1_getxattr (frame=0xbb77f5c0, this=0xba3cd000, data=0xbfbfe18c) at client3_1-fops.c:4641 #3 0xbaa59ab8 in client_getxattr (frame=0xbb77f5c0, this=0xba3cd000, loc=0xb9402dd0, name=0x0, xdata=0x0) at client.c:1452 #4 0xb9ac3c7d in afr_sh_metadata_sync_prepare (frame=0xba8026bc, this=0xba3ce000) at afr-self-heal-metadata.c:419 #5 0xb9ac428b in afr_sh_metadata_fix (frame=0xba8026bc, this=0xba3ce000, op_ret=0, op_errno=0) at afr-self-heal-metadata.c:522 #6 0xb9abeb2b in afr_sh_common_lookup_cbk (frame=0xba8026bc, cookie=0x1, this=0xba3ce000, op_ret=0, op_errno=0, inode=0xb8b001a0, buf=0xbfbfe424, xattr=0xba401394, postparent=0xbfbfe3bc) at afr-self-heal-common.c:1311 #7 0xbaa6dc10 in client3_1_lookup_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77f550) at client3_1-fops.c:2636 Frame 4 is this: STACK_WIND (frame, afr_sh_metadata_getxattr_cbk, priv->children[source], priv->children[source]->fops->getxattr, &local->loc, NULL, NULL); Then in frame 3, I get args.name = NULL client_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc, const char *name, dict_t *xdata) (...) args.name = name; (...) ret = proc->fn (frame, this, &args); In frame 2, args->name = NULL client3_1_getxattr (call_frame_t *frame, xlator_t *this, void *data) (...) args = data; (...) local->name = gf_strdup (args->name); And there we will crash in gf_strdup(). The root cause is afr_sh_metadata_sync_prepare() calling client_getxattr with NULL arguments. 
The fix is beyond my knowledge of glusterfs internals, but I am sure that some folks here will be able to comment. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From rabhat at redhat.com Thu May 17 09:11:55 2012 From: rabhat at redhat.com (Raghavendra Bhat) Date: Thu, 17 May 2012 05:11:55 -0400 (EDT) Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <1kk8js0.b6kp732ejixeM%manu@netbsd.org> Message-ID: <3044c1d3-8b15-4a9d-9d18-7343cf8a33f4@zmail01.collab.prod.int.phx2.redhat.com> Hi Emmanuel, A bug has already been filed for this (822385) and patch has been sent for the review (http://review.gluster.com/#change,3353). Regards, Raghavendra Bhat ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 2:32:32 PM Subject: [Gluster-devel] Crashes with latest git code Hi I get a lot of crashes on NetBSD with latest git code. looking at core dumps, it is obvious I get memory corruption, as I find various structure overwritten by texts (file path or content). Linking with electric fence produces a much earlier crash, always at the same place. Here is how it looks: Program terminated with signal 11, Segmentation fault. #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 (gdb) bt #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 #1 0xbaa5ec1e in gf_strdup (src=0x0) at ../../../../libglusterfs/src/mem-pool.h:119 #2 0xbaa76dbf in client3_1_getxattr (frame=0xbb77f5c0, this=0xba3cd000, data=0xbfbfe18c) at client3_1-fops.c:4641 #3 0xbaa59ab8 in client_getxattr (frame=0xbb77f5c0, this=0xba3cd000, loc=0xb9402dd0, name=0x0, xdata=0x0) at client.c:1452 #4 0xb9ac3c7d in afr_sh_metadata_sync_prepare (frame=0xba8026bc, this=0xba3ce000) at afr-self-heal-metadata.c:419 #5 0xb9ac428b in afr_sh_metadata_fix (frame=0xba8026bc, this=0xba3ce000, op_ret=0, op_errno=0) at afr-self-heal-metadata.c:522 #6 0xb9abeb2b in afr_sh_common_lookup_cbk (frame=0xba8026bc, cookie=0x1, this=0xba3ce000, op_ret=0, op_errno=0, inode=0xb8b001a0, buf=0xbfbfe424, xattr=0xba401394, postparent=0xbfbfe3bc) at afr-self-heal-common.c:1311 #7 0xbaa6dc10 in client3_1_lookup_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77f550) at client3_1-fops.c:2636 Frame 4 is this: STACK_WIND (frame, afr_sh_metadata_getxattr_cbk, priv->children[source], priv->children[source]->fops->getxattr, &local->loc, NULL, NULL); Then in frame 3, I get args.name = NULL client_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc, const char *name, dict_t *xdata) (...) args.name = name; (...) ret = proc->fn (frame, this, &args); In frame 2, args->name = NULL client3_1_getxattr (call_frame_t *frame, xlator_t *this, void *data) (...) args = data; (...) local->name = gf_strdup (args->name); And there we will crash in gf_strdup(). The root cause is afr_sh_metadata_sync_prepare() calling client_getxattr with NULL arguments. The fix is beyond my knowledge of glusterfs internals, but I am sure that some folks here will be able to comment. 
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From manu at netbsd.org Thu May 17 09:18:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 17 May 2012 11:18:29 +0200 Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <3044c1d3-8b15-4a9d-9d18-7343cf8a33f4@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kk8kq6.1m2qnzf1so7qgfM%manu@netbsd.org> Raghavendra Bhat wrote: > A bug has already been filed for this (822385) and patch has been sent for > the review (http://review.gluster.com/#change,3353). I looked at the patch, it does not fix the problem I reported: args->name is still NULL. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From rabhat at redhat.com Thu May 17 09:46:20 2012 From: rabhat at redhat.com (Raghavendra Bhat) Date: Thu, 17 May 2012 05:46:20 -0400 (EDT) Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <1kk8kq6.1m2qnzf1so7qgfM%manu@netbsd.org> Message-ID: In getxattr name is NULL means its equivalent listxattr. So args->name being NULL is ok. Process was crashing because it tried to do strdup (actually strlen in the gf_strdup) of the NULL pointer to a string. On wire we will send it as a null string with namelen set to 0 and protocol/server will understand it. On client side: req.name = (char *)args->name; if (!req.name) { req.name = ""; req.namelen = 0; } On server side: if (args.namelen) state->name = gf_strdup (args.name); ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Raghavendra Bhat" Cc: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 2:48:29 PM Subject: Re: [Gluster-devel] Crashes with latest git code Raghavendra Bhat wrote: > A bug has already been filed for this (822385) and patch has been sent for > the review (http://review.gluster.com/#change,3353). I looked at the patch, it does not fix the problem I reported: args->name is still NULL. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From jdarcy at redhat.com Thu May 17 11:47:52 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 17 May 2012 07:47:52 -0400 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4AA3F.1090700@redhat.com> References: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> <4FB4AA3F.1090700@redhat.com> Message-ID: <4FB4E568.8050601@redhat.com> On 05/17/2012 03:35 AM, Anush Shetty wrote: > > On 05/17/2012 12:56 PM, John Mark Walker wrote: >> There are close to 600 people now subscribed to gluster-devel - how many >> of them actually have an account on Gerritt? I honestly have no idea. >> Another thing this would do is send a subtle message to subscribers that >> this is not the place to discuss user issues, but perhaps there are better >> ways to do that. >> >> I've seen many projects do this - as well as send all bugzilla and github >> notifications, but I could also see some people getting annoyed. > > How about a weekly digest of the same. Excellent idea. 
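To tie the getxattr thread together: since a NULL name simply means listxattr, the crash is avoided by guarding the duplication itself rather than by filling in a name. The fragment below is only a sketch of that idea, not the actual change under review at http://review.gluster.com/#change,3353:

        /* args->name == NULL is a listxattr request, which is valid;
           only duplicate a real name so gf_strdup()/strlen() never
           runs on a NULL pointer */
        if (args->name)
                local->name = gf_strdup (args->name);

The name-is-optional convention mirrors the split between getxattr(2) and listxattr(2) at the system-call level: one fop carries both cases, and the namelen = 0 encoding on the wire (as in the client-side and server-side snippets quoted above) is how the server tells them apart.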
From johnmark at redhat.com Thu May 17 16:15:59 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 12:15:59 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4E568.8050601@redhat.com> Message-ID: ----- Original Message ----- > On 05/17/2012 03:35 AM, Anush Shetty wrote: > > > > How about a weekly digest of the same. Sounds reasonable. Now we just have to figure out how to implement :) -JM From vijay at build.gluster.com Thu May 17 16:51:43 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Thu, 17 May 2012 09:51:43 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0beta4 released Message-ID: <20120517165144.1BB041803EB@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0beta4/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0beta4.tar.gz This release is made off From johnmark at redhat.com Thu May 17 18:08:01 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 14:08:01 -0400 (EDT) Subject: [Gluster-devel] glusterfs-3.3.0beta4 released In-Reply-To: <20120517165144.1BB041803EB@build.gluster.com> Message-ID: <864fe250-bfd3-49ca-9310-2fc601411b83@zmail01.collab.prod.int.phx2.redhat.com> Reminder: GlusterFS 3.3 has been branched on GitHub, so you can pull the latest code from this branch if you want to test new fixes after the beta was released: https://github.com/gluster/glusterfs/tree/release-3.3 Also, note that this release features a license change in some files. We noted that some developers could not contribute code to the project because of compatibility issues around GPLv3. So, as a compromise, we changed the licensing in files that we deemed client-specific to allow for more contributors and a stronger developer community. Those files are now dual-licensed under the LGPLv3 and the GPLv2. For text of both of these license, see these URLs: http://www.gnu.org/licenses/lgpl.html http://www.gnu.org/licenses/old-licenses/gpl-2.0.html To see the list of files we modified with the new licensing, see this patchset from Kaleb: http://review.gluster.com/#change,3304 If you have questions or comments about this change, please do reach out to me. Thanks, John Mark ----- Original Message ----- > > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0beta4/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0beta4.tar.gz > > This release is made off > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From johnmark at redhat.com Thu May 17 20:34:56 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 16:34:56 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! For GlusterFS 3.3 Beta 4 In-Reply-To: <5456de9c-6c8b-4995-ad1e-720c9c52c74f@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <88ecc073-688f-4edc-8ff3-ccba3b6142a3@zmail01.collab.prod.int.phx2.redhat.com> An update: Kaleb was kind enough to port his HekaFS testing page for Fedora to GlusterFS. If you're looking for a series of things to test, see this URL: http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests By tonight, I'll have a handy form for reporting your results. We are at T-6:30 hours and counting until GlusterFest begins in earnest. 
For all updates related to GlusterFest, see this page: http://www.gluster.org/community/documentation/index.php/GlusterFest Please do post any series of tests that you would like to run. In particular, we're looking to test some of the new features of GlusterFS 3.3: - Object storage - HDFS compatibility library - Granular locking - More proactive self-heal Happy hacking, JM ----- Original Message ----- > Greetings, > > We are planning to have one more beta release tomorrow. If all goes > as planned, this will be the release candidate. In conjunction with > the beta, I thought we should have a 24-hour GlusterFest, starting > tomorrow at 8pm - > http://www.gluster.org/community/documentation/index.php/GlusterFest > > 'What's a GlusterFest?' you may be asking. Well, it's all of the > below: > > > - Testing the software. Install the new beta (when it's released > tomorrow) and put it through its paces. We will put some basic > testing procedures on the GlusterFest page here - > http://www.gluster.org/community/documentation/index.php/GlusterFest > > - Feel free to create your own testing procedures and link to it > from the GlusterFest page > > > - Finding bugs. See the current list of bugs targeted for this > release: http://bit.ly/beta4bugs > > > - Fixing bugs. If you're the kind of person who wants to submit > patches, see our development workflow doc: > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > > - and then get to know Gerritt: http://review.gluster.com/ > > > The GlusterFest page will be updated with some basic testing > procedures tomorrow, and GlusterFest will officially begin at 8pm > PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup > tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. > > > If you need assistance, see #gluster on Freenode for "real-time" > questions, gluster-users and community.gluster.org for general usage > questions, and gluster-devel for anything related to building, > patching, and bug-fixing. > > > To keep up with GlusterFest activity, I'll be sending updates from > the @glusterorg account on Twitter, and I'm sure there will be > traffic on the mailing lists, as well. > > > Happy testing and bug-hunting! > > -JM > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From manu at netbsd.org Fri May 18 07:49:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 18 May 2012 07:49:29 +0000 Subject: [Gluster-devel] python version In-Reply-To: References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: <20120518074929.GJ3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 04:58:18PM -0700, Anand Avati wrote: > Emmanuel, since the problem is not going to be a long lasting one (either > of the two should fix your problem), I suggest you find a solution local to > you in the interim. I submitted a tiny hack that solves the problem for everyone until automake is upgraded on glusterfs build system: http://review.gluster.com/3360 -- Emmanuel Dreyfus manu at netbsd.org From johnmark at redhat.com Fri May 18 15:02:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Fri, 18 May 2012 11:02:50 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! 
For GlusterFS 3.3 Beta 4 In-Reply-To: <88ecc073-688f-4edc-8ff3-ccba3b6142a3@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: Looks like we have a few testers who have reported their results already: http://www.gluster.org/community/documentation/index.php/GlusterFest 12 more hours! -JM ----- Original Message ----- > An update: > > Kaleb was kind enough to port his HekaFS testing page for Fedora to > GlusterFS. If you're looking for a series of things to test, see > this URL: > http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests > > > By tonight, I'll have a handy form for reporting your results. We are > at T-6:30 hours and counting until GlusterFest begins in earnest. > For all updates related to GlusterFest, see this page: > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > Please do post any series of tests that you would like to run. In > particular, we're looking to test some of the new features of > GlusterFS 3.3: > > - Object storage > - HDFS compatibility library > - Granular locking > - More proactive self-heal > > > Happy hacking, > JM > > > ----- Original Message ----- > > Greetings, > > > > We are planning to have one more beta release tomorrow. If all goes > > as planned, this will be the release candidate. In conjunction with > > the beta, I thought we should have a 24-hour GlusterFest, starting > > tomorrow at 8pm - > > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > > 'What's a GlusterFest?' you may be asking. Well, it's all of the > > below: > > > > > > - Testing the software. Install the new beta (when it's released > > tomorrow) and put it through its paces. We will put some basic > > testing procedures on the GlusterFest page here - > > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > > - Feel free to create your own testing procedures and link to it > > from the GlusterFest page > > > > > > - Finding bugs. See the current list of bugs targeted for this > > release: http://bit.ly/beta4bugs > > > > > > - Fixing bugs. If you're the kind of person who wants to submit > > patches, see our development workflow doc: > > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > > > > - and then get to know Gerritt: http://review.gluster.com/ > > > > > > The GlusterFest page will be updated with some basic testing > > procedures tomorrow, and GlusterFest will officially begin at 8pm > > PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup > > tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. > > > > > > If you need assistance, see #gluster on Freenode for "real-time" > > questions, gluster-users and community.gluster.org for general > > usage > > questions, and gluster-devel for anything related to building, > > patching, and bug-fixing. > > > > > > To keep up with GlusterFest activity, I'll be sending updates from > > the @glusterorg account on Twitter, and I'm sure there will be > > traffic on the mailing lists, as well. > > > > > > Happy testing and bug-hunting! 
> > > > -JM > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > > From manu at netbsd.org Fri May 18 16:15:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 18 May 2012 16:15:20 +0000 Subject: [Gluster-devel] memory corruption in release-3.3 Message-ID: <20120518161520.GL3985@homeworld.netbsd.org> Hi I still get crashes caused by memory corruption with latest release-3.3. My test case is a rm -Rf on a large tree. It seems I crash in two places: First crash flavor (trav is sometimes unmapped memory, sometimes NULL) #0 0xbbbb60ad in __iobuf_select_arena (iobuf_pool=0xbb70d400, page_size=128) at iobuf.c:453 453 if (trav->passive_cnt) { (gdb) print trav $1 = (struct iobuf_arena *) 0x414d202c (gdb) bt #0 0xbbbb60ad in __iobuf_select_arena (iobuf_pool=0xbb70d400, page_size=128) at iobuf.c:453 #1 0xbbbb655a in iobuf_get2 (iobuf_pool=0xbb70d400, page_size=24) at iobuf.c:604 #2 0xbaa549c7 in client_submit_request () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #3 0xbaa732c5 in client3_1_open () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #4 0xbaa574e6 in client_open () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #5 0xb9abac10 in afr_sh_data_open () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #6 0xb9abacb9 in afr_self_heal_data () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #7 0xb9ac2751 in afr_sh_metadata_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #8 0xb9ac457a in afr_self_heal_metadata () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #9 0xb9abd93f in afr_sh_missing_entries_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #10 0xb9ac169b in afr_self_heal () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #11 0xb9ae2e5b in afr_launch_self_heal () #12 0xb9ae3de9 in afr_lookup_perform_self_heal () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #13 0xb9ae4804 in afr_lookup_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #14 0xb9ae4fab in afr_lookup_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #15 0xbaa6dc10 in client3_1_lookup_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #16 0xbbb69716 in rpc_clnt_handle_reply () from /usr/local/lib/libgfrpc.so.0 #17 0xbbb699b3 in rpc_clnt_notify () from /usr/local/lib/libgfrpc.so.0 #18 0xbbb65989 in rpc_transport_notify () from /usr/local/lib/libgfrpc.so.0 #19 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #20 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #21 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #22 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #23 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #24 0x08050078 in main () Second crash flavor (it looks more like a double free) Program terminated with signal 11, Segmentation fault. #0 0xbb92661e in ?? () from /lib/libc.so.12 (gdb) bt #0 0xbb92661e in ?? 
() from /lib/libc.so.12 #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb376f in __gf_free (free_ptr=0xbb70d160) at mem-pool.c:258 #3 0xbbb7e17d in data_destroy (data=0xba301d4c) at dict.c:135 #4 0xbbb7ee18 in data_unref (this=0xba301d4c) at dict.c:470 #5 0xbbb7eb6b in dict_destroy (this=0xba4022d0) at dict.c:395 #6 0xbbb7ecab in dict_unref (this=0xba4022d0) at dict.c:432 #7 0xbaa164ba in __qr_inode_free () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #8 0xbaa27164 in qr_forget () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #9 0xbbb9b221 in __inode_destroy (inode=0xb8b017e4) at inode.c:320 #10 0xbbb9d0a5 in inode_table_prune (table=0xba3cc160) at inode.c:1235 #11 0xbbb9b64e in inode_unref (inode=0xb8b017e4) at inode.c:445 #12 0xbbb85249 in loc_wipe (loc=0xb9402dd0) at xlator.c:530 #13 0xb9ae126e in afr_local_cleanup () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #14 0xb9a9c66b in afr_unlink_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #15 0xb9ad2d5b in afr_unlock_common_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #16 0xb9ad38a2 in afr_unlock_entrylk_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so ---Type to continue, or q to quit--- #17 0xbaa68370 in client3_1_entrylk_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #18 0xbbb69716 in rpc_clnt_handle_reply () from /usr/local/lib/libgfrpc.so.0 #19 0xbbb699b3 in rpc_clnt_notify () from /usr/local/lib/libgfrpc.so.0 #20 0xbbb65989 in rpc_transport_notify () from /usr/local/lib/libgfrpc.so.0 #21 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #22 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #23 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #24 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #25 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #26 0x08050078 in main () (gdb) frame 2 #2 0xbbbb376f in __gf_free (free_ptr=0xbb70d160) at mem-pool.c:258 258 FREE (free_ptr); (gdb) x/1w free_ptr 0xbb70d160: 538978863 -- Emmanuel Dreyfus manu at netbsd.org From amarts at redhat.com Sat May 19 06:15:09 2012 From: amarts at redhat.com (Amar Tumballi) Date: Sat, 19 May 2012 11:45:09 +0530 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <20120518161520.GL3985@homeworld.netbsd.org> References: <20120518161520.GL3985@homeworld.netbsd.org> Message-ID: <4FB73A6D.9050601@redhat.com> On 05/18/2012 09:45 PM, Emmanuel Dreyfus wrote: > Hi > > I still get crashes caused by memory corruption with latest release-3.3. > My test case is a rm -Rf on a large tree. It seems I crash in two places: > Emmanuel, Can you please file bug report? different bugs corresponding to different crash dumps will help us. That helps in tracking development internally. Regards, Amar From manu at netbsd.org Sat May 19 10:29:55 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 12:29:55 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <20120518161520.GL3985@homeworld.netbsd.org> Message-ID: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Second crash flavor (it looks more like a double free) Here it is again at a different place. This is in loc_wipe, where loc->path is free'ed. 
Looking at the code, I see that there are places where loc->path is allocated by gf_strdup(). I see other places where it is copied from another buffer. Since this is done without reference counts, it seems likely that there is a double free somewhere. Opinions? (gdb) bt #0 0xbb92652a in ?? () from /lib/libc.so.12 #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb376f in __gf_free (free_ptr=0xb8250040) at mem-pool.c:258 #3 0xbbb85269 in loc_wipe (loc=0xba4cd010) at xlator.c:534 #4 0xbaa5e68a in client_local_wipe (local=0xba4cd010) at client-helpers.c:125 #5 0xbaa614d5 in client3_1_open_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77fa20) at client3_1-fops.c:421 #6 0xbbb69716 in rpc_clnt_handle_reply (clnt=0xba3c51c0, pollin=0xbb77d220) at rpc-clnt.c:788 #7 0xbbb699b3 in rpc_clnt_notify (trans=0xbb70ec00, mydata=0xba3c51e0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-clnt.c:907 #8 0xbbb65989 in rpc_transport_notify (this=0xbb70ec00, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-transport.c:489 #9 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #10 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #11 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #12 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #13 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #14 0x08050078 in main () -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vijay at build.gluster.com Sat May 19 12:35:21 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Sat, 19 May 2012 05:35:21 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa42 released Message-ID: <20120519123524.842501803FC@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa42/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa42.tar.gz This release is made off v3.3.0qa42 From manu at netbsd.org Sat May 19 13:50:25 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 15:50:25 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Message-ID: <1kkcml0.c7hab41bl4auaM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Looking at the code, I see that there are places where loc->path is > allocated by gf_strdup(). I see other places where it is copied from > another buffer. Since this is done without reference counts, it seems > likely that there is a double free somewhere. Opinions? I added a second argument to gf_strdup() so that the calling function can pass __func__, and I started logging gf_strdup() allocations to track a possible double free. ANd the result is... the offending free() is done on a loc->path that was not allocated by gf_strdup(). Can it be allocated by another function? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 19 15:07:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 17:07:53 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Message-ID: <1kkcpny.16h3fbd1pfhutzM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Looking at the code, I see that there are places where loc->path is > allocated by gf_strdup(). I see other places where it is copied from > another buffer. 
Since this is done without reference counts, it seems > likely that there is a double free somewhere. Opinions? I found a bug: Thou shalt not free(3) memory dirname(3) returned On Linux basename() and dirname() return a pointer with the string passed as argument. On BSD flavors, basename() and dirname() return static storage, or pthread specific storage. Both behaviour are compliant, but calling free on the result in the second case is a bug. --- xlators/cluster/afr/src/afr-dir-write.c.orig 2012-05-19 16:45:30.000000000 +0200 +++ xlators/cluster/afr/src/afr-dir-write.c 2012-05-19 17:03:17.000000000 +0200 @@ -55,14 +55,22 @@ if (op_errno) *op_errno = ENOMEM; goto out; } - parent->path = dirname (child_path); + parent->path = gf_strdup( dirname (child_path) ); + if (!parent->path) { + if (op_errno) + *op_errno = ENOMEM; + goto out; + } parent->inode = inode_ref (child->parent); uuid_copy (parent->gfid, child->pargfid); ret = 0; out: + if (child_path) + GF_FREE(child_path); + return ret; } /* {{{ create */-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 19 17:34:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 19:34:51 +0200 Subject: [Gluster-devel] mkdir race condition Message-ID: <1kkcuhi.1vqgbxy1lxb8w2M%manu@netbsd.org> On a replicated volume, mkdir quickly followed by the rename of a new directory child fails. # rm -Rf test && mkdir test && touch test/a && mv test/a test/b mv: rename test/a to test/b: No such file or directory # rm -Rf test && mkdir test && sleep 1 && touch test/a && mv test/a test/b (it works) Client log: [2012-05-19 18:49:43.933090] W [client3_1-fops.c:327:client3_1_mkdir_cbk] 0-pfs-client-0: remote operation failed: No such file or directory. Path: /test (00000000-0000-0000-0000-000000000000) [2012-05-19 18:49:43.944883] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.946265] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.961028] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.961528] W [fuse-bridge.c:1515:fuse_rename_cbk] 0-glusterfs-fuse: 27: /test/a -> /test/b => -1 (No such file or directory) Server log: [2012-05-19 18:49:58.455280] E [posix-handle.c:412:posix_handle_mkdir_hashes] 0-pfs-posix: error mkdir hash-1 /export/wd3a/.glusterfs/f6/8b (No such file or directory) [2012-05-19 18:49:58.455384] W [posix-handle.c:521:posix_handle_soft] 0-pfs-posix: mkdir /export/wd3a/.glusterfs/f6/8b/f68b2a33-a649-4705-9dfd-40a15f22589a failed (No such file or directory) [2012-05-19 18:49:58.455425] E [posix.c:968:posix_mkdir] 0-pfs-posix: setting gfid on /export/wd3a/test failed [2012-05-19 18:49:58.455558] E [posix.c:1010:posix_mkdir] 0-pfs-posix: post-operation lstat on parent of /export/wd3a/test failed: No such file or directory [2012-05-19 18:49:58.455664] I [server3_1-fops.c:529:server_mkdir_cbk] 0-pfs-server: 41: MKDIR /test (00000000-0000-0000-0000-000000000000) ==> -1 (No such file or directory) [2012-05-19 18:49:58.467548] I [server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 46: ENTRYLK (null) (--) ==> -1 (No such file or directory) [2012-05-19 18:49:58.468990] I [server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 47: ENTRYLK (null) (--) ==> -1 (No such file or directory) [2012-05-19 18:49:58.483726] I 
[server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 51: ENTRYLK (null) (--) ==> -1 (No such file or directory) It says it fails, but it seems it succeeded: silo# getextattr -x trusted.gfid /export/wd3a/test /export/wd3a/test 000 f6 8b 2a 33 a6 49 47 05 9d fd 40 a1 5f 22 58 9a ..*3.IG... at ._"X. Client is release-3.3 from yesterday. Server is master branch from may 14th. Is it a known problem? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 05:36:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 07:36:02 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / Message-ID: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> On release-3.3, glusterd_is_path_in_use() in xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has extended attribute enabled, and makes it impossible to create a volume with bricks from other filesystems (with EA enabled), if / does not support extended attributes. Is it on purpose? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 05:53:35 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 01:53:35 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> Message-ID: <1d6e3018-e614-4273-883c-1cca9efaf0b8@zmail01.collab.prod.int.phx2.redhat.com> Emmanuel, The assumption of EA being enabled in / filesystem or any prefix of brick path is an accidental side-effect of the way glusterd_is_path_in_use() is used in glusterd_brick_create_path(). The error handling should be accommodative to ENOTSUP. In short it is a bug. Will send out a patch immediately. thanks, krish ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 11:06:02 AM Subject: [Gluster-devel] 3.3 requires extended attribute on / On release-3.3, glusterd_is_path_in_use() in xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has extended attribute enabled, and makes it impossible to create a volume with bricks from other filesystems (with EA enabled), if / does not support extended attributes. Is it on purpose? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From manu at netbsd.org Sun May 20 05:56:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 07:56:53 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> Message-ID: <1kkdul5.4vmrbe1owph67M%manu@netbsd.org> Emmanuel Dreyfus wrote: > On release-3.3, glusterd_is_path_in_use() in > xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has > extended attribute enabled, and makes it impossible to create a volume > with bricks from other filesystems (with EA enabled), if / does not > support extended attributes. And even with EA enabled on root, creating a volume loops forever on reading unexistant trusted.gfid and trusted.glusterfs.volume-id on brick's parent directory. It gets ENODATA and retry forever. If I patch the function to just set in_use = 0 and return 0, I can create a volume. 
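The termination rule being discussed here can be sketched in a few lines. This is an illustration only, using the Linux-style lgetxattr() call and an invented helper name rather than the actual glusterd code: "attribute not present" (ENODATA/ENOATTR) and "extended attributes not supported" (ENOTSUP) both mean "this prefix is not in use", so the walk moves up one directory and always terminates at "/" instead of retrying.

/* Sketch only, not glusterd: walk from the brick path up to "/" and treat
 * ENODATA/ENOATTR/ENOTSUP as "not in use" so the loop always terminates.
 * Assumes an absolute path; uses the Linux lgetxattr() API for brevity. */
#include <errno.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

#ifndef ENOATTR
#define ENOATTR ENODATA
#endif

static int
path_prefix_in_use (const char *brick_path)
{
        char    curdir[4096];
        char    value[64];
        char   *dup;
        ssize_t ret;

        strncpy (curdir, brick_path, sizeof (curdir) - 1);
        curdir[sizeof (curdir) - 1] = '\0';

        for (;;) {
                ret = lgetxattr (curdir, "trusted.glusterfs.volume-id",
                                 value, sizeof (value));
                if (ret >= 0)
                        return 1;   /* some volume already claims this prefix */
                if (errno != ENODATA && errno != ENOATTR && errno != ENOTSUP)
                        return -1;  /* a real error: report it, do not retry */
                if (strcmp (curdir, "/") == 0)
                        return 0;   /* reached the root: not in use */

                /* copy dirname()'s result before freeing the duplicate, so
                 * this works whether dirname() returns its argument (Linux)
                 * or static storage (BSD) */
                dup = strdup (curdir);
                if (dup == NULL)
                        return -1;
                strncpy (curdir, dirname (dup), sizeof (curdir) - 1);
                curdir[sizeof (curdir) - 1] = '\0';
                free (dup);
        }
}

int
main (int argc, char **argv)
{
        if (argc > 1)
                printf ("%d\n", path_prefix_in_use (argv[1]));
        return 0;
}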
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ian.latter at midnightcode.org Sun May 20 06:12:39 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 16:12:39 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Hello, Couple of questions that might help make my module a little more sane; 0) Is there any developer docco? I've just done another quick search and I can't see any. Let me know if there is and I'll try and answer the below myself. 1) What is the difference between STACK_WIND and STACK_WIND_COOKIE? I.e. I've only ever used STACK_WIND, when should I use it versus the other? 2) Is there a way to write linearly within a single function within Gluster (or is there a reason why I wouldn't want to do that)? RE 2: This may stem from my lack of understanding of the broader Gluster internals. I am performing multiple fops per fop, which is creating structural inelegances in the code that make me think I'm heading down the wrong rabbit hole. I want to say; read() { // pull in other content while(want more) { _lookup() _open() _read() _close() } return iovec } But the way I've understood the Gluster internal structure is that I need to operate in a chain of related functions; _read_lookup_cbk_open_cbk_read_cbk() { wind _close() } _read_lookup_cbk_open_cbk() { wind _read() add to local->iovec } _lookup_cbk() { wind _open() } read() { while(want more) { wind _lookup() } return local->iovec } Am I missing something - or is there a nicer way of doing this? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From manu at netbsd.org Sun May 20 06:13:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 08:13:32 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1d6e3018-e614-4273-883c-1cca9efaf0b8@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kkdvl3.1p663u6iyul1oM%manu@netbsd.org> Krishnan Parthasarathi wrote: > Will send out a patch immediately. Great :-) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 06:13:33 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 08:13:33 +0200 Subject: [Gluster-devel] mkdir race condition In-Reply-To: <1kkcuhi.1vqgbxy1lxb8w2M%manu@netbsd.org> Message-ID: <1kkdvma.10s8o2rtrmcvpM%manu@netbsd.org> Emmanuel Dreyfus wrote: > On a replicated volume, mkdir quickly followed by the rename of a new > directory child fails. > > # rm -Rf test && mkdir test && touch test/a && mv test/a test/b > mv: rename test/a to test/b: No such file or directory > # rm -Rf test && mkdir test && sleep 1 && touch test/a && mv test/a test/b > (it works) I just reinstalled server from release-3.3 and now things make more sense. Any directory creation will report failure but will succeed: bacasel# mkdir /gfs/manu mkdir: /gfs/manu: No such file or directory bacasel# cd /gfs bacasel# ls manu Server log reports it fails because: [2012-05-20 07:59:23.775789] E [posix-handle.c:412:posix_handle_mkdir_hashes] 0-pfs-posix: error mkdir hash-1 /export/wd3a/.glusterfs/ec/e2 (No such file or directory) It seems posix_handle_mkdir_hashes() attempts to mkdir two directories at once: ec/ec2. How is it supposed to work? Should parent directory be created somewhere else? 
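For context on the log above: each gfid gets a handle under <brick>/.glusterfs/<first byte in hex>/<second byte in hex>/, so both hash levels have to exist (tolerating EEXIST) before the entry itself can be created. The sketch below only illustrates that layout and the create-parent-then-child idea; ensure_handle_dirs() is a made-up name, not the actual posix_handle_mkdir_hashes() implementation.

/* Illustration only: make sure both levels of the .glusterfs/<aa>/<bb>/
 * hash directory exist, treating EEXIST as success.  Names are invented. */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static int
mkdir_ok_if_exists (const char *path, mode_t mode)
{
        if (mkdir (path, mode) == -1 && errno != EEXIST)
                return -1;
        return 0;
}

static int
ensure_handle_dirs (const char *export, const unsigned char *gfid)
{
        char level1[4096];
        char level2[4096];

        snprintf (level1, sizeof (level1), "%s/.glusterfs/%02x",
                  export, gfid[0]);
        snprintf (level2, sizeof (level2), "%s/.glusterfs/%02x/%02x",
                  export, gfid[0], gfid[1]);

        if (mkdir_ok_if_exists (level1, 0700) == -1)
                return -1;
        return mkdir_ok_if_exists (level2, 0700);
}

int
main (int argc, char **argv)
{
        /* first two bytes of the gfid f68b2a33-... seen in the earlier
         * server log, which hashes to .glusterfs/f6/8b/ */
        unsigned char gfid[2] = { 0xf6, 0x8b };

        if (argc > 1 && ensure_handle_dirs (argv[1], gfid) == -1)
                perror ("ensure_handle_dirs");
        return 0;
}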
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 06:36:44 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 02:36:44 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdul5.4vmrbe1owph67M%manu@netbsd.org> Message-ID: <80c2c170-133e-4509-9ac5-062293a199ad@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 11:26:53 AM Subject: Re: [Gluster-devel] 3.3 requires extended attribute on / Emmanuel Dreyfus wrote: > On release-3.3, glusterd_is_path_in_use() in > xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has > extended attribute enabled, and makes it impossible to create a volume > with bricks from other filesystems (with EA enabled), if / does not > support extended attributes. > And even with EA enabled on root, creating a volume loops forever on > reading unexistant trusted.gfid and trusted.glusterfs.volume-id on > brick's parent directory. It gets ENODATA and retry forever. If I patch > the function to just set in_use = 0 and return 0, I can create a volume. It is strange that the you see glusterd_path_in_use() loop forever. If I am not wrong, the inner loop checks for presence of trusted.gfid and trusted.glusterfs.volume-id and should exit after that, and the outer loop performs dirname on the path repeatedly and dirname(3) guarantees such an operation should return "/" eventually, which we check. It would be great if you could provide values of local variables, "used" and "curdir" when you see the looping forever. I dont have a setup to check this immediately. thanks, krish -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Sun May 20 06:47:57 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 16:47:57 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205200647.q4K6lvdN009529@singularity.tronunltd.com> > > And I am sick of the word-wrap on this client .. I think > > you've finally convinced me to fix it ... what's normal > > these days - still 80 chars? > > I used to line-wrap (gnus and cool emacs extensions). It doesn't make > sense to line wrap any more. Let the email client handle it depending > on the screen size of the device (mobile / tablet / desktop). FYI found this; an hour of code parsing in the mail software and it turns out that it had no wrapping .. it came from the stupid textarea tag in the browser (wrap="hard"). Same principle (server side coded, non client savvy) - now set to "soft". So hopefully fixed :) Cheers. -- Ian Latter Late night coder .. http://midnightcode.org/ From kparthas at redhat.com Sun May 20 06:54:54 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 02:54:54 -0400 (EDT) Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Message-ID: Couple of questions that might help make my module a little more sane; 0) Is there any developer docco? I've just done another quick search and I can't see any. Let me know if there is and I'll try and answer the below myself. 1) What is the difference between STACK_WIND and STACK_WIND_COOKIE? I.e. 
I've only ever used STACK_WIND, when should I use it versus the other? STACK_WIND_COOKIE is used when we need to 'tie' the call wound with its corresponding callback. You can see this variant being used extensively in cluster xlators where it is used to identify the callback with the subvolume no. it is coming from. 2) Is there a way to write linearly within a single function within Gluster (or is there a reason why I wouldn't want to do that)? RE 2: This may stem from my lack of understanding of the broader Gluster internals. I am performing multiple fops per fop, which is creating structural inelegances in the code that make me think I'm heading down the wrong rabbit hole. I want to say; read() { // pull in other content while(want more) { _lookup() _open() _read() _close() } return iovec } But the way I've understood the Gluster internal structure is that I need to operate in a chain of related functions; _read_lookup_cbk_open_cbk_read_cbk() { wind _close() } _read_lookup_cbk_open_cbk() { wind _read() add to local->iovec } _lookup_cbk() { wind _open() } read() { while(want more) { wind _lookup() } return local->iovec } Am I missing something - or is there a nicer way of doing this? The above method you are trying to use is the "continuation passing style" that is extensively used in afr-inode-read.c and afr-transaction.c to perform multiple internal fops on the trigger of a single fop from the application. cluster/afr may give you some ideas on how you could structure it if you like that more. The other method I can think of (not sure if it would suit your needs) is to use the syncop framework (see libglusterfs/src/syncop.c). This allows one to make a 'synchronous' glusterfs fop. inside a xlator. The downside is that you can only make one call at a time. This may not be acceptable for cluster xlators (ie, xlator with more than one child xlator). Hope that helps, krish _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Sun May 20 07:23:12 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 17:23:12 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200723.q4K7NCO3009706@singularity.tronunltd.com> > 1) What is the difference between STACK_WIND > and STACK_WIND_COOKIE? I.e. I've only > ever used STACK_WIND, when should I use > it versus the other? > > STACK_WIND_COOKIE is used when we need to 'tie' the call > wound with its corresponding callback. You can see this > variant being used extensively in cluster xlators where it > is used to identify the callback with the subvolume no. it > is coming from. Ok - thanks. I will take a closer look at the examples for this .. this may help me ... > 2) Is there a way to write linearly within a single > function within Gluster (or is there a reason > why I wouldn't want to do that)? > > > RE 2: > > This may stem from my lack of understanding > of the broader Gluster internals. I am performing > multiple fops per fop, which is creating structural > inelegances in the code that make me think I'm > heading down the wrong rabbit hole. 
I want to > say; > > read() { > // pull in other content > while(want more) { > _lookup() > _open() > _read() > _close() > } > return iovec > } > > > But the way I've understood the Gluster internal > structure is that I need to operate in a chain of > related functions; > > _read_lookup_cbk_open_cbk_read_cbk() { > wind _close() > } > > _read_lookup_cbk_open_cbk() { > wind _read() > add to local->iovec > } > > _lookup_cbk() { > wind _open() > } > > read() { > while(want more) { > wind _lookup() > } > return local->iovec > } > > > > Am I missing something - or is there a nicer way of > doing this? > > The above method you are trying to use is the "continuation passing style" that > is extensively used in afr-inode-read.c and afr-transaction.c to perform multiple > internal fops on the trigger of a single fop from the application. cluster/afr may > give you some ideas on how you could structure it if you like that more. These may have been where I got that code style from originally .. I will go back to these two programs, thanks for the reference. I'm currently working my way through the afr-heal programs .. > The other method I can think of (not sure if it would suit your needs) > is to use the syncop framework (see libglusterfs/src/syncop.c). > This allows one to make a 'synchronous' glusterfs fop. inside a xlator. > The downside is that you can only make one call at a time. This may not > be acceptable for cluster xlators (ie, xlator with more than one child xlator). In the syncop framework, how much gets affected when I use it in my xlator. Does it mean that there's only one call at a time in the whole xlator (so the current write will stop all other reads) or is the scope only the fop (so that within this write, my child->fops are serial, but neighbouring reads on my xlator will continue in other threads)? And does that then restrict what can go above and below my xlator? I mean that my xlator isn't a cluster xlator but I would like it to be able to be used on top of (or underneath) a cluster xlator, will that no longer be possible? > Hope that helps, > krish Thanks Krish, every bit helps! -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Sun May 20 07:40:54 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 17:40:54 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200740.q4K7esfl009777@singularity.tronunltd.com> > > The other method I can think of (not sure if it would suit your needs) > > is to use the syncop framework (see libglusterfs/src/syncop.c). > > This allows one to make a 'synchronous' glusterfs fop. inside a xlator. > > The downside is that you can only make one call at a time. This may not > > be acceptable for cluster xlators (ie, xlator with more than one child xlator). > > In the syncop framework, how much gets affected when I > use it in my xlator. Does it mean that there's only one call > at a time in the whole xlator (so the current write will stop > all other reads) or is the scope only the fop (so that within > this write, my child->fops are serial, but neighbouring reads > on my xlator will continue in other threads)? And does that > then restrict what can go above and below my xlator? I > mean that my xlator isn't a cluster xlator but I would like it > to be able to be used on top of (or underneath) a cluster > xlator, will that no longer be possible? 
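A GlusterFS-free toy version of that continuation-passing shape may help make the structure visible (every name below is invented, and the "fops" are plain function calls standing in for winds): each step finishes by invoking the next one, carries its state in a per-request local structure, and the last step hands the assembled result back through a completion callback, much like unwinding to the caller via frame->local in afr.

/* Toy sketch, no GlusterFS headers: a lookup->open->read->close chain where
 * each step calls the next and state lives in a per-request "local". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct local {
        char   path[64];
        char   buf[256];
        size_t used;
        void (*done) (struct local *l);   /* final continuation ("unwind") */
};

static void step_open  (struct local *l);
static void step_read  (struct local *l);
static void step_close (struct local *l);

static void
step_lookup (struct local *l)
{
        printf ("lookup %s\n", l->path);
        step_open (l);                    /* next step; a wind in a translator */
}

static void
step_open (struct local *l)
{
        printf ("open   %s\n", l->path);
        step_read (l);
}

static void
step_read (struct local *l)
{
        l->used += (size_t) snprintf (l->buf + l->used,
                                      sizeof (l->buf) - l->used,
                                      "<data from %s>", l->path);
        step_close (l);
}

static void
step_close (struct local *l)
{
        printf ("close  %s\n", l->path);
        l->done (l);                      /* give the result back */
}

static void
request_done (struct local *l)
{
        printf ("assembled: %s\n", l->buf);
        free (l);
}

int
main (void)
{
        struct local *l = calloc (1, sizeof (*l));

        if (l == NULL)
                return 1;
        strcpy (l->path, "test/a");
        l->done = request_done;
        step_lookup (l);
        return 0;
}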
> I've just taken a look at xlators/cluster/afr/src/pump.c for some syncop usage examples and I really like what I see there. If syncop only serialises/syncs activity that I code within a given fop of my xlator and doesn't impose serial/ sync limits on the parents or children of my xlator then this looks like the right path. I want to be sure that it won't result in a globally syncronous outcome though (like ignoring a cache xlator under mine to get a true disk read) - I just need the internals of my calls to be linear. -- Ian Latter Late night coder .. http://midnightcode.org/ From manu at netbsd.org Sun May 20 08:11:04 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:11:04 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <80c2c170-133e-4509-9ac5-062293a199ad@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kke141.ffp9fr1meqkgbM%manu@netbsd.org> Krishnan Parthasarathi wrote: > It is strange that the you see glusterd_path_in_use() loop forever. If I > am not wrong, the inner loop checks for presence of trusted.gfid and > trusted.glusterfs.volume-id and should exit after that, and the outer loop > performs dirname on the path repeatedly and dirname(3) guarantees such an > operation should return "/" eventually, which we check. Here is my setup when I tried that: / with EA enabled /export/wd3a ibrick with EA enabled But I may have been testing with an untintended patch in glusterd_path_in_use(). I will retry with the right fix once it will be available. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 08:30:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:30:53 +0200 Subject: [Gluster-devel] mkdir race condition In-Reply-To: <1kkdvma.10s8o2rtrmcvpM%manu@netbsd.org> Message-ID: <1kke28c.rugeav1w049sdM%manu@netbsd.org> Emmanuel Dreyfus wrote: > It seems posix_handle_mkdir_hashes() attempts to mkdir two directories > at once: ec/ec2. How is it supposed to work? Should parent directory be > created somewhere else? This fixes the problem. Any comment? --- xlators/storage/posix/src/posix-handle.c.orig +++ xlators/storage/posix/src/posix-handle.c @@ -405,8 +405,16 @@ parpath = dirname (duppath); parpath = dirname (duppath); ret = mkdir (parpath, 0700); + if (ret == -1 && errno == ENOENT) { + char *tmppath = NULL; + + tmppath = strdupa(parpath); + ret = mkdir (dirname (tmppath), 0700); + if (ret == 0) + ret = mkdir (parpath, 0700); + } if (ret == -1 && errno != EEXIST) { gf_log (this->name, GF_LOG_ERROR, "error mkdir hash-1 %s (%s)", parpath, strerror (errno)); -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 08:47:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:47:02 +0200 Subject: [Gluster-devel] rename(2) race condition Message-ID: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> After I patched to fix the mkdir issue, I now encounter a race in rename(2). 
Most of the time it works, but sometimes: 3548 1 tar CALL open(0xbb9010e0,0xa02,0x180) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET open 8 3548 1 tar CALL __fstat50(8,0xbfbfe69c) 3548 1 tar RET __fstat50 0 3548 1 tar CALL write(8,0x8067880,0x16) 3548 1 tar GIO fd 8 wrote 22 bytes "Nnetbsd-5-1-2-RELEASE\n" 3548 1 tar RET write 22/0x16 3548 1 tar CALL close(8) 3548 1 tar RET close 0 3548 1 tar CALL lchmod(0xbb9010e0,0x1a4) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET lchmod 0 3548 1 tar CALL __lutimes50(0xbb9010e0,0xbfbfe6d8) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET __lutimes50 0 3548 1 tar CALL rename(0xbb9010e0,0x8071584) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET rename -1 errno 13 Permission denied I can reproduce it with the command below. It runs fine for a few seconds and then hits permission denied. It needs a level of hierarchy to exhibit the behavior: just install a b will not fail. mkdir test && echo "xxx" > test/a ; while [ 1 ] ; do rm -f test/b && install test/a test/b ; done -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From mihai at patchlog.com Sun May 20 09:19:34 2012 From: mihai at patchlog.com (Mihai Secasiu) Date: Sun, 20 May 2012 12:19:34 +0300 Subject: [Gluster-devel] glusterfs on MacOSX Message-ID: <4FB8B726.10500@patchlog.com> Hello, I am trying to get glusterfs ( 3.2.6, server ) to work on MacOSX ( Lion - I think , darwin kernel 11.3 ). So far I've been able to make it compile with a few patches and --disable-fuse-client. I want to create a volume on a MacMini that will be a replica of another volume stored on a linux server in a different location. The volume stored on the MacMini would also have to be mounted on the macmini. Since the fuse client is broken because it's built to use macfuse and that doesn't work anymore on the latest MacOSX I want to mount the volume over nfs and I've been able to do that ( with a small patch to the xdr code ) but it's really really slow. It's so slow that mounting the volume through a remote node is a lot faster. Also mounting the same volume on a remote node is fast so the problem is definitely in the nfs server on the MacOSX. I did a strace ( dtruss ) on it and it seems like it's doing a lot of polling. Could this be the cause of the slowness ? If anyone wants to try this you can fetch it from https://github.com/mihaisecasiu/glusterfs/tree/release-3.2 Thanks From manu at netbsd.org Sun May 20 12:43:52 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 14:43:52 +0200 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> Message-ID: <1kkee8d.8hdhfs177z5zdM%manu@netbsd.org> Emmanuel Dreyfus wrote: > After I patched to fix the mkdir issue, I now encounter a race in > rename(2). Most of the time it works, but sometimes: And the problem only happens when running as an unprivileged user. It works fine for root. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 14:14:10 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 10:14:10 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kke141.ffp9fr1meqkgbM%manu@netbsd.org> Message-ID: Emmanuel, I have submitted the fix for review: http://review.gluster.com/3380 I have not tested the fix with "/" having EA disabled. It would be great if you could confirm the looping forever doesn't happen with this fix. 
thanks, krish ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Krishnan Parthasarathi" Cc: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 1:41:04 PM Subject: Re: [Gluster-devel] 3.3 requires extended attribute on / Krishnan Parthasarathi wrote: > It is strange that the you see glusterd_path_in_use() loop forever. If I > am not wrong, the inner loop checks for presence of trusted.gfid and > trusted.glusterfs.volume-id and should exit after that, and the outer loop > performs dirname on the path repeatedly and dirname(3) guarantees such an > operation should return "/" eventually, which we check. Here is my setup when I tried that: / with EA enabled /export/wd3a ibrick with EA enabled But I may have been testing with an untintended patch in glusterd_path_in_use(). I will retry with the right fix once it will be available. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 21 04:51:59 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 06:51:59 +0200 Subject: [Gluster-devel] zero'ed local data in mdc_setattr_cbk Message-ID: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> Hi Here is a backtrace for a SIGSEGV in md-cache code. Note inode = NULL when mdc_inode_iatt_set() is called by mdc_setattr_cbk(). It seems that local got corupted in the later function #0 0xbbb3a7c9 in pthread_spin_lock () from /usr/lib/libpthread.so.1 #1 0xbaa09d8c in mdc_inode_prep (this=0xba3e5000, inode=0x0) at md-cache.c:267 #2 0xbaa0a1bf in mdc_inode_iatt_set (this=0xba3e5000, inode=0x0, iatt=0xb9401d40) at md-cache.c:384 #3 0xbaa0ee16 in mdc_setattr_cbk (frame=0xbb77f400, cookie=0xbb77f470, this=0xba3e5000, op_ret=0, op_errno=0, prebuf=0xb9401cd8, postbuf=0xb9401d40, xdata=0x0) at md-cache.c:1423 #4 0xbaa1d0ec in qr_fsetattr_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #5 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f0f0, cookie=0xbb77f160, this=0xba3e3000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #6 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f160, cookie=0xbb77f1d0, this=0xba3e2000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #7 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f1d0, cookie=0xbb77f240, this=0xba3e1000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #8 0xb9aa9d23 in afr_fsetattr_unwind (frame=0xba801ee8, this=0xba3d1000) at afr-inode-write.c:1160 #9 0xb9aa9f01 in afr_fsetattr_wind_cbk (frame=0xba801ee8, cookie=0x0, this=0xba3d1000, op_ret=0, op_errno=0, preop=0xbfbfe880, postop=0xbfbfe818, xdata=0x0) at afr-inode-write.c:1221 #10 0xbaa6a099 in client3_1_fsetattr_cbk (req=0xb90010d8, iov=0xb90010f8, count=1, myframe=0xbb77f010) at client3_1-fops.c:1897 #11 0xbbb6975e in rpc_clnt_handle_reply (clnt=0xba3c5270, pollin=0xbb77d220) at rpc-clnt.c:788 #12 0xbbb699fb in rpc_clnt_notify (trans=0xbb70f000, mydata=0xba3c5290, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-clnt.c:907 #13 0xbbb659c7 in rpc_transport_notify (this=0xbb70f000, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-transport.c:489 #14 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #15 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #16 0xbbbb281f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=2) at event.c:357 #17 0xbbbb2a8b in 
event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #18 0xbbbb2db7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #19 0x0805015e in main () (gdb) frame 3 #3 0xbaa0ee16 in mdc_setattr_cbk (frame=0xbb77f400, cookie=0xbb77f470, this=0xba3e5000, op_ret=0, op_errno=0, prebuf=0xb9401cd8, postbuf=0xb9401d40, xdata=0x0) at md-cache.c:1423 1423 mdc_inode_iatt_set (this, local->loc.inode, postbuf); (gdb) print *local $2 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, fd = 0xb8f9d054, linkname = 0x0, xattr = 0x0} -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 21 10:14:24 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 10:14:24 +0000 Subject: [Gluster-devel] zero'ed local data in mdc_setattr_cbk In-Reply-To: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> References: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> Message-ID: <20120521101424.GA10504@homeworld.netbsd.org> On Mon, May 21, 2012 at 06:51:59AM +0200, Emmanuel Dreyfus wrote: > Here is a backtrace for a SIGSEGV in md-cache code. Note inode = NULL > when mdc_inode_iatt_set() is called by mdc_setattr_cbk(). I submitted a patch to fix it, please review http://review.gluster.com/3383 -- Emmanuel Dreyfus manu at netbsd.org From jdarcy at redhat.com Mon May 21 12:24:30 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 21 May 2012 08:24:30 -0400 Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> References: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Message-ID: <4FBA33FE.3050602@redhat.com> On 05/20/2012 02:12 AM, Ian Latter wrote: > Hello, > > > Couple of questions that might help make my > module a little more sane; > > 0) Is there any developer docco? I've just done > another quick search and I can't see any. Let > me know if there is and I'll try and answer the > below myself. Your best bet right now (if I may say so) is the stuff I've posted on hekafs.org - the "Translator 101" articles plus the API overview at http://hekafs.org/dist/xlator_api_2.html > 1) What is the difference between STACK_WIND > and STACK_WIND_COOKIE? I.e. I've only > ever used STACK_WIND, when should I use > it versus the other? I see Krishnan has already covered this. > 2) Is there a way to write linearly within a single > function within Gluster (or is there a reason > why I wouldn't want to do that)? Any blocking ops would have to be built on top of async ops plus semaphores etc. because (unlike e.g. an HTTP server) the underlying sockets etc. are shared/multiplexed between users and activities. Thus you'd get much more context switching that way than if you stay within the async/continuation style. Some day in the distant future, I'd like to work some more on a preprocessor that turns linear code into async code so that it's easier to write but retains the performance and resource-efficiency advantages of an essentially async style. I did some work (http://pl.atyp.us/ripper/UserGuide.html) in this area several years ago, but it has probably bit-rotted to hell since then. With more recent versions of gcc and LLVM it should be possible to overcome some of the limitations that version had. 
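To illustrate the point about blocking being layered on async plus semaphores, here is a generic sketch with no GlusterFS API at all (every name in it is invented): the "blocking" wrapper issues the asynchronous call and then sleeps on a semaphore that the completion callback posts, which is precisely where the extra sleep/wake-up, and therefore the extra context switching, comes from.

/* Generic illustration (compile with -pthread); not GlusterFS code. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

typedef void (*fop_cbk_t) (void *cookie, int op_ret, int op_errno);

struct async_call {
        fop_cbk_t cbk;
        void     *cookie;
};

struct waiter {
        sem_t sem;
        int   op_ret;
        int   op_errno;
};

static void *
async_worker (void *arg)
{
        struct async_call *call = arg;

        /* pretend some I/O completed, then deliver the callback */
        call->cbk (call->cookie, 0, 0);
        return NULL;
}

/* Stand-in for an asynchronous fop: completion arrives on another thread. */
static void
async_fop (struct async_call *call)
{
        pthread_t tid;

        pthread_create (&tid, NULL, async_worker, call);
        pthread_detach (tid);
}

static void
blocking_cbk (void *cookie, int op_ret, int op_errno)
{
        struct waiter *w = cookie;

        w->op_ret   = op_ret;
        w->op_errno = op_errno;
        sem_post (&w->sem);       /* wake up the sleeping caller */
}

/* Blocking wrapper: issue the async call, then sleep until the callback
 * posts the semaphore -- one extra sleep/wake-up per operation. */
static int
blocking_fop (int *op_errno)
{
        struct waiter     w;
        struct async_call call = { blocking_cbk, &w };

        sem_init (&w.sem, 0, 0);
        async_fop (&call);
        sem_wait (&w.sem);
        sem_destroy (&w.sem);

        *op_errno = w.op_errno;
        return w.op_ret;
}

int
main (void)
{
        int op_errno = 0;

        printf ("ret=%d errno=%d\n", blocking_fop (&op_errno), op_errno);
        return 0;
}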
From manu at netbsd.org Mon May 21 16:27:21 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 18:27:21 +0200 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> Message-ID: <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> Emmanuel Dreyfus wrote: > 3548 1 tar CALL rename(0xbb9010e0,0x8071584) > 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" > 3548 1 tar RET rename -1 errno 13 Permission denied I tracked this down to FUSE LOOKUP operation that do not set fuse_entry's attr.uid correctly (it is left set to 0). Here is the summary of my findings so far: - as un unprivilegied user, I create and delete files like crazy - most of the time everything is fine - sometime a LOOKUP for a file I created (as an unprivilegied user) will return a fuse_entry with uid set to 0, which cause the kernel to raise EACCESS when I try to delete the file. Here is an example of a FUSE trace, produced by the test case while [ 1 ] ; do cp /etc/fstab test/foo1 ; rm test/foo1 ; done > unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1) < unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1), error = -2 > unique = 1436, nodeid = 3098542296, opcode = CREATE (35) < unique = 1436, nodeid = 3098542296, opcode = CREATE (35), error = 0 > unique = 1437, nodeid = 3098542396, opcode = SETATTR (4) < unique = 1437, nodeid = 3098542396, opcode = SETATTR (4), error = 0 > unique = 1438, nodeid = 3098542396, opcode = WRITE (16) < unique = 1438, nodeid = 3098542396, opcode = WRITE (16), error = 0 > unique = 1439, nodeid = 3098542396, opcode = FSYNC (20) < unique = 1439, nodeid = 3098542396, opcode = FSYNC (20), error = 0 > unique = 1440, nodeid = 3098542396, opcode = RELEASE (18) < unique = 1440, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > unique = 1441, nodeid = 3098542396, opcode = GETATTR (3) < unique = 1441, nodeid = 3098542396, opcode = GETATTR (3), error = 0 > unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1) < unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 --> here I sometimes get fuse_entry's attr.uid incorrectly set to 0 --> When this happens, LOOKUP fails and returns EACCESS. > unique = 1443, nodeid = 3098542296, opcode = UNLINK (10) < unique = 1443, nodeid = 3098542296, opcode = UNLINK (10), error = 0 > unique = 1444, nodeid = 3098542396, opcode = FORGET (2) Is it possible that metadata writes are now so asynchronous that a subsequent lookup cannot retreive the up to date value? If that is the problem, how can I fix it? There is nothing telling the FUSE implementation that a CREATE or SETATTR has just partially completed and has metadata pending. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ian.latter at midnightcode.org Mon May 21 23:02:44 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 09:02:44 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205212302.q4LN2idg017478@singularity.tronunltd.com> > > 0) Is there any developer docco? I've just done > > another quick search and I can't see any. Let > > me know if there is and I'll try and answer the > > below myself. > > Your best bet right now (if I may say so) is the stuff I've posted on > hekafs.org - the "Translator 101" articles plus the API overview at > > http://hekafs.org/dist/xlator_api_2.html You must say so - there is so little docco. 
Actually before I posted I went and re-read your Translator 101 docs as you referred them to me on 10 May, but I hadn't found your API overview - thanks (for both)! > > 2) Is there a way to write linearly within a single > > function within Gluster (or is there a reason > > why I wouldn't want to do that)? > > Any blocking ops would have to be built on top of async ops plus semaphores > etc. because (unlike e.g. an HTTP server) the underlying sockets etc. are > shared/multiplexed between users and activities. Thus you'd get much more > context switching that way than if you stay within the async/continuation style. Interesting - I haven't ever done semaphore coding, but it may not be needed. The syncop framework that Krish referred too seems to do this via a mutex lock (synctask_yawn) and a context switch (synctask_yield). What's the drawback with increased context switching? After my email thread with Krish I decided against syncop, but the flow without was going to be horrific. The only way I could bring it back to anything even half as sane as the afr code (which can cleverly loop through its own _cbk's recursively - I like that, whoever put that together) was to have the last cbk in a chain (say the "close_cbk") call the original function with an index or stepper increment. But after sitting on the idea for a couple of days I actually came to the same conclusion as Manu did in the last message. I.e. without docco I have been writing to what seems to work, and in my 2009 code (I saw last night) a "mkdir" wind followed by "create" code in the same function - which I believe, now, is probably a race condition (because of the threaded/async structure forced through the wind/call macro model). In that case I *do* want a synchronous write - but only within my xlator (which, if I'm reading this right, *is* what syncop does) - as opposed to an end-to-end synchronous write (being sync'd through the full stack of xlators: ignoring caching, waiting for replication to be validated, etc). Although, the same synchronous outcome comes from the chained async calls ... but then we get back to the readability/ fixability of the code. > Some day in the distant future, I'd like to work some more on a preprocessor > that turns linear code into async code so that it's easier to write but retains > the performance and resource-efficiency advantages of an essentially async > style. I did some work (http://pl.atyp.us/ripper/UserGuide.html) in this area > several years ago, but it has probably bit-rotted to hell since then. With > more recent versions of gcc and LLVM it should be possible to overcome some of > the limitations that version had. Yes, I had a very similar thought - a C pre-processor isn't in my experience or time scale though; I considered writing up a script that would chain it out in C for me. I was going to borrow from a script that I wrote which builds one of the libMidnightCode header files but even that seemed impractical .. would anyone be able to debug it? Would I even understand in 2yrs from now - lol So I think the long and the short of it is that anything I do here won't be pretty .. or perhaps: one will look pretty and the other will run pretty :) -- Ian Latter Late night coder .. 
http://midnightcode.org/ From anand.avati at gmail.com Mon May 21 23:59:07 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 16:59:07 -0700 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> References: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> Message-ID: Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the chown() or chmod() syscall issued by the application strictly block till GlusterFS's fuse_setattr_cbk() is called? Avati On Mon, May 21, 2012 at 9:27 AM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > > > 3548 1 tar CALL rename(0xbb9010e0,0x8071584) > > 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" > > 3548 1 tar RET rename -1 errno 13 Permission denied > > I tracked this down to FUSE LOOKUP operation that do not set > fuse_entry's attr.uid correctly (it is left set to 0). > > Here is the summary of my findings so far: > - as un unprivilegied user, I create and delete files like crazy > - most of the time everything is fine > - sometime a LOOKUP for a file I created (as an unprivilegied user) will > return a fuse_entry with uid set to 0, which cause the kernel to raise > EACCESS when I try to delete the file. > > Here is an example of a FUSE trace, produced by the test case > while [ 1 ] ; do cp /etc/fstab test/foo1 ; rm test/foo1 ; done > > > unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1), error = -2 > > unique = 1436, nodeid = 3098542296, opcode = CREATE (35) > < unique = 1436, nodeid = 3098542296, opcode = CREATE (35), error = 0 > > unique = 1437, nodeid = 3098542396, opcode = SETATTR (4) > < unique = 1437, nodeid = 3098542396, opcode = SETATTR (4), error = 0 > > unique = 1438, nodeid = 3098542396, opcode = WRITE (16) > < unique = 1438, nodeid = 3098542396, opcode = WRITE (16), error = 0 > > unique = 1439, nodeid = 3098542396, opcode = FSYNC (20) > < unique = 1439, nodeid = 3098542396, opcode = FSYNC (20), error = 0 > > unique = 1440, nodeid = 3098542396, opcode = RELEASE (18) > < unique = 1440, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > > unique = 1441, nodeid = 3098542396, opcode = GETATTR (3) > < unique = 1441, nodeid = 3098542396, opcode = GETATTR (3), error = 0 > > unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 > > --> here I sometimes get fuse_entry's attr.uid incorrectly set to 0 > --> When this happens, LOOKUP fails and returns EACCESS. > > > unique = 1443, nodeid = 3098542296, opcode = UNLINK (10) > < unique = 1443, nodeid = 3098542296, opcode = UNLINK (10), error = 0 > > unique = 1444, nodeid = 3098542396, opcode = FORGET (2) > > > Is it possible that metadata writes are now so asynchronous that a > subsequent lookup cannot retreive the up to date value? If that is the > problem, how can I fix it? There is nothing telling the FUSE > implementation that a CREATE or SETATTR has just partially completed and > has metadata pending. > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From anand.avati at gmail.com Tue May 22 00:11:47 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 17:11:47 -0700 Subject: [Gluster-devel] A healing translator In-Reply-To: <4FA8E8AB.2040604@datalab.es> References: <4FA8E8AB.2040604@datalab.es> Message-ID: On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez wrote: > Hello developers, > > I would like to expose some ideas we are working on to create a new kind > of translator that should be able to unify and simplify to some extent the > healing procedures of complex translators. > > Currently, the only translator with complex healing capabilities that we > are aware of is AFR. We are developing another translator that will also > need healing capabilities, so we thought that it would be interesting to > create a new translator able to handle the common part of the healing > process and hence to simplify and avoid duplicated code in other > translators. > > The basic idea of the new translator is to handle healing tasks nearer the > storage translator on the server nodes instead to control everything from a > translator on the client nodes. Of course the heal translator is not able > to handle healing entirely by itself, it needs a client translator which > will coordinate all tasks. The heal translator is intended to be used by > translators that work with multiple subvolumes. > > I will try to explain how it works without entering into too much details. > > There is an important requisite for all client translators that use > healing: they must have exactly the same list of subvolumes and in the same > order. Currently, I think this is not a problem. > > The heal translator treats each file as an independent entity, and each > one can be in 3 modes: > > 1. Normal mode > > This is the normal mode for a copy or fragment of a file when it is > synchronized and consistent with the same file on other nodes (for example > with other replicas. It is the client translator who decides if it is > synchronized or not). > > 2. Healing mode > > This is the mode used when a client detects an inconsistency in the copy > or fragment of the file stored on this node and initiates the healing > procedures. > > 3. Provider mode (I don't like very much this name, though) > > This is the mode used by client translators when an inconsistency is > detected in this file, but the copy or fragment stored in this node is > considered good and it will be used as a source to repair the contents of > this file on other nodes. > > Initially, when a file is created, it is set in normal mode. Client > translators that make changes must guarantee that they send the > modification requests in the same order to all the servers. This should be > done using inodelk/entrylk. > > When a change is sent to a server, the client must include a bitmap mask > of the clients to which the request is being sent. Normally this is a > bitmap containing all the clients, however, when a server fails for some > reason some bits will be cleared. The heal translator uses this bitmap to > early detect failures on other nodes from the point of view of each client. > When this condition is detected, the request is aborted with an error and > the client is notified with the remaining list of valid nodes. If the > client considers the request can be successfully server with the remaining > list of nodes, it can resend the request with the updated bitmap. 
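A hypothetical sketch of the per-request subvolume bitmap described in the quoted paragraph above; none of these helpers exist in GlusterFS, they only illustrate clearing a bit when a node fails and deciding whether the request can be resent with the updated mask (the popcount used here is the GCC builtin):

/* Hypothetical illustration only -- not GlusterFS code. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t subvol_mask_t;

static subvol_mask_t
mask_all (unsigned int subvol_count)          /* one bit per subvolume */
{
        return (subvol_count >= 64) ? ~0ULL : ((1ULL << subvol_count) - 1);
}

int
main (void)
{
        subvol_mask_t mask = mask_all (4);    /* 4 subvolumes, all healthy */

        mask &= ~(1ULL << 2);                 /* node 2 failed: clear its bit */

        /* The client decides whether the surviving set is still good
         * enough and, if so, resends the request with the updated bitmap. */
        if (__builtin_popcountll (mask) >= 3)
                printf ("resend with bitmap 0x%02llx\n",
                        (unsigned long long) mask);
        else
                printf ("not enough healthy subvolumes, abort the request\n");

        return 0;
}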
> > The heal translator also updates two file attributes for each change > request to mantain the "version" of the data and metadata contents of the > file. A similar task is currently made by AFR using xattrop. This would not > be needed anymore, speeding write requests. > > The version of data and metadata is returned to the client for each read > request, allowing it to detect inconsistent data. > > When a client detects an inconsistency, it initiates healing. First of > all, it must lock the entry and inode (when necessary). Then, from the data > collected from each node, it must decide which nodes have good data and > which ones have bad data and hence need to be healed. There are two > possible cases: > > 1. File is not a regular file > > In this case the reconstruction is very fast and requires few requests, so > it is done while the file is locked. In this case, the heal translator does > nothing relevant. > > 2. File is a regular file > > For regular files, the first step is to synchronize the metadata to the > bad nodes, including the version information. Once this is done, the file > is set in healing mode on bad nodes, and provider mode on good nodes. Then > the entry and inode are unlocked. > > When a file is in provider mode, it works as in normal mode, but refuses > to start another healing. Only one client can be healing a file. > > When a file is in healing mode, each normal write request from any client > are handled as if the file were in normal mode, updating the version > information and detecting possible inconsistencies with the bitmap. > Additionally, the healing translator marks the written region of the file > as "good". > > Each write request from the healing client intended to repair the file > must be marked with a special flag. In this case, the area that wants to be > written is filtered by the list of "good" ranges (if there are any > intersection with a good range, it is removed from the request). The > resulting set of ranges are propagated to the lower translator and added to > the list of "good" ranges but the version information is not updated. > > Read requests are only served if the range requested is entirely contained > into the "good" regions list. > > There are some additional details, but I think this is enough to have a > general idea of its purpose and how it works. > > The main advantages of this translator are: > > 1. Avoid duplicated code in client translators > 2. Simplify and unify healing methods in client translators > 3. xattrop is not needed anymore in client translators to keep track of > changes > 4. Full file contents are repaired without locking the file > 5. Better detection and prevention of some split brain situations as soon > as possible > > I think it would be very useful. It seems to me that it works correctly in > all situations, however I don't have all the experience that other > developers have with the healing functions of AFR, so I will be happy to > answer any question or suggestion to solve problems it may have or to > improve it. > > What do you think about it ? > > The goals you state above are all valid. What would really help (adoption) is if you can implement this as a modification of AFR by utilizing all the work already done, and you get brownie points if it is backward compatible with existing AFR. If you already have any code in a publishable state, please share it with us (github link?). Avati -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ian.latter at midnightcode.org Tue May 22 00:40:03 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 10:40:03 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> Actually, while we're at this level I'd like to bolt on another thought / query - these were my words; > But after sitting on the idea for a couple of days I actually came > to the same conclusion as Manu did in the last message. I.e. > without docco I have been writing to what seems to work, and > in my 2009 code (I saw last night) a "mkdir" wind followed by "create" > code in the same function - which I believe, now, is probably a > race condition (because of the threaded/async structure forced > through the wind/call macro model). But they include an assumption. The query is: are async writes and reads sequential? The two specific cases are; 1) Are all reads that are initiated in time after a write guaranteed to occur after that write has taken affect? 2) Are all writes that are initiated in time after a write (x) guaranteed to occur after that write (x) has taken affect? I could also appreciate that there may be a difference between the top/user layer view and the xlator internals .. if there is then can you please include that view in the explanation? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Tue May 22 01:27:41 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 18:27:41 -0700 Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> References: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> Message-ID: On Mon, May 21, 2012 at 5:40 PM, Ian Latter wrote: > > But they include an assumption. > > The query is: are async writes and reads sequential? The > two specific cases are; > > 1) Are all reads that are initiated in time after a write > guaranteed to occur after that write has taken affect? > Yes > > 2) Are all writes that are initiated in time after a write (x) > guaranteed to occur after that write (x) has taken > affect? > Only overlapping offsets/regions retain causal ordering of completion. It is write-behind which acknowledges writes pre-maturely and therefore the layer which must maintain the 'effects' for further reads and writes by making the dependent IOs (overlapping offset/regions) wait for previous write's actual completion. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 05:33:37 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 07:33:37 +0200 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: Message-ID: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Anand Avati wrote: > Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the > chown() or chmod() syscall issued by the application strictly block till > GlusterFS's fuse_setattr_cbk() is called? I have been able to narrow the test down to the code below, which does not even call chown(). 
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <err.h>
#include <sysexits.h>

int
main(void)
{
        int fd;

        (void)mkdir("subdir", 0755);

        do {
                if ((fd = open("subdir/bugc1.txt", O_CREAT|O_RDWR, 0644)) == -1)
                        err(EX_OSERR, "open failed");

                if (close(fd) == -1)
                        err(EX_OSERR, "close failed");

                if (unlink("subdir/bugc1.txt") == -1)
                        err(EX_OSERR, "unlink failed");
        } while (1 /*CONSTCOND */);

        /* NOTREACHED */
        return EX_OK;
}

It produces a FUSE trace without SETATTR:

> unique = 393, nodeid = 3098542496, opcode = LOOKUP (1)
< unique = 393, nodeid = 3098542496, opcode = LOOKUP (1), error = -2
> unique = 394, nodeid = 3098542496, opcode = CREATE (35)
< unique = 394, nodeid = 3098542496, opcode = CREATE (35), error = 0

-> I suspect (not yet checked) this is the place where I get fuse_entry_out
with attr.uid = 0. This will be cached since attr_valid tells us to do so.

> unique = 395, nodeid = 3098542396, opcode = RELEASE (18)
< unique = 395, nodeid = 3098542396, opcode = RELEASE (18), error = 0
> unique = 396, nodeid = 3098542296, opcode = LOOKUP (1)
< unique = 396, nodeid = 3098542296, opcode = LOOKUP (1), error = 0

From other traces, I can tell that this last lookup is for the parent
directory (subdir). The FUSE request for looking up bugc1.txt with the
intent of deleting is not even sent: from the cached uid we obtained from
fuse_entry_out, we know that permissions shall be denied (I had a debug
printf to check that). We do not even ask.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From anand.avati at gmail.com Tue May 22 05:44:30 2012
From: anand.avati at gmail.com (Anand Avati)
Date: Mon, 21 May 2012 22:44:30 -0700
Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition)
In-Reply-To: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org>
References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org>
Message-ID:

On Mon, May 21, 2012 at 10:33 PM, Emmanuel Dreyfus wrote:

> Anand Avati wrote:
>
> > Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the
> > chown() or chmod() syscall issued by the application strictly block till
> > GlusterFS's fuse_setattr_cbk() is called?
>
> I have been able to narrow the test down to the code below, which does not
> even call chown().
>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <err.h>
> #include <sysexits.h>
>
> int
> main(void)
> {
>         int fd;
>
>         (void)mkdir("subdir", 0755);
>
>         do {
>                 if ((fd = open("subdir/bugc1.txt", O_CREAT|O_RDWR, 0644)) == -1)
>                         err(EX_OSERR, "open failed");
>
>                 if (close(fd) == -1)
>                         err(EX_OSERR, "close failed");
>
>                 if (unlink("subdir/bugc1.txt") == -1)
>                         err(EX_OSERR, "unlink failed");
>         } while (1 /*CONSTCOND */);
>
>         /* NOTREACHED */
>         return EX_OK;
> }
>
> It produces a FUSE trace without SETATTR:
>
> > unique = 393, nodeid = 3098542496, opcode = LOOKUP (1)
> < unique = 393, nodeid = 3098542496, opcode = LOOKUP (1), error = -2
> > unique = 394, nodeid = 3098542496, opcode = CREATE (35)
> < unique = 394, nodeid = 3098542496, opcode = CREATE (35), error = 0
>
> -> I suspect (not yet checked) this is the place where I get fuse_entry_out
> with attr.uid = 0. This will be cached since attr_valid tells us to do so.
>
> > unique = 395, nodeid = 3098542396, opcode = RELEASE (18)
> < unique = 395, nodeid = 3098542396, opcode = RELEASE (18), error = 0
> > unique = 396, nodeid = 3098542296, opcode = LOOKUP (1)
> < unique = 396, nodeid = 3098542296, opcode = LOOKUP (1), error = 0
>
> From other traces, I can tell that this last lookup is for the parent
> directory (subdir).
The FUSE request for looking up bugc1.txt with the intent of > deleting > is not even sent: from cached uid we obtained from fuse_entry_out, we know > that > permissions shall be denied (I had a debug printf to check that). We do > not even > ask. > > Even in the case where bugc1.txt got a wrong uid returned (assuming so), it should not influence the permissibility of it getting deleted. The deletability of a file is based on the permissions on the parent directory and not the ownership of the file (unless +t sticky bit was set on the directory). Is there a way you can extend the trace code above to show the UIDs getting returned? Maybe it was the parent directory (subdir) that got a wrong UID returned? Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From aavati at redhat.com Tue May 22 07:11:36 2012 From: aavati at redhat.com (Anand Avati) Date: Tue, 22 May 2012 00:11:36 -0700 Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <7a290bd4-c833-4a35-af04-adb0052f6ff2@zmail01.collab.prod.int.phx2.redhat.com> References: <7a290bd4-c833-4a35-af04-adb0052f6ff2@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FBB3C28.2020106@redhat.com> The PARENT_DOWN_HANDLED approach will take us backwards from the current state where we are resiliant to frame losses and other class of bugs (i.e, if a frame loss happens on either server or client, it only results in prevented graph cleanup but the graph switch still happens). The root "cause" here is that we are giving up on a very important and fundamental principle of immutability on the fd object. The real solution here is to never modify fd->inode. Instead we must bring about a more native fd "migration" than just re-opening an existing fd on the new graph. Think of the inode migration analogy. The handle coming from FUSE (the address of the object) is a "hint". Usually the hint is right, if the object in the address belongs to the latest graph. If not, using the GFID we resolve a new inode on the latest graph and use it. In case of FD we can do something similar, except there are not GFIDs (which should not be a problem). We need to make the handle coming from FUSE (the address of fd_t) just a hint. If the fd->inode->table->xl->graph is the latest, then the hint was a HIT. If the graph was not the latest, we look for a previous migration attempt+result in the "base" (original) fd's context. If that does not exist or is not fresh (on the latest graph) then we do a new fd creation, open on new graph, fd_unref the old cached result in the fd context of the "base fd" and keep ref to this new result. All this must happen from fuse_resolve_fd(). The setting of the latest fd and updation of the latest fd pointer happens under the scope of the base_fd->lock() which gives it a very clear and unambiguous scope which was missing with the old scheme. [The next step will be to nuke the fd->inode swapping in fuse_create_cbk] Avati On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote: > > > ----- Original Message ----- >> From: "Pranith Kumar Karampuri" >> To: "Anand Avati" >> Cc: "Vijay Bellur", "Amar Tumballi", "Krishnan Parthasarathi" >> , "Raghavendra Gowdappa" >> Sent: Tuesday, May 22, 2012 8:42:58 AM >> Subject: Re: RFC on fix to bug #802414 >> >> Dude, >> We have already put logs yesterday in LOCK and UNLOCK and saw >> that the&fd->inode->lock address changed from LOCK to UNLOCK. > > Yes, even I too believe that the hang is because of fd->inode swap in fuse_migrate_fd and not the one in fuse_create_cbk. 
We could clearly see in the log files following race: > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this was a naive fix - hold lock on inode in old graph - to the race-condition caused by swapping fd->inode, which didn't work) > > poll-thr: tries to acquire fd->inode->lock (inode is old_inode present in old-graph) in afr_local_cleanup > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock > poll-thr: gets woken up from lock call on old_inode->lock. > poll-thr: does its work, but while unlocking, uses fd->inode where inode belongs to new graph. > > we had logs printing lock address before and after acquisition of lock and we could clearly see that lock address changed after acquiring lock in afr_local_cleanup. > >> >>>> "The hang in fuse_migrate_fd is _before_ the inode swap performed >>>> there." >> All the fds are opened on the same file. So all fds in the fd >> migration point to same inode. The race is hit by nth fd, (n+1)th fd >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and >> LOCK(fd->inode->lock) was done with one address then by the time >> UNLOCK(fd->inode->lock) is done the address changed. So the next fd >> that has to migrate hung because the prev inode lock is not >> unlocked. >> >> If after nth fd introduces the race a _cbk comes in epoll thread on >> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will >> hang. >> Which is my theory for the hang we observed on Saturday. >> >> Pranith. >> ----- Original Message ----- >> From: "Anand Avati" >> To: "Raghavendra Gowdappa" >> Cc: "Vijay Bellur", "Amar Tumballi" >> , "Krishnan Parthasarathi" >> , "Pranith Kumar Karampuri" >> >> Sent: Tuesday, May 22, 2012 2:09:33 AM >> Subject: Re: RFC on fix to bug #802414 >> >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote: >>> Avati, >>> >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new >>> inode to fd, once it looks up inode in new graph. But this >>> assignment can race with code that accesses fd->inode->lock >>> executing in poll-thread (pthr) as follows >>> >>> pthr: LOCK (fd->inode->lock); (inode in old graph) >>> rdthr: fd->inode = inode (resolved in new graph) >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph) >>> >> >> The way I see it (the backtrace output in the other mail), the swap >> happening in fuse_create_cbk() must be the one causing lock/unlock to >> land on different inode objects. The hang in fuse_migrate_fd is >> _before_ >> the inode swap performed there. Can you put some logs in >> fuse_create_cbk()'s inode swap code and confirm this? >> >> >>> Now, any lock operations on inode in old graph will block. Thanks >>> to pranith for pointing to this race-condition. >>> >>> The problem here is we don't have a single lock that can >>> synchronize assignment "fd->inode = inode" and other locking >>> attempts on fd->inode->lock. So, we are thinking that instead of >>> trying to synchronize, eliminate the parallel accesses altogether. >>> This can be done by splitting fd migration into two tasks. >>> >>> 1. Actions on old graph (like fsync to flush writes to disk) >>> 2. Actions in new graph (lookup, open) >>> >>> We can send PARENT_DOWN when, >>> 1. Task 1 is complete. >>> 2. No fop sent by fuse is pending. >>> >>> on receiving PARENT_DOWN, protocol/client will shutdown transports. >>> As part of transport cleanup, all pending frames are unwound and >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED >>> event. 
Each of the translator will pass this event to its parents >>> once it is convinced that there are no pending fops started by it >>> (like background self-heal, reads as part of read-ahead etc). Once >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there >>> will be no replies that will be racing with migration (note that >>> migration is done using syncops). At this point in time, it is >>> safe to start Task 2 (which associates fd with an inode in new >>> graph). >>> >>> Also note that reader thread will not do other operations till it >>> completes both tasks. >>> >>> As far as the implementation of this patch goes, major work is in >>> translators like read-ahead, afr, dht to provide the guarantee >>> required to send PARENT_DOWN_HANDLED event to their parents. >>> >>> Please let me know your thoughts on this. >>> >> >> All the above steps might not apply if it is caused by the swap in >> fuse_create_cbk(). Let's confirm that first. >> >> Avati >> From ian.latter at midnightcode.org Tue May 22 07:18:08 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 17:18:08 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205220718.q4M7I8sJ019827@singularity.tronunltd.com> > > But they include an assumption. > > > > The query is: are async writes and reads sequential? The > > two specific cases are; > > > > 1) Are all reads that are initiated in time after a write > > guaranteed to occur after that write has taken affect? > > > > Yes > Excellent. > > > > 2) Are all writes that are initiated in time after a write (x) > > guaranteed to occur after that write (x) has taken > > affect? > > > > Only overlapping offsets/regions retain causal ordering of completion. It > is write-behind which acknowledges writes pre-maturely and therefore the > layer which must maintain the 'effects' for further reads and writes by > making the dependent IOs (overlapping offset/regions) wait for previous > write's actual completion. > Ok, that should do the trick. Let me mull over this for a while .. Thanks for that info. > Avati > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Tue May 22 07:44:25 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 22 May 2012 09:44:25 +0200 Subject: [Gluster-devel] A healing translator In-Reply-To: References: <4FA8E8AB.2040604@datalab.es> Message-ID: <4FBB43D9.9070605@datalab.es> On 05/22/2012 02:11 AM, Anand Avati wrote: > > > On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez > > wrote: > > Hello developers, > > I would like to expose some ideas we are working on to create a > new kind of translator that should be able to unify and simplify > to some extent the healing procedures of complex translators. > > Currently, the only translator with complex healing capabilities > that we are aware of is AFR. We are developing another translator > that will also need healing capabilities, so we thought that it > would be interesting to create a new translator able to handle the > common part of the healing process and hence to simplify and avoid > duplicated code in other translators. > > The basic idea of the new translator is to handle healing tasks > nearer the storage translator on the server nodes instead to > control everything from a translator on the client nodes. Of > course the heal translator is not able to handle healing entirely > by itself, it needs a client translator which will coordinate all > tasks. 
The heal translator is intended to be used by translators > that work with multiple subvolumes. > > I will try to explain how it works without entering into too much > details. > > There is an important requisite for all client translators that > use healing: they must have exactly the same list of subvolumes > and in the same order. Currently, I think this is not a problem. > > The heal translator treats each file as an independent entity, and > each one can be in 3 modes: > > 1. Normal mode > > This is the normal mode for a copy or fragment of a file when > it is synchronized and consistent with the same file on other > nodes (for example with other replicas. It is the client > translator who decides if it is synchronized or not). > > 2. Healing mode > > This is the mode used when a client detects an inconsistency > in the copy or fragment of the file stored on this node and > initiates the healing procedures. > > 3. Provider mode (I don't like very much this name, though) > > This is the mode used by client translators when an > inconsistency is detected in this file, but the copy or > fragment stored in this node is considered good and it will be > used as a source to repair the contents of this file on other > nodes. > > Initially, when a file is created, it is set in normal mode. > Client translators that make changes must guarantee that they send > the modification requests in the same order to all the servers. > This should be done using inodelk/entrylk. > > When a change is sent to a server, the client must include a > bitmap mask of the clients to which the request is being sent. > Normally this is a bitmap containing all the clients, however, > when a server fails for some reason some bits will be cleared. The > heal translator uses this bitmap to early detect failures on other > nodes from the point of view of each client. When this condition > is detected, the request is aborted with an error and the client > is notified with the remaining list of valid nodes. If the client > considers the request can be successfully server with the > remaining list of nodes, it can resend the request with the > updated bitmap. > > The heal translator also updates two file attributes for each > change request to mantain the "version" of the data and metadata > contents of the file. A similar task is currently made by AFR > using xattrop. This would not be needed anymore, speeding write > requests. > > The version of data and metadata is returned to the client for > each read request, allowing it to detect inconsistent data. > > When a client detects an inconsistency, it initiates healing. > First of all, it must lock the entry and inode (when necessary). > Then, from the data collected from each node, it must decide which > nodes have good data and which ones have bad data and hence need > to be healed. There are two possible cases: > > 1. File is not a regular file > > In this case the reconstruction is very fast and requires few > requests, so it is done while the file is locked. In this > case, the heal translator does nothing relevant. > > 2. File is a regular file > > For regular files, the first step is to synchronize the > metadata to the bad nodes, including the version information. > Once this is done, the file is set in healing mode on bad > nodes, and provider mode on good nodes. Then the entry and > inode are unlocked. > > When a file is in provider mode, it works as in normal mode, but > refuses to start another healing. Only one client can be healing a > file. 
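A tiny sketch of the three per-file modes being described, with hypothetical names (no such enum exists in any published code); the "only one client can be healing a file" rule means only the healing client moves a file out of the healing/provider pair of states:

/* Hypothetical illustration of the per-file modes described above. */
typedef enum {
        HEAL_MODE_NORMAL   = 0, /* copy/fragment consistent with its peers   */
        HEAL_MODE_HEALING  = 1, /* local copy is bad and is being rebuilt    */
        HEAL_MODE_PROVIDER = 2  /* local copy is good; serves as heal source */
} heal_file_mode_t;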
> > When a file is in healing mode, each normal write request from any > client are handled as if the file were in normal mode, updating > the version information and detecting possible inconsistencies > with the bitmap. Additionally, the healing translator marks the > written region of the file as "good". > > Each write request from the healing client intended to repair the > file must be marked with a special flag. In this case, the area > that wants to be written is filtered by the list of "good" ranges > (if there are any intersection with a good range, it is removed > from the request). The resulting set of ranges are propagated to > the lower translator and added to the list of "good" ranges but > the version information is not updated. > > Read requests are only served if the range requested is entirely > contained into the "good" regions list. > > There are some additional details, but I think this is enough to > have a general idea of its purpose and how it works. > > The main advantages of this translator are: > > 1. Avoid duplicated code in client translators > 2. Simplify and unify healing methods in client translators > 3. xattrop is not needed anymore in client translators to keep > track of changes > 4. Full file contents are repaired without locking the file > 5. Better detection and prevention of some split brain situations > as soon as possible > > I think it would be very useful. It seems to me that it works > correctly in all situations, however I don't have all the > experience that other developers have with the healing functions > of AFR, so I will be happy to answer any question or suggestion to > solve problems it may have or to improve it. > > What do you think about it ? > > > The goals you state above are all valid. What would really help > (adoption) is if you can implement this as a modification of AFR by > utilizing all the work already done, and you get brownie points if it > is backward compatible with existing AFR. If you already have any code > in a publishable state, please share it with us (github link?). > > Avati I've tried to understand how AFR works and, in some way, some of the ideas have been taken from it. However it is very complex and a lot of changes have been carried out in the master branch over the latest months. It's hard for me to follow them while actively working on my translator. Nevertheless, the main reason to take a separate path was that AFR is strongly bound to replication (at least from what I saw when I analyzed it more deeply. Maybe things have changed now, but haven't had time to review them). The requirements for my translator didn't fit very well with AFR, and the needed effort to understand and modify it to adapt it was too high. It also seems that there isn't any detailed developer info about internals of AFR that could have helped to be more confident to modify it (at least I haven't found it). I'm currenty working on it, but it's not ready yet. As soon as it is in a minimally stable state we will publish it, probably on github. I'll write the url to this list. Thank you -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From anand.avati at gmail.com Tue May 22 07:48:43 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 22 May 2012 00:48:43 -0700 Subject: [Gluster-devel] A healing translator In-Reply-To: <4FBB43D9.9070605@datalab.es> References: <4FA8E8AB.2040604@datalab.es> <4FBB43D9.9070605@datalab.es> Message-ID: > > > I've tried to understand how AFR works and, in some way, some of the > ideas have been taken from it. However it is very complex and a lot of > changes have been carried out in the master branch over the latest months. > It's hard for me to follow them while actively working on my translator. > Nevertheless, the main reason to take a separate path was that AFR is > strongly bound to replication (at least from what I saw when I analyzed it > more deeply. Maybe things have changed now, but haven't had time to review > them). > Have you reviewed the proactive self-heal daemon (+ changelog indexing translator) which is a potential functional replacement for what you might be attempting? Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 08:16:06 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 08:16:06 +0000 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: <20120522081606.GA3976@homeworld.netbsd.org> On Mon, May 21, 2012 at 10:44:30PM -0700, Anand Avati wrote: > Even in the case where bugc1.txt got a wrong uid returned (assuming so), it > should not influence the permissibility of it getting deleted. The > deletability of a file is based on the permissions on the parent directory > and not the ownership of the file (unless +t sticky bit was set on the > directory). This is interesting: I get the behavior you describe on Linux (ext2fs), but NetBSD (FFS) hehaves differently (these are native test, without glusterfs). Is it a grey area in standards? $ ls -la test/ total 16 drwxr-xr-x 2 root wheel 512 May 22 10:10 . drwxr-xr-x 19 manu wheel 5632 May 22 10:10 .. -rw-r--r-- 1 manu wheel 0 May 22 10:10 toto $ whoami manu $ rm -f test/toto rm: test/toto: Permission denied $ uname -sr NetBSD 5.1_STABLE -- Emmanuel Dreyfus manu at netbsd.org From rgowdapp at redhat.com Tue May 22 08:44:00 2012 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Tue, 22 May 2012 04:44:00 -0400 (EDT) Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <4FBB3C28.2020106@redhat.com> Message-ID: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > From: "Anand Avati" > To: "Raghavendra Gowdappa" > Cc: "Pranith Kumar Karampuri" , "Vijay Bellur" , "Amar Tumballi" > , "Krishnan Parthasarathi" , gluster-devel at nongnu.org > Sent: Tuesday, May 22, 2012 12:41:36 PM > Subject: Re: RFC on fix to bug #802414 > > > > The PARENT_DOWN_HANDLED approach will take us backwards from the > current > state where we are resiliant to frame losses and other class of bugs > (i.e, if a frame loss happens on either server or client, it only > results in prevented graph cleanup but the graph switch still > happens). > > The root "cause" here is that we are giving up on a very important > and > fundamental principle of immutability on the fd object. The real > solution here is to never modify fd->inode. Instead we must bring > about > a more native fd "migration" than just re-opening an existing fd on > the > new graph. > > Think of the inode migration analogy. 
The handle coming from FUSE > (the > address of the object) is a "hint". Usually the hint is right, if the > object in the address belongs to the latest graph. If not, using the > GFID we resolve a new inode on the latest graph and use it. > > In case of FD we can do something similar, except there are not GFIDs > (which should not be a problem). We need to make the handle coming > from > FUSE (the address of fd_t) just a hint. If the > fd->inode->table->xl->graph is the latest, then the hint was a HIT. > If > the graph was not the latest, we look for a previous migration > attempt+result in the "base" (original) fd's context. If that does > not > exist or is not fresh (on the latest graph) then we do a new fd > creation, open on new graph, fd_unref the old cached result in the fd > context of the "base fd" and keep ref to this new result. All this > must > happen from fuse_resolve_fd(). The setting of the latest fd and > updation > of the latest fd pointer happens under the scope of the > base_fd->lock() > which gives it a very clear and unambiguous scope which was missing > with > the old scheme. I remember discussing this solution during initial design. But, not sure why we dropped it. So, Can I go ahead with the implementation? Is this fix required post 3.3? > > [The next step will be to nuke the fd->inode swapping in > fuse_create_cbk] > > Avati > > On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote: > > > > > > ----- Original Message ----- > >> From: "Pranith Kumar Karampuri" > >> To: "Anand Avati" > >> Cc: "Vijay Bellur", "Amar > >> Tumballi", "Krishnan Parthasarathi" > >> , "Raghavendra Gowdappa" > >> Sent: Tuesday, May 22, 2012 8:42:58 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> Dude, > >> We have already put logs yesterday in LOCK and UNLOCK and saw > >> that the&fd->inode->lock address changed from LOCK to UNLOCK. > > > > Yes, even I too believe that the hang is because of fd->inode swap > > in fuse_migrate_fd and not the one in fuse_create_cbk. We could > > clearly see in the log files following race: > > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this > > was a naive fix - hold lock on inode in old graph - to the > > race-condition caused by swapping fd->inode, which didn't work) > > > > poll-thr: tries to acquire fd->inode->lock (inode is old_inode > > present in old-graph) in afr_local_cleanup > > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock > > poll-thr: gets woken up from lock call on old_inode->lock. > > poll-thr: does its work, but while unlocking, uses fd->inode where > > inode belongs to new graph. > > > > we had logs printing lock address before and after acquisition of > > lock and we could clearly see that lock address changed after > > acquiring lock in afr_local_cleanup. > > > >> > >>>> "The hang in fuse_migrate_fd is _before_ the inode swap > >>>> performed > >>>> there." > >> All the fds are opened on the same file. So all fds in the fd > >> migration point to same inode. The race is hit by nth fd, (n+1)th > >> fd > >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and > >> LOCK(fd->inode->lock) was done with one address then by the time > >> UNLOCK(fd->inode->lock) is done the address changed. So the next > >> fd > >> that has to migrate hung because the prev inode lock is not > >> unlocked. > >> > >> If after nth fd introduces the race a _cbk comes in epoll thread > >> on > >> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will > >> hang. 
> >> Which is my theory for the hang we observed on Saturday. > >> > >> Pranith. > >> ----- Original Message ----- > >> From: "Anand Avati" > >> To: "Raghavendra Gowdappa" > >> Cc: "Vijay Bellur", "Amar Tumballi" > >> , "Krishnan Parthasarathi" > >> , "Pranith Kumar Karampuri" > >> > >> Sent: Tuesday, May 22, 2012 2:09:33 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote: > >>> Avati, > >>> > >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new > >>> inode to fd, once it looks up inode in new graph. But this > >>> assignment can race with code that accesses fd->inode->lock > >>> executing in poll-thread (pthr) as follows > >>> > >>> pthr: LOCK (fd->inode->lock); (inode in old graph) > >>> rdthr: fd->inode = inode (resolved in new graph) > >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph) > >>> > >> > >> The way I see it (the backtrace output in the other mail), the > >> swap > >> happening in fuse_create_cbk() must be the one causing lock/unlock > >> to > >> land on different inode objects. The hang in fuse_migrate_fd is > >> _before_ > >> the inode swap performed there. Can you put some logs in > >> fuse_create_cbk()'s inode swap code and confirm this? > >> > >> > >>> Now, any lock operations on inode in old graph will block. Thanks > >>> to pranith for pointing to this race-condition. > >>> > >>> The problem here is we don't have a single lock that can > >>> synchronize assignment "fd->inode = inode" and other locking > >>> attempts on fd->inode->lock. So, we are thinking that instead of > >>> trying to synchronize, eliminate the parallel accesses > >>> altogether. > >>> This can be done by splitting fd migration into two tasks. > >>> > >>> 1. Actions on old graph (like fsync to flush writes to disk) > >>> 2. Actions in new graph (lookup, open) > >>> > >>> We can send PARENT_DOWN when, > >>> 1. Task 1 is complete. > >>> 2. No fop sent by fuse is pending. > >>> > >>> on receiving PARENT_DOWN, protocol/client will shutdown > >>> transports. > >>> As part of transport cleanup, all pending frames are unwound and > >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED > >>> event. Each of the translator will pass this event to its parents > >>> once it is convinced that there are no pending fops started by it > >>> (like background self-heal, reads as part of read-ahead etc). > >>> Once > >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there > >>> will be no replies that will be racing with migration (note that > >>> migration is done using syncops). At this point in time, it is > >>> safe to start Task 2 (which associates fd with an inode in new > >>> graph). > >>> > >>> Also note that reader thread will not do other operations till it > >>> completes both tasks. > >>> > >>> As far as the implementation of this patch goes, major work is in > >>> translators like read-ahead, afr, dht to provide the guarantee > >>> required to send PARENT_DOWN_HANDLED event to their parents. > >>> > >>> Please let me know your thoughts on this. > >>> > >> > >> All the above steps might not apply if it is caused by the swap in > >> fuse_create_cbk(). Let's confirm that first. 
> >> > >> Avati > >> > > From xhernandez at datalab.es Tue May 22 08:51:22 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 22 May 2012 10:51:22 +0200 Subject: [Gluster-devel] A healing translator In-Reply-To: References: <4FA8E8AB.2040604@datalab.es> <4FBB43D9.9070605@datalab.es> Message-ID: <4FBB538A.70201@datalab.es> On 05/22/2012 09:48 AM, Anand Avati wrote: > >> > I've tried to understand how AFR works and, in some way, some of > the ideas have been taken from it. However it is very complex and > a lot of changes have been carried out in the master branch over > the latest months. It's hard for me to follow them while actively > working on my translator. Nevertheless, the main reason to take a > separate path was that AFR is strongly bound to replication (at > least from what I saw when I analyzed it more deeply. Maybe things > have changed now, but haven't had time to review them). > > > Have you reviewed the proactive self-heal daemon (+ changelog indexing > translator) which is a potential functional replacement for what you > might be attempting? > > Avati I must admit that I've read something about it but I haven't had time to explore it in detail. If I understand it correctly, the self-heal daemon works as a client process but can be executed on server nodes. I suppose that multiple self-heal daemons can be running on different nodes. Then, each daemon detects invalid files (not sure exactly how) and replicates the changes from one good node to the bad nodes. The problem is that in the translator I'm working on, the information is dispersed among multiple nodes, so there isn't a single server node that contains the whole data. To repair a node, data must be read from at least two other nodes (it depends on configuration). From what I've read from AFR and the self-healing daemon, it's not straightforward to adapt them to this mechanism because they would need to know a subset of nodes with consistent data, not only one. Each daemon would have to contact all other nodes, read data from each one, determine which ones are valid, rebuild the data and send it to the bad nodes. This means that the daemon will have to be as complex as the clients. My impression (but I may be wrong) is that AFR and the self-healing daemon are closely bound to the replication schema, so it is very hard to try to use them for other purposes. The healing translator I'm writing tries to offer generic server side helpers for the healing process, but it is the client side who really manages the healing operation (though heavily simplified) and could use it to replicate data, to disperse data, or some other schema. Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 09:08:48 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 09:08:48 +0000 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: <20120522090848.GC3976@homeworld.netbsd.org> On Mon, May 21, 2012 at 10:44:30PM -0700, Anand Avati wrote: > Is there a way you can extend the trace code above to show the UIDs getting > returned? Maybe it was the parent directory (subdir) that got a wrong UID > returned? Further investigation shows you are right. 
I traced the struct fuse_entry_out returned by glusterfs on LOOKUP; "/subdir", uid = 500, gid = 500, mode = 040755, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 bugc1.txt is looked up many times as I loop creating/deleting it subdir is not looked up often since it is cached for 1 second. New subdir lookups will return correct uid/gid/mode. After some time, though, it will return incorrect information: "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir", uid = 0, gid = 0, mode = 040700, attr_valid = 1 -- Emmanuel Dreyfus manu at netbsd.org From aavati at redhat.com Tue May 22 17:47:49 2012 From: aavati at redhat.com (Anand Avati) Date: Tue, 22 May 2012 10:47:49 -0700 Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> References: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FBBD145.3030303@redhat.com> On 05/22/2012 01:44 AM, Raghavendra Gowdappa wrote: > > > ----- Original Message ----- >> From: "Anand Avati" >> To: "Raghavendra Gowdappa" >> Cc: "Pranith Kumar Karampuri", "Vijay Bellur", "Amar Tumballi" >> , "Krishnan Parthasarathi", gluster-devel at nongnu.org >> Sent: Tuesday, May 22, 2012 12:41:36 PM >> Subject: Re: RFC on fix to bug #802414 >> >> >> >> The PARENT_DOWN_HANDLED approach will take us backwards from the >> current >> state where we are resiliant to frame losses and other class of bugs >> (i.e, if a frame loss happens on either server or client, it only >> results in prevented graph cleanup but the graph switch still >> happens). >> >> The root "cause" here is that we are giving up on a very important >> and >> fundamental principle of immutability on the fd object. The real >> solution here is to never modify fd->inode. Instead we must bring >> about >> a more native fd "migration" than just re-opening an existing fd on >> the >> new graph. >> >> Think of the inode migration analogy. The handle coming from FUSE >> (the >> address of the object) is a "hint". Usually the hint is right, if the >> object in the address belongs to the latest graph. If not, using the >> GFID we resolve a new inode on the latest graph and use it. >> >> In case of FD we can do something similar, except there are not GFIDs >> (which should not be a problem). We need to make the handle coming >> from >> FUSE (the address of fd_t) just a hint. If the >> fd->inode->table->xl->graph is the latest, then the hint was a HIT. >> If >> the graph was not the latest, we look for a previous migration >> attempt+result in the "base" (original) fd's context. If that does >> not >> exist or is not fresh (on the latest graph) then we do a new fd >> creation, open on new graph, fd_unref the old cached result in the fd >> context of the "base fd" and keep ref to this new result. All this >> must >> happen from fuse_resolve_fd(). The setting of the latest fd and >> updation >> of the latest fd pointer happens under the scope of the >> base_fd->lock() >> which gives it a very clear and unambiguous scope which was missing >> with >> the old scheme. > > I remember discussing this solution during initial design. But, not sure why we dropped it. So, Can I go ahead with the implementation? Is this fix required post 3.3? 
> The solution you are probably referring to was dropped because there we were talking about chaining FDs to the one on the "next graph" as graphs keep getting changed. The one described above is different because here there will one base fd (the original one on which open() by fuse was performed) and new graphs result in creation of an internal new fd directly referred by the base fd (and naturally unref the previous "new fd") thereby keeping things quite trim. Avati From anand.avati at gmail.com Tue May 22 20:09:52 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 22 May 2012 13:09:52 -0700 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: <20120522090848.GC3976@homeworld.netbsd.org> References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> <20120522090848.GC3976@homeworld.netbsd.org> Message-ID: On Tue, May 22, 2012 at 2:08 AM, Emmanuel Dreyfus wrote: > > Further investigation shows you are right. I traced the > struct fuse_entry_out returned by glusterfs on LOOKUP; > > "/subdir", uid = 500, gid = 500, mode = 040755, attr_valid = 1 > ... > "/subdir", uid = 0, gid = 0, mode = 040700, attr_valid = 1 > Note that even mode has changed, not just the uid/gid. It will probably help if you can put a breakpoint in this case and inspect the stack about where these attribute fields are fetched from (some cache? from posix?) Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Wed May 23 02:04:25 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 23 May 2012 04:04:25 +0200 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: Message-ID: <1kkj4ca.1knxmw01kr7wlgM%manu@netbsd.org> Anand Avati wrote: > Note that even mode has changed, not just the uid/gid. It will probably > help if you can put a breakpoint in this case and inspect the stack about > where these attribute fields are fetched from (some cache? from posix?) My tests shows that the garbage is introduced by mdc_inode_iatt_get() in mdc_lookup(). -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vijay at build.gluster.com Wed May 23 13:57:15 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Wed, 23 May 2012 06:57:15 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa43 released Message-ID: <20120523135718.0E6111008C@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa43/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz This release is made off v3.3.0qa43 From manu at netbsd.org Wed May 23 16:58:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 23 May 2012 16:58:02 +0000 Subject: [Gluster-devel] preparent and postparent? Message-ID: <20120523165802.GC17268@homeworld.netbsd.org> Hi in the protocol/server xlator, there are many occurences where callbacks have a struct iatt for preparent and postparent. What are these for? Is it a normal behavior to have different things in preparent and postparent? -- Emmanuel Dreyfus manu at netbsd.org From jdarcy at redhat.com Wed May 23 17:03:41 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Wed, 23 May 2012 13:03:41 -0400 Subject: [Gluster-devel] preparent and postparent? 
In-Reply-To: <20120523165802.GC17268@homeworld.netbsd.org> References: <20120523165802.GC17268@homeworld.netbsd.org> Message-ID: <20120523130341.1ee693a3@jdarcy-dt.usersys.redhat.com> On Wed, 23 May 2012 16:58:02 +0000 Emmanuel Dreyfus wrote: > in the protocol/server xlator, there are many occurences where > callbacks have a struct iatt for preparent and postparent. What are > these for? NFS needs them to support its style of caching. From manu at netbsd.org Thu May 24 01:31:18 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 24 May 2012 03:31:18 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <20120523130341.1ee693a3@jdarcy-dt.usersys.redhat.com> Message-ID: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> Jeff Darcy wrote: > > in the protocol/server xlator, there are many occurences where > > callbacks have a struct iatt for preparent and postparent. What are > > these for? > > NFS needs them to support its style of caching. Let me rephrase: what information is stored in preparent and postparent? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Thu May 24 04:29:39 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 24 May 2012 06:29:39 +0200 Subject: [Gluster-devel] gerrit Message-ID: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> Hi In gerrit, if I sign it and look at the Download field in a patchset, I see this: git fetch ssh://manu@/glusterfs refs/changes/13/3413/2 && git format-patch -1 --stdout FETCH_HEAD It would be nice if I had ssh://manu at git.gluster.com/glusterfs.git so that the line can be copy/pasted without the need to edit each time. Is it something I need to configure (where?), or is it a global setting beyond my reach (in that case, please someone fix it!) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Thu May 24 06:30:20 2012 From: anand.avati at gmail.com (Anand Avati) Date: Wed, 23 May 2012 23:30:20 -0700 Subject: [Gluster-devel] gerrit In-Reply-To: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> References: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> Message-ID: fixed! On Wed, May 23, 2012 at 9:29 PM, Emmanuel Dreyfus wrote: > Hi > > In gerrit, if I sign it and look at the Download field in a patchset, I > see this: > > git fetch ssh://manu@/glusterfs refs/changes/13/3413/2 && git > format-patch -1 --stdout FETCH_HEAD > > It would be nice if I had ssh://manu at git.gluster.com/glusterfs.git > so that the line can be copy/pasted without the need to edit each time. > Is it something I need to configure (where?), or is it a global setting > beyond my reach (in that case, please someone fix it!) > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at datalab.es Thu May 24 07:10:59 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Thu, 24 May 2012 09:10:59 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> Message-ID: <4FBDDF03.8080203@datalab.es> On 05/24/2012 03:31 AM, Emmanuel Dreyfus wrote: > Jeff Darcy wrote: > >>> in the protocol/server xlator, there are many occurences where >>> callbacks have a struct iatt for preparent and postparent. 
What are >>> these for? >> NFS needs them to support its style of caching. > Let me rephrase: what information is stored in preparent and postparent? preparent and postparent have the attributes (modification time, size, permissions, ...) of the parent directory of the file being modified before and after the modification is done. Xavi From jdarcy at redhat.com Thu May 24 13:05:08 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 24 May 2012 09:05:08 -0400 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <4FBDDF03.8080203@datalab.es> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> <4FBDDF03.8080203@datalab.es> Message-ID: <4FBE3204.7050005@redhat.com> On 05/24/2012 03:10 AM, Xavier Hernandez wrote: > preparent and postparent have the attributes (modification time, size, > permissions, ...) of the parent directory of the file being modified > before and after the modification is done. Thank you, Xavi. :) If you really want to have some fun, you can take a look at the rename callback, which has pre- and post-attributes for both the old and new parent. From johnmark at redhat.com Thu May 24 19:21:22 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 24 May 2012 15:21:22 -0400 (EDT) Subject: [Gluster-devel] glusterfs-3.3.0qa43 released In-Reply-To: <20120523135718.0E6111008C@build.gluster.com> Message-ID: <7c8ea685-d794-451e-820a-25f784e7873d@zmail01.collab.prod.int.phx2.redhat.com> A reminder: As we come down to the final days, it is vitally important that we test these last few qa releases. This one, in particular, contains fixes added to the 3.3 branch after beta 4 was release last week: http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz Please consider using the testing page when evaluating: http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests Also, if someone would like to test the object storage as well as the HDFS piece, please report here, or create another test page on the wiki. Finally, you can track all commits to the master and 3.3 branches on Twitter (@glusterdev) ...and via Atom/Rss - https://github.com/gluster/glusterfs/commits/release-3.3.atom https://github.com/gluster/glusterfs/commits/master.atom -JM ----- Original Message ----- > > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa43/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz > > This release is made off v3.3.0qa43 > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From xhernandez at datalab.es Fri May 25 07:28:43 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Fri, 25 May 2012 09:28:43 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <4FBE3204.7050005@redhat.com> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> <4FBDDF03.8080203@datalab.es> <4FBE3204.7050005@redhat.com> Message-ID: <4FBF34AB.6070606@datalab.es> On 05/24/2012 03:05 PM, Jeff Darcy wrote: > On 05/24/2012 03:10 AM, Xavier Hernandez wrote: >> preparent and postparent have the attributes (modification time, size, >> permissions, ...) of the parent directory of the file being modified >> before and after the modification is done. > Thank you, Xavi. :) If you really want to have some fun, you can take a look > at the rename callback, which has pre- and post-attributes for both the old and > new parent. Yes, I've had some "fun" with them. 
Without them almost all callbacks would seem too short to me now... hehehe From fernando.frediani at qubenet.net Fri May 25 09:44:10 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 09:44:10 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> Hi, I've set up a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 and the new type of volume, striped + replicated. My goal is to use it to run Virtual Machines (.vmdk files). The volume is created fine and the ESXi server mounts the Datastore using the Gluster built-in NFS; however, when trying to use the Datastore or even read from it, it hangs. Looking at the Gluster NFS logs I see: "[socket.c:195:__socket_rwv] 0-socket.nfs-server: readv failed (Connection reset by peer)" In order to get the rpm files installed I first had to install these two because of some library dependencies: "compat-readline5-5.2-17.1.el6.x86_64".rpm and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm". Not sure if it has anything to do with that. Has anyone ever used Gluster as backend storage for ESXi? Does it actually work? Regards, Fernando Frediani Lead Systems Engineer Qube Managed Services Limited 260-266 Goswell Road, London, EC1V 7EB, United Kingdom sales: +44 (0) 20 7150 3800 ddi: +44 (0) 20 7150 3803 fax: +44 (0) 20 7336 8420 web: http://www.qubenet.net/ Please consider the environment before printing this email -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.frediani at qubenet.net Fri May 25 11:36:55 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 11:36:55 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Actually, even on another Linux machine mounting NFS shows the same behaviour. I am able to mount it with "mount -t nfs ..." but when I try "ls" it hangs as well. One particular thing about the Gluster servers is that they have two networks, one for management with the default gateway and another only for storage. I am only able to mount on the storage network. The hosts file has all the nodes' names with the IPs on the storage network. I tried to use this, but it didn't work either.
gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* Watching the nfs logs when I try a "ls" from the remote client it shows: pending frames: patchset: git://git.gluster.com/glusterfs.git signal received: 11 time of crash: 2012-05-25 11:38:09 configuration details: argp 1 backtrace 1 dlfcn 1 fdatasync 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 3.3.0beta4 /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup+0xa5)[0x7f1c8e7a6ac5] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdirp_cbk+0x536)[0x7f1c8e543346] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdirp_cbk+0x1ca)[0x7f1c8e76269a] /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_readdirp_cbk+0x170)[0x7f1c8e9dbbe0] /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x7f1c8f818c8f] /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_handler+0x188)[0x7f1c8f818e38] /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] /usr/sbin/glusterfs(main+0x502)[0x406612] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] /usr/sbin/glusterfs[0x404399] Thanks Fernando From: Fernando Frediani (Qube) Sent: 25 May 2012 10:44 To: 'gluster-devel at nongnu.org' Subject: Can't use NFS with VMware ESXi Hi, I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 and the new type of volume striped + replicated. My go is to use it to run Virtual Machines (.vmdk files). Volume is created fine and the ESXi server mountw the Datastore using Gluster built-in NFS, however when trying to use the Datastore or even read, it hangs. Looking at the Gluster NFS logs I see: "[socket.c:195:__socket_rwv] 0-socket.nfs-server: readv failed (Connection reset by peer)" In order to get the rpm files installed I had first to install these two because of the some libraries: "compat-readline5-5.2-17.1.el6.x86_64".rpm and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has anything to do with that. Has anyone ever used Gluster as a backend storage for ESXi ? Does it actually work ? Regards, Fernando Frediani Lead Systems Engineer Qube Managed Services Limited 260-266 Goswell Road, London, EC1V 7EB, United Kingdom sales: +44 (0) 20 7150 3800 ddi: +44 (0) 20 7150 3803 fax: +44 (0) 20 7336 8420 web: http://www.qubenet.net/ P Please consider the environment before printing this email -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Fri May 25 13:35:19 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 25 May 2012 13:35:19 +0000 Subject: [Gluster-devel] mismatching ino/dev between file Message-ID: <20120525133519.GC19383@homeworld.netbsd.org> Hi Here is a bug with release-3.3. It happens on a 2 way replicated. 
Here is what I have in one brick: [2012-05-25 15:03:09.463446] W [posix-handle.c:487:posix_handle_hard] 0-pfs-posix: mismatching ino/dev between file /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (57943061/16) and handle /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce (57943060/16) [2012-05-25 15:03:09.463552] E [posix.c:1277:posix_symlink] 0-pfs-posix: setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 failed On the other one: [2012-05-25 15:03:09.447682] W [posix-handle.c:487:posix_handle_hard] 0-pfs-posix: mismatching ino/dev between file /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (50557989/24) and handle /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce (50557988/24) [2012-05-25 15:03:09.447774] E [posix.c:1277:posix_symlink] 0-pfs-posix: setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 failed Can someone give me a hint of what happens here, and how to track it down? -- Emmanuel Dreyfus manu at netbsd.org From abperiasamy at gmail.com Fri May 25 17:09:09 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Fri, 25 May 2012 10:09:09 -0700 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS shows the same behaviour. > I am able to mount it with "mount -t nfs ..." but when I try "ls" it hangs as > well. > > One particular thing about the Gluster servers is that they have two networks, > one for management with the default gateway and another only for storage. I am > only able to mount on the storage network. > > The hosts file has all the nodes' names with the IPs on the storage network. > > I tried to use this, but it didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > Watching the nfs logs when I try a "ls" from the remote client it shows:
from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup+0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdirp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdirp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_readdirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_handler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I?ve setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 > and the new type of volume striped + replicated. My go is to use it to run > Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or even > read, it hangs. > > > > Looking at the Gluster NFS logs I see: ????[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)? > > > > In order to get the rpm files installed I had first to install these two > because of the some libraries: ?compat-readline5-5.2-17.1.el6.x86_64?.rpm > and ?openssl098e-0.9.8e-17.el6.centos.x86_64.rpm?.Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From pmatthaei at debian.org Fri May 25 18:56:37 2012 From: pmatthaei at debian.org (=?ISO-8859-1?Q?Patrick_Matth=E4i?=) Date: Fri, 25 May 2012 20:56:37 +0200 Subject: [Gluster-devel] glusterfs-3.2.7qa1 released In-Reply-To: <20120412172933.6A2A8102E6@build.gluster.com> References: <20120412172933.6A2A8102E6@build.gluster.com> Message-ID: <4FBFD5E5.1060901@debian.org> On 12.04.2012 19:29, Vijay Bellur wrote: > > http://bits.gluster.com/pub/gluster/glusterfs/3.2.7qa1/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.2.7qa1.tar.gz > > This release is made off v3.2.7qa1 Hey, I have tested this qa release and could not find any regression/problem. It would be really nice to have a 3.2.7 release in the next days (max 2 weeks from now on) so that we could ship glusterfs 3.2.7 instead of 3.2.6 with our next release, Debian Wheezy! -- /* Mit freundlichem Gruß / With kind regards, Patrick Matthäi GNU/Linux Debian Developer E-Mail: pmatthaei at debian.org patrick at linux-dev.org */ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From fernando.frediani at qubenet.net Fri May 25 19:33:37 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 19:33:37 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> Hi Anand, Thanks for that. It actually worked using Distributed+Replicated. However, there are two main reasons I am testing version 3.3: first and mainly the granular locking, which makes it suited to running VMs, and also because with Repstr (Replicated + Striped (+ Distributed)) the VMDK files, which are normally large, would be split into many chunks across several bricks, increasing both read and write performance, since the IOPS would be spread over all the bricks and disks containing the chunks of the file. Also, if I understand correctly, with this new volume type a VM that has a massive VMDK file (2TB for example) wouldn't be stored on a single brick, which prevents one brick from becoming unbalanced in the amount of free space compared to the others. Am I right in my assumptions?
> > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From fernando.frediani at qubenet.net Fri May 25 20:32:25 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 20:32:25 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. 
I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From manu at netbsd.org Sat May 26 05:37:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 07:37:51 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate Message-ID: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> here is a bug in release-3.3: ./xinstall -c -p -r -m 555 xinstall /pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/i386--netbsdelf-instal xinstall: /pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/inst.00033a: chmod: Permission denied Kernel trace, client side: 33 1 xinstall CALL open(0xbfbfd8e0,0xa02,0x180) 33 1 xinstall NAMI "/pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i38 6/bin/inst.00033a" 33 1 xinstall RET open 3 33 1 xinstall CALL open(0x (...) 33 1 xinstall CALL fchmod(3,0x16d) 33 1 xinstall RET fchmod -1 errno 13 Permission denied I tracked this down to posix_acl_truncate() on the server, where loc->inode and loc->pah are NULL. This code goes red and raise EACCESS: if (acl_permits (frame, loc->inode, POSIX_ACL_WRITE)) goto green; else goto red; Here is the relevant baccktrace: #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at posix.c:204 #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at defaults.c:47 #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, loc=0xba60091c, xdata=0x0) at posix.c:231 In frame 12, loc->inode is not NULL, and loc->path makes sense: "/netbsd/usr/src/tooldir.NetBSD-6.9 9.4-i386/bin/inst.01911a" In frame 10, loc->path and loc->inode are NULL. In note that xlators/features/locks/src/posix.c:pl_ftruncate() sets truncate_stat_cbk() as the callback, and not ftruncate_stat_cbk(). That later function does not even exist. f-style functions not calling f-style callbacks have been the root of various bugs so far, is it one more of them? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vbellur at redhat.com Sat May 26 07:44:52 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Sat, 26 May 2012 13:14:52 +0530 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> References: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> Message-ID: <4FC089F4.3070004@redhat.com> On 05/26/2012 11:07 AM, Emmanuel Dreyfus wrote: > here is a bug in release-3.3: > > > I tracked this down to posix_acl_truncate() on the server, where loc->inode > and loc->pah are NULL. 
This code goes red and raise EACCESS: > > if (acl_permits (frame, loc->inode, POSIX_ACL_WRITE)) > goto green; > else > goto red; > > Here is the relevant baccktrace: > > #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, > loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 > #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, > this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) > at posix.c:204 > #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, > this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) > at defaults.c:47 > #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, > loc=0xba60091c, xdata=0x0) at posix.c:231 > > In frame 12, loc->inode is not NULL, and loc->path makes sense: > "/netbsd/usr/src/tooldir.NetBSD-6.9 > 9.4-i386/bin/inst.01911a" > > In frame 10, loc->path and loc->inode are NULL. > > In note that xlators/features/locks/src/posix.c:pl_ftruncate() sets > truncate_stat_cbk() as the callback, and not ftruncate_stat_cbk(). That later > function does not even exist. f-style functions not calling f-style callbacks > have been the root of various bugs so far, is it one more of them? I don't think it is a f-style problem. I do not get a EPERM with the testcase that you posted for qa39. Can you please provide a bigger bt? Thanks, Vijay > > From manu at netbsd.org Sat May 26 09:00:22 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 11:00:22 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkp7w9.1a5c4mz1tiqw8rM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. Can you please provide a bigger bt? 
#3 0xb99414c4 in server_truncate_cbk (frame=0xba901714, cookie=0xbb77f010, this=0xb9d27000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at server3_1-fops.c:1218 #4 0xb9968bd6 in io_stats_truncate_cbk (frame=0xbb77f010, cookie=0xbb77f080, this=0xb9d26000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at io-stats.c:1600 #5 0xb998036e in marker_truncate_cbk (frame=0xbb77f080, cookie=0xbb77f0f0, this=0xb9d25000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at marker.c:1535 #6 0xbbb87a85 in default_truncate_cbk (frame=0xbb77f0f0, cookie=0xbb77f160, this=0xb9d24000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at defaults.c:58 #7 0xb99a8fa2 in iot_truncate_cbk (frame=0xbb77f160, cookie=0xbb77f400, this=0xb9d23000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at io-threads.c:1270 #8 0xb99b9fe0 in pl_truncate_cbk (frame=0xbb77f400, cookie=0xbb77f780, this=0xb9d22000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at posix.c:119 #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at posix.c:204 #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at defaults.c:47 #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, loc=0xba60091c, xdata=0x0) at posix.c:231 #13 0xbbb94d76 in default_stat (frame=0xbb77f6a0, this=0xb9d20000, loc=0xba60091c, xdata=0x0) at defaults.c:1231 #14 0xb99babb0 in pl_truncate (frame=0xbb77f400, this=0xb9d22000, loc=0xba60091c, offset=48933, xdata=0x0) at posix.c:249 #15 0xb99a91ac in iot_truncate_wrapper (frame=0xbb77f160, this=0xb9d23000, loc=0xba60091c, offset=48933, xdata=0x0) at io-threads.c:1280 #16 0xbbba76d8 in call_resume_wind (stub=0xba6008fc) at call-stub.c:2474 #17 0xbbbae729 in call_resume (stub=0xba6008fc) at call-stub.c:4151 #18 0xb99a22a3 in iot_worker (data=0xb9d12110) at io-threads.c:131 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 11:51:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 13:51:46 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkpd53.bn09pz1v8qmwtM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. I wonder if the bug can occur because some mess in the .glusterfs directory cause by an earlier problem. Is it possible? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 12:55:08 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 14:55:08 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkpd53.bn09pz1v8qmwtM%manu@netbsd.org> Message-ID: <1kkpirc.geu5yvq0165fM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I wonder if the bug can occur because some mess in the .glusterfs > directory cause by an earlier problem. Is it possible? That is not the problem: I nuked .glusterfs on all bricks and the problem remain. 
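As an aside, the f-style pitfall Emmanuel describes earlier in this thread, a callback written for the path-based fop being reused by the fd-based one, can be illustrated with a toy example; all names below are hypothetical, only the shape of the bug is the point:

#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for loc_t / fd_t / the per-call local. */
typedef struct { void *inode; const char *path; } loc_t;
typedef struct { int fd; } fd_t;
typedef struct { loc_t loc; fd_t *fd; } local_t;

/* Callback written for the path-based variant: it assumes local->loc
 * was populated by the caller. */
static void truncate_cbk(local_t *local)
{
        if (local->loc.inode == NULL)
                printf("ACL check sees a NULL inode -> EACCES\n");
        else
                printf("ACL check runs against %s\n", local->loc.path);
}

int main(void)
{
        local_t path_local, fd_local;
        fd_t fd = { 3 };
        int inode;

        /* truncate(): the loc is filled in */
        memset(&path_local, 0, sizeof(path_local));
        path_local.loc.inode = &inode;
        path_local.loc.path  = "/some/file";
        truncate_cbk(&path_local);

        /* ftruncate(): only the fd is filled in, yet the same callback
         * is reused, so loc.inode is still NULL */
        memset(&fd_local, 0, sizeof(fd_local));
        fd_local.fd = &fd;
        truncate_cbk(&fd_local);

        return 0;
}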
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 14:20:10 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 16:20:10 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkpmmr.rrgubdjz6w9fM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. Can you please provide a bigger bt? Here is a minimal test case that reproduces the problem for me. Run it as an unprivileged user in a directory on which you have write access:

$ pwd
/pfs/manu/xinstall
$ ls -ld .
drwxr-xr-x 4 manu manu 512 May 26 16:17 .
$ id
uid=500(manu) gid=500(manu) groups=500(manu),0(wheel)
$ ./test
test: fchmod failed: Permission denied

#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <sysexits.h>
#include <unistd.h>

#define TESTFILE "testfile"

int main(void)
{
        int fd;
        char buf[16384];

        if ((unlink(TESTFILE) == -1) && (errno != ENOENT))
                err(EX_OSERR, "unlink failed");

        if ((fd = open(TESTFILE, O_CREAT|O_EXCL|O_RDWR, 0600)) == -1)
                err(EX_OSERR, "open failed");

        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                err(EX_OSERR, "write failed");

        if (fchmod(fd, 0555) == -1)
                err(EX_OSERR, "fchmod failed");

        if (close(fd) == -1)
                err(EX_OSERR, "close failed");

        return EX_OK;
}

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 27 05:17:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 27 May 2012 07:17:51 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> Message-ID: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org> Emmanuel Dreyfus wrote: > In frame 10, loc->path and loc->inode are NULL. Here is the investigation so far: xlators/features/locks/src/posix.c:truncate_stat_cbk() has a NULL loc->inode, and this leads to the acl check that fails. As I understand it, this is a FUSE implementation problem. fchmod() produces a FUSE SETATTR. If the file is being written, NetBSD FUSE will set mode, size, atime, mtime, and fh in this operation. I suspect Linux FUSE only sets mode and fh, and this is why the bug does not appear on Linux: the truncate code path is probably not involved. Can someone confirm? If this is the case, it suggests the code path may have never been tested. I suspect there are bugs there; for instance, in pl_truncate_cbk, local is erased right after being retrieved, which does not look right:

local = frame->local;

local = mem_get0 (this->local_pool);

if (local->op == TRUNCATE)
        loc_wipe (&local->loc);

I tried fixing that one without much improvement. There may be other problems. About fchmod() setting size: is it a reasonable behavior? FUSE does not specify what must happen, so if glusterfs relies on the Linux kernel not doing it, it may be begging for future bugs if that behavior changes. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vbellur at redhat.com Sun May 27 06:54:43 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Sun, 27 May 2012 12:24:43 +0530 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org> References: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org> Message-ID: <4FC1CFB3.7050808@redhat.com> On 05/27/2012 10:47 AM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > >> In frame 10, loc->path and loc->inode are NULL. > > As I understand it, this is a FUSE implementation problem. fchmod() produces a > FUSE SETATTR.
If the file is being written, NetBSD FUSE will set mode, > size, atime, mtime, and fh in this operation. I suspect Linux FUSE only > sets mode and fh, and this is why the bug does not appear on Linux: the > truncate code path is probably not involved. For the testcase that you sent out, I see fsi->valid being set to 1, which indicates only mode on Linux. The truncate path does not get involved. I modified the testcase to send ftruncate/truncate and it completed successfully. > > Can someone confirm? If this is the case, it suggests the code path may > have never been tested. I suspect there are bugs there; for instance, in > pl_truncate_cbk, local is erased right after being retrieved, which does not > look right: > > local = frame->local; > > local = mem_get0 (this->local_pool); I don't see this in pl_truncate_cbk(). mem_get0 is done only in pl_truncate(). A code inspection of pl_(f)truncate did not raise any suspicions for me. > > About fchmod() setting size: is it a reasonable behavior? FUSE does not > specify what must happen, so if glusterfs relies on the Linux kernel not > doing it, it may be begging for future bugs if that behavior changes. > I am not sure why fchmod() should set size. Csaba, any thoughts on this? Vijay From manu at netbsd.org Sun May 27 07:34:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 27 May 2012 09:34:02 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC1CFB3.7050808@redhat.com> Message-ID: <1kkqxw0.1smapd1jsih9iM%manu@netbsd.org> Vijay Bellur wrote: > For the testcase that you sent out, I see fsi->valid being set to 1, > which indicates only mode on Linux. The truncate path does not get > involved. I modified the testcase to send ftruncate/truncate and it > completed successfully. I modified my FUSE implementation to send FATTR_SIZE|FATTR_FH in one request, and FATTR_MODE|FATTR_FH|FATTR_MTIME|FATTR_ATIME in a separate one, and the test passes fine. On your test not raising the bug: is it possible that Linux already sent a FATTR_SIZE|FATTR_FH when fchmod() is invoked, and that glusterfs discards a FATTR_SIZE that does not really resize? Did you try supplying a bigger size? > > local = mem_get0 (this->local_pool); > I don't see this in pl_truncate_cbk(). mem_get0 is done only in > pl_truncate(). A code inspection of pl_(f)truncate did not raise any > suspicions for me. Right, this was an unfortunate copy/paste. However, reverting to the correct code does not fix the bug when FATTR_SIZE is sent together with FATTR_MODE. > I am not sure why fchmod() should set size. Csaba, any thoughts on this? This is an optimization: you have an open file, you just grew it, and you change its mode. The NetBSD kernel and its FUSE implementation do the two operations in a single FUSE request, because they are smart :-) I will commit the fix in NetBSD FUSE. But one day the Linux kernel could decide to use the same shortcut too. It may be wise to fix glusterfs so that it does not assume FATTR_SIZE is never sent along with other metadata changes. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Sun May 27 21:40:35 2012 From: anand.avati at gmail.com (Anand Avati) Date: Sun, 27 May 2012 14:40:35 -0700 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: <20120525133519.GC19383@homeworld.netbsd.org> References: <20120525133519.GC19383@homeworld.netbsd.org> Message-ID: Can you give some more steps on how you reproduced this? This has never happened in any of our testing.
This might be related to the dirname() differences in BSD? Have you noticed this after the GNU dirname usage? Avati On Fri, May 25, 2012 at 6:35 AM, Emmanuel Dreyfus wrote: > Hi > > Here is a bug with release-3.3. It happens on a 2 way replicated. Here is > what I have in one brick: > > [2012-05-25 15:03:09.463446] W [posix-handle.c:487:posix_handle_hard] > 0-pfs-posix: mismatching ino/dev between file > /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (57943061/16) > and handle > /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce > (57943060/16) > [2012-05-25 15:03:09.463552] E [posix.c:1277:posix_symlink] 0-pfs-posix: > setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 > failed > > On the other one: > > [2012-05-25 15:03:09.447682] W [posix-handle.c:487:posix_handle_hard] > 0-pfs-posix: mismatching ino/dev between file > /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (50557989/24) > and handle > /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce > (50557988/24) > [2012-05-25 15:03:09.447774] E [posix.c:1277:posix_symlink] 0-pfs-posix: > setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 > failed > > Can someone give me a hint of what happens here, and how to track it down? > -- > Emmanuel Dreyfus > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Mon May 28 01:52:41 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 03:52:41 +0200 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: Message-ID: <1kkscxr.1k0ou1xcxcd7rM%manu@netbsd.org> Anand Avati wrote: > Can you give some more steps on how you reproduced this? This has never > happened in any of our testing. This might be related to the > dirname() differences in BSD? Have you noticed this after the GNU dirname > usage? I will investigate further. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 02:08:19 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 04:08:19 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkqxw0.1smapd1jsih9iM%manu@netbsd.org> Message-ID: <1kkscze.1y0ip7wj3y9uoM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I modified my FUSE implementation to send FATTR_SIZE|FATTR_FH in one > request, and FATTR_MODE|FATTR_FH|FATTR_MTIME|FATTR_ATIME in a separate > one, and the test passes fine. Um, I spoke too fast. Please disregard the previous post. The problem was not setting size and mode in the same request: that works fine. The bug appears when setting size, atime and mtime. It also appears when setting mode, atime and mtime. So here is the summary so far:

ATTR_SIZE|FATTR_FH -> ok
ATTR_SIZE|FATTR_FH|FATTR_ATIME|FATTR_MTIME -> breaks (*)
ATTR_MODE|FATTR_FH -> ok
ATTR_MODE|FATTR_FH|FATTR_ATIME|FATTR_MTIME -> breaks
ATTR_MODE|FATTR_SIZE|FATTR_FH -> ok (I was wrong here)

(*) I noticed that one a long time ago, and NetBSD FUSE already strips atime and mtime if ATTR_SIZE is set without ATTR_MODE|ATTR_UID|ATTR_GID.
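For reference alongside this summary, here is a small illustrative dispatch on the SETATTR valid bits, showing why adding FATTR_SIZE to a request pulls in the truncate path; the FATTR_* values are the standard FUSE protocol bits, while the handler and its stubs are hypothetical and not the gluster fuse-bridge code:

#include <stdint.h>
#include <stdio.h>

#define FATTR_MODE   (1 << 0)
#define FATTR_UID    (1 << 1)
#define FATTR_GID    (1 << 2)
#define FATTR_SIZE   (1 << 3)
#define FATTR_ATIME  (1 << 4)
#define FATTR_MTIME  (1 << 5)
#define FATTR_FH     (1 << 6)

/* stubs standing in for the real fops */
static void do_truncate(void) { printf("-> (f)truncate path\n"); }
static void do_setattr(void)  { printf("-> (f)setattr path\n"); }

static void handle_setattr(uint32_t valid)
{
        if (valid & FATTR_SIZE)
                do_truncate();  /* size changes go through the truncate
                                   path, which is where the NULL loc
                                   showed up in this thread */
        if (valid & (FATTR_MODE | FATTR_UID | FATTR_GID |
                     FATTR_ATIME | FATTR_MTIME))
                do_setattr();   /* everything else goes through setattr */
}

int main(void)
{
        /* the combination reported as broken above */
        handle_setattr(FATTR_SIZE | FATTR_FH | FATTR_ATIME | FATTR_MTIME);
        return 0;
}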
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 05:07:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 07:07:46 +0200 Subject: [Gluster-devel] Testing server down in replicated volume Message-ID: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> Hi everybody After the last fix in NetBSD FUSE (cf NULL loc in posix_acl_truncate), glusterfs release-3.3 now behaves quite nicely on NetBSD. I have been able to build stuff in a replicated glusterfs volume for a few hours, and it seems much faster than 3.2.6. However things turn badly when I tried to kill glusterfsd on a server. Since the volume is replicated, I would have expected the build to carry on unaffected. but this is now what happens: a ENOTCONN is raised up to the processes using the glusterfs volume: In file included from /pfs/manu/netbsd/usr/src/sys/sys/signal.h:114, from /pfs/manu/netbsd/usr/src/sys/sys/param.h:150, from /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/net/__cmsg_align bytes.c:40: /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string /machine/signal.h: Socket is not connected Is it the intended behavior? Here is the client log: [2012-05-28 05:48:27.440017] W [socket.c:195:__socket_rwv] 0-pfs-client-1: writev failed (Broken pipe) [2012-05-28 05:48:27.440989] W [socket.c:195:__socket_rwv] 0-pfs-client-1: readv failed (Connection reset by peer) [2012-05-28 05:48:27.441496] W [socket.c:1512:__socket_proto_state_machine] 0-pfs-client-1: reading from socket failed. Error (Connection reset by peer), peer (193.54.82.98:24011) [2012-05-28 05:48:27.441825] E [rpc-clnt.c:373:saved_frames_unwind] 0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(INODELK(29)) called at 2012-05-28 05:48:27.439249 (xid=0x1715867x) [2012-05-28 05:48:27.442222] W [client3_1-fops.c:1495:client3_1_inodelk_cbk] 0-pfs-client-1: remote operation failed: Socket is not connected [2012-05-28 05:48:27.442528] E [rpc-clnt.c:373:saved_frames_unwind] 0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(SETATTR(38)) called at 2012-05-28 05:48:27.440397 (xid=0x1715868x) [2012-05-28 05:48:27.442971] W [client3_1-fops.c:1954:client3_1_setattr_cbk] 0-pfs-client-1: remote operation failed: Socket is not connected (and so on with other saved_frames_unwind) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 05:08:36 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 07:08:36 +0200 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: <1kkscxr.1k0ou1xcxcd7rM%manu@netbsd.org> Message-ID: <1kksmhc.zfnn6i6bllp8M%manu@netbsd.org> Emmanuel Dreyfus wrote: > > Can you give some more steps how you reproduced this? This has never > > happened in any of our testing. This might probably related to the > > dirname() differences in BSD? Have you noticed this after the GNU dirname > > usage? > I will investigate further. It does not happen anymore. I think it was a consequence of the other bug I fixed. 
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 29 07:55:09 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 29 May 2012 07:55:09 +0000 Subject: [Gluster-devel] Testing server down in replicated volume In-Reply-To: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> References: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> Message-ID: <20120529075509.GE19383@homeworld.netbsd.org> On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org From pkarampu at redhat.com Tue May 29 09:09:04 2012 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Tue, 29 May 2012 05:09:04 -0400 (EDT) Subject: [Gluster-devel] Testing server down in replicated volume In-Reply-To: <20120529075509.GE19383@homeworld.netbsd.org> Message-ID: <97e7abfe-e431-47b8-bb26-cf70adbef253@zmail01.collab.prod.int.phx2.redhat.com> I am looking into this. Will reply soon. Pranith ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Emmanuel Dreyfus" Cc: gluster-devel at nongnu.org Sent: Tuesday, May 29, 2012 1:25:09 PM Subject: Re: [Gluster-devel] Testing server down in replicated volume On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From vijay at build.gluster.com Tue May 29 13:44:11 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Tue, 29 May 2012 06:44:11 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa44 released Message-ID: <20120529134412.E8A3C100CB@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa44/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa44.tar.gz This release is made off v3.3.0qa44 From pkarampu at redhat.com Tue May 29 17:28:32 2012 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Tue, 29 May 2012 13:28:32 -0400 (EDT) Subject: [Gluster-devel] Testing server down in replicated volume In-Reply-To: <20120529075509.GE19383@homeworld.netbsd.org> Message-ID: <4fb4ce32-9683-44cd-a7bd-aa935c79db29@zmail01.collab.prod.int.phx2.redhat.com> hi Emmanuel, I tried this for half an hour, everytime it failed because of readdir. It did not fail in any other fop. I saw that FINODELKs which relate to transactions in afr failed, but the fop succeeded on the other brick. I am not sure why a setattr (metadata transaction) is failing in your setup when a node is down. I will instrument the code to simulate the inodelk failure in setattr. Will update you tomorrow. Fop failing in readdir is also an issue that needs to be addressed. Pranith. 
----- Original Message ----- From: "Emmanuel Dreyfus" To: "Emmanuel Dreyfus" Cc: gluster-devel at nongnu.org Sent: Tuesday, May 29, 2012 1:25:09 PM Subject: Re: [Gluster-devel] Testing server down in replicated volume On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From bfoster at redhat.com Wed May 30 15:16:16 2012 From: bfoster at redhat.com (Brian Foster) Date: Wed, 30 May 2012 11:16:16 -0400 Subject: [Gluster-devel] glusterfs client and page cache Message-ID: <4FC639C0.6020503@redhat.com> Hi all, I've been playing with a little hack recently to add a gluster mount option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts on whether there's value to find an intelligent way to support this functionality. To provide some context: Our current behavior with regard to fuse is that page cache is utilized by fuse, from what I can tell, just about in the same manner as a typical local fs. The primary difference is that by default, the address space mapping for an inode is completely invalidated on open. So for example, if process A opens and reads a file in a loop, subsequent reads are served from cache (bypassing fuse and gluster). If process B steps in and opens the same file, the cache is flushed and the next reads from either process are passed down through fuse. The FOPEN_KEEP_CACHE option simply disables this cache flash on open behavior. The following are some notes on my experimentation thus far: - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size changes. This is a problem in that I can rewrite some or all of a file from another client and the cached client wouldn't notice. I've sent a patch to fuse-devel to also invalidate on mtime changes (similar to nfsv3 or cifs), so we'll see how well that is received. fuse also supports a range based invalidation notification that we could take advantage of if necessary. - I reproduce a measurable performance benefit in the local/cached read situation. For example, running a kernel compile against a source tree in a gluster volume (no other xlators and build output to local storage) improves to 6 minutes from just under 8 minutes with the default graph (9.5 minutes with only the client xlator and 1:09 locally). - Some of the specific differences from current io-cache caching: - io-cache supports time based invalidation and tunables such as cache size and priority. The page cache has no such controls. - io-cache invalidates more frequently on various fops. It also looks like we invalidate on writes and don't take advantage of the write data most recently sent, whereas page cache writes are cached (errors notwithstanding). - Page cache obviously has tighter integration with the system (i.e., drop_caches controls, more specific reporting, ability to drop cache when memory is needed). All in all, I'm curious what people think about enabling the cache behavior in gluster. 
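As background for the mount-option idea, here is a minimal generic libfuse sketch of what enabling FOPEN_KEEP_CACHE per open looks like; this is plain libfuse rather than the glusterfs fuse-bridge, and the single file it serves is a made-up example:

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

static const char *kc_data = "cached content\n";

static int kc_getattr(const char *path, struct stat *st)
{
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
                st->st_mode = S_IFDIR | 0755;
                st->st_nlink = 2;
        } else if (strcmp(path, "/hello") == 0) {
                st->st_mode = S_IFREG | 0444;
                st->st_nlink = 1;
                st->st_size = strlen(kc_data);
        } else {
                return -ENOENT;
        }
        return 0;
}

static int kc_open(const char *path, struct fuse_file_info *fi)
{
        if (strcmp(path, "/hello") != 0)
                return -ENOENT;
        fi->keep_cache = 1;     /* -> FOPEN_KEEP_CACHE in the open reply */
        return 0;
}

static int kc_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *fi)
{
        size_t len = strlen(kc_data);

        (void) path;
        (void) fi;
        if ((size_t) off >= len)
                return 0;
        if (off + size > len)
                size = len - off;
        memcpy(buf, kc_data + off, size);
        return size;
}

static struct fuse_operations kc_ops = {
        .getattr = kc_getattr,
        .open    = kc_open,
        .read    = kc_read,
};

int main(int argc, char *argv[])
{
        return fuse_main(argc, argv, &kc_ops, NULL);
}

With keep_cache left at 0, the kernel invalidates the inode's page cache on every open, which is the default behavior described above.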
We could support anything from the basic mount option I'm currently using (i.e., similar to attribute/dentry caching) to something integrated with io-cache (doing invalidations when necessary), or maybe even something eventually along the lines of the nfs weak cache consistency model where it validates the cache after every fop based on file attributes. In general, are there other big issues/questions that would need to be explored before this is useful (i.e., the size invalidation issue)? Are there other performance tests that should be explored? Thoughts appreciated. Thanks. Brian From fernando.frediani at qubenet.net Wed May 30 16:19:36 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Wed, 30 May 2012 16:19:36 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> Does anyone have an idea of this problem of not being able to power up the virtual machines on that NFS mount ? Also what do those logs mean that Anand say that there is a problem with the Repstr model. Is it something isn't finished yet ? Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 21:32 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. 
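For context, the volume type referred to as "Repstr" above is created along these lines on 3.3; the hostnames and brick paths here are placeholders.

    # 4 bricks = 2 stripes x 2 replicas; servers and paths are examples only
    gluster volume create vmstore stripe 2 replica 2 transport tcp \
        gfs1:/export/brick1 gfs2:/export/brick1 \
        gfs3:/export/brick1 gfs4:/export/brick1
    gluster volume start vmstore
    gluster volume info vmstore   # should report a Striped-Replicate type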
Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. 
> > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Wed May 30 19:32:50 2012 From: anand.avati at gmail.com (Anand Avati) Date: Wed, 30 May 2012 12:32:50 -0700 Subject: [Gluster-devel] glusterfs client and page cache In-Reply-To: <4FC639C0.6020503@redhat.com> References: <4FC639C0.6020503@redhat.com> Message-ID: Brian, You are right, today we hardly leverage the page cache in the kernel. When Gluster started and performance translators were implemented, the fuse invalidation support did not exist, and since that support was brought in upstream fuse we haven't leveraged that effectively. We can actually do a lot more smart things using the invalidation changes. For the consistency concerns where an open fd continues to refer to local page cache - if that is a problem, today you need to mount with --enable-direct-io-mode to bypass the page cache altogether (this is very different from O_DIRECT open() support). On the other hand, to utilize the fuse invalidation APIs and promote using the page cache and still be consistent, we need to gear up glusterfs framework by first implementing server originated messaging support, then build some kind of opportunistic locking or leases to notify glusterfs clients about modifications from a second client, and third implement hooks in the client side listener to do things like sending fuse invalidations or purge pages in io-cache or flush pending writes in write-behind etc. This needs to happen, but we're short on resources to prioritize this sooner :-) Avati On Wed, May 30, 2012 at 8:16 AM, Brian Foster wrote: > Hi all, > > I've been playing with a little hack recently to add a gluster mount > option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts > on whether there's value to find an intelligent way to support this > functionality. To provide some context: > > Our current behavior with regard to fuse is that page cache is utilized > by fuse, from what I can tell, just about in the same manner as a > typical local fs. The primary difference is that by default, the address > space mapping for an inode is completely invalidated on open. So for > example, if process A opens and reads a file in a loop, subsequent reads > are served from cache (bypassing fuse and gluster). If process B steps > in and opens the same file, the cache is flushed and the next reads from > either process are passed down through fuse. The FOPEN_KEEP_CACHE option > simply disables this cache flash on open behavior. 
> > The following are some notes on my experimentation thus far: > > - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size > changes. This is a problem in that I can rewrite some or all of a file > from another client and the cached client wouldn't notice. I've sent a > patch to fuse-devel to also invalidate on mtime changes (similar to > nfsv3 or cifs), so we'll see how well that is received. fuse also > supports a range based invalidation notification that we could take > advantage of if necessary. > > - I reproduce a measurable performance benefit in the local/cached read > situation. For example, running a kernel compile against a source tree > in a gluster volume (no other xlators and build output to local storage) > improves to 6 minutes from just under 8 minutes with the default graph > (9.5 minutes with only the client xlator and 1:09 locally). > > - Some of the specific differences from current io-cache caching: > - io-cache supports time based invalidation and tunables such as > cache > size and priority. The page cache has no such controls. > - io-cache invalidates more frequently on various fops. It also > looks > like we invalidate on writes and don't take advantage of the write data > most recently sent, whereas page cache writes are cached (errors > notwithstanding). > - Page cache obviously has tighter integration with the system > (i.e., > drop_caches controls, more specific reporting, ability to drop cache > when memory is needed). > > All in all, I'm curious what people think about enabling the cache > behavior in gluster. We could support anything from the basic mount > option I'm currently using (i.e., similar to attribute/dentry caching) > to something integrated with io-cache (doing invalidations when > necessary), or maybe even something eventually along the lines of the > nfs weak cache consistency model where it validates the cache after > every fop based on file attributes. > > In general, are there other big issues/questions that would need to be > explored before this is useful (i.e., the size invalidation issue)? Are > there other performance tests that should be explored? Thoughts > appreciated. Thanks. > > Brian > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bfoster at redhat.com Wed May 30 23:10:58 2012 From: bfoster at redhat.com (Brian Foster) Date: Wed, 30 May 2012 19:10:58 -0400 Subject: [Gluster-devel] glusterfs client and page cache In-Reply-To: References: <4FC639C0.6020503@redhat.com> Message-ID: <4FC6A902.9010406@redhat.com> On 05/30/2012 03:32 PM, Anand Avati wrote: > Brian, > You are right, today we hardly leverage the page cache in the kernel. > When Gluster started and performance translators were implemented, the > fuse invalidation support did not exist, and since that support was > brought in upstream fuse we haven't leveraged that effectively. We can > actually do a lot more smart things using the invalidation changes. > > For the consistency concerns where an open fd continues to refer to > local page cache - if that is a problem, today you need to mount with > --enable-direct-io-mode to bypass the page cache altogether (this is > very different from O_DIRECT open() support). 
On the other hand, to > utilize the fuse invalidation APIs and promote using the page cache and > still be consistent, we need to gear up glusterfs framework by first > implementing server originated messaging support, then build some kind > of opportunistic locking or leases to notify glusterfs clients about > modifications from a second client, and third implement hooks in the > client side listener to do things like sending fuse invalidations or > purge pages in io-cache or flush pending writes in write-behind etc. > This needs to happen, but we're short on resources to prioritize this > sooner :-) > Thanks for the context Avati. The fuse patch I sent lead to a similar thought process with regard to finer grained invalidation. So far it seems well received, and as I understand it, we can also utilize that mechanism to do full invalidations from gluster on older fuse modules that wouldn't have that fix. I'll look into incorporating that into what I have so far and making it available for review. Brian > Avati > > On Wed, May 30, 2012 at 8:16 AM, Brian Foster > wrote: > > Hi all, > > I've been playing with a little hack recently to add a gluster mount > option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts > on whether there's value to find an intelligent way to support this > functionality. To provide some context: > > Our current behavior with regard to fuse is that page cache is utilized > by fuse, from what I can tell, just about in the same manner as a > typical local fs. The primary difference is that by default, the address > space mapping for an inode is completely invalidated on open. So for > example, if process A opens and reads a file in a loop, subsequent reads > are served from cache (bypassing fuse and gluster). If process B steps > in and opens the same file, the cache is flushed and the next reads from > either process are passed down through fuse. The FOPEN_KEEP_CACHE option > simply disables this cache flash on open behavior. > > The following are some notes on my experimentation thus far: > > - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size > changes. This is a problem in that I can rewrite some or all of a file > from another client and the cached client wouldn't notice. I've sent a > patch to fuse-devel to also invalidate on mtime changes (similar to > nfsv3 or cifs), so we'll see how well that is received. fuse also > supports a range based invalidation notification that we could take > advantage of if necessary. > > - I reproduce a measurable performance benefit in the local/cached read > situation. For example, running a kernel compile against a source tree > in a gluster volume (no other xlators and build output to local storage) > improves to 6 minutes from just under 8 minutes with the default graph > (9.5 minutes with only the client xlator and 1:09 locally). > > - Some of the specific differences from current io-cache caching: > - io-cache supports time based invalidation and tunables such > as cache > size and priority. The page cache has no such controls. > - io-cache invalidates more frequently on various fops. It > also looks > like we invalidate on writes and don't take advantage of the write data > most recently sent, whereas page cache writes are cached (errors > notwithstanding). > - Page cache obviously has tighter integration with the > system (i.e., > drop_caches controls, more specific reporting, ability to drop cache > when memory is needed). 
> > All in all, I'm curious what people think about enabling the cache > behavior in gluster. We could support anything from the basic mount > option I'm currently using (i.e., similar to attribute/dentry caching) > to something integrated with io-cache (doing invalidations when > necessary), or maybe even something eventually along the lines of the > nfs weak cache consistency model where it validates the cache after > every fop based on file attributes. > > In general, are there other big issues/questions that would need to be > explored before this is useful (i.e., the size invalidation issue)? Are > there other performance tests that should be explored? Thoughts > appreciated. Thanks. > > Brian > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > >

From johnmark at redhat.com Thu May 31 16:33:20 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 31 May 2012 12:33:20 -0400 (EDT) Subject: [Gluster-devel] A very special announcement from Gluster.org In-Reply-To: <344ab6e5-d6de-48d9-bfe8-e2727af7b45e@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <660ccad1-e191-405c-8645-1cb2fb02f80c@zmail01.collab.prod.int.phx2.redhat.com>

Today, we're announcing the next generation of GlusterFS, version 3.3. The release has been a year in the making and marks several firsts: the first post-acquisition release under Red Hat, our first major act as an openly-governed project and our first foray beyond NAS. We've also taken our first steps towards merging big data and unstructured data storage, giving users and developers new ways of managing their data scalability challenges.

GlusterFS is an open source, fully distributed storage solution for the world's ever-increasing volume of unstructured data. It is a software-only, highly available, scale-out, centrally managed storage pool that can be backed by POSIX filesystems that support extended attributes, such as Ext3/4, XFS, BTRFS and many more.

This release provides many of the most commonly requested features including proactive self-healing, quorum enforcement, and granular locking for self-healing, as well as many additional bug fixes and enhancements. Some of the more noteworthy features include:

* Unified File and Object storage - Blending OpenStack's Object Storage API with GlusterFS provides simultaneous read and write access to data as files or as objects.

* HDFS compatibility - Gives Hadoop administrators the ability to run MapReduce jobs on unstructured data on GlusterFS and access the data with well-known tools and shell scripts.

* Proactive self-healing - GlusterFS volumes will now automatically restore file integrity after a replica recovers from failure.

* Granular locking - Allows large files to be accessed even during self-healing, a feature that is particularly important for VM images.

* Replication improvements - With quorum enforcement you can be confident that your data has been written in at least the configured number of places before the file operation returns, allowing a user-configurable adjustment to fault tolerance vs performance.

Visit http://www.gluster.org to download. Packages are available for most distributions, including Fedora, Debian, RHEL, Ubuntu and CentOS.

Get involved! Join us on #gluster on freenode, join our mailing list, "like" our Facebook page, follow us on Twitter, or check out our LinkedIn group.
GlusterFS is an open source project sponsored by Red Hat ?, who uses it in its line of Red Hat Storage products. (this post published at http://www.gluster.org/2012/05/introducing-glusterfs-3-3/ ) -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.frediani at qubenet.net Thu May 31 16:36:36 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Thu, 31 May 2012 16:36:36 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F8DC08@QUBE-TR2-EXC01.qube.qubenet.net> What is happening with this ? Non one actually care to take ownership about this ? If this is a bug why nobody is interested to get it fixed ? If not someone speak up please. Two things are not working as they supposed, I am reporting back and nobody seems to give a dam about it. -----Original Message----- From: Fernando Frediani (Qube) Sent: 30 May 2012 17:20 To: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Does anyone have an idea of this problem of not being able to power up the virtual machines on that NFS mount ? Also what do those logs mean that Anand say that there is a problem with the Repstr model. Is it something isn't finished yet ? Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 21:32 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. 
However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > 
I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From johnmark at redhat.com Thu May 31 16:48:45 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 31 May 2012 12:48:45 -0400 (EDT) Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F8DC08@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <59507de0-4264-4e27-ac94-c9b34890a5f4@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > What is happening with this ? > Non one actually care to take ownership about this ? > If this is a bug why nobody is interested to get it fixed ? If not > someone speak up please. > Two things are not working as they supposed, I am reporting back and > nobody seems to give a dam about it. Hi Fernando, If nobody is replying, it's because they don't have experience with your particular setup, or they've never seen this problem before. If you feel it's a bug, then please file a bug at http://bugzilla.redhat.com/ You can also ask questions on the IRC channel: #gluster Or on http://community.gluster.org/ I know it can be frustrating, but please understand that you will get a response only if someone out there has experience with your problem. Thanks, John Mark Community guy
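If it helps with that bug report, a rough way to reproduce and capture the crash from an ordinary Linux NFS client is sketched below; Gluster's built-in NFS server only speaks NFSv3 over TCP, and the host and volume names are illustrative.

    # Mount the volume name (not a brick path) over NFSv3/TCP
    showmount -e gfs1
    mount -t nfs -o vers=3,proto=tcp,nolock gfs1:/vmstore /mnt/test
    ls /mnt/test     # reproduces the reported hang on a striped-replicate volume

    # The backtrace lands in the NFS server log on the node that was mounted;
    # attach it together with `gluster volume info` output to the bug report.
    less /var/log/glusterfs/nfs.log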
From manu at netbsd.org Wed May 2 09:30:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 2 May 2012 09:30:32 +0000 Subject: [Gluster-devel] qa39 crash In-Reply-To: References: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org> <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> Message-ID: <20120502093032.GI3677@homeworld.netbsd.org>

On Tue, May 01, 2012 at 10:29:22PM -0700, Anand Avati wrote: > Can you confirm if this fixes (obvious bug) -

I do not crash anymore, but I spotted another bug; I do not know if it is related: removing owner write access from a non-empty file that is open with write access fails with EPERM. Here is my test case. It works fine with glusterfs-3.2.6 but fchmod() fails with EPERM on 3.3.0qa39

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>
    #include <sysexits.h>

    int main(void)
    {
            int fd;
            char buf[16];

            if ((fd = open("test.tmp", O_RDWR|O_CREAT, 0644)) == -1)
                    err(EX_OSERR, "open failed");
            if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                    err(EX_OSERR, "write failed");
            if (fchmod(fd, 0444) == -1)
                    err(EX_OSERR, "fchmod failed");
            if (close(fd) == -1)
                    err(EX_OSERR, "close failed");
            return EX_OK;
    }

-- Emmanuel Dreyfus manu at netbsd.org

From xhernandez at datalab.es Wed May 2 10:55:37 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Wed, 02 May 2012 12:55:37 +0200 Subject: [Gluster-devel] Some questions about requisites of translators Message-ID: <4FA112A9.1080101@datalab.es>

Hello, I'm wondering if there are any requisites that translators must satisfy to work correctly inside glusterfs. In particular I need to know two things:

1. Are translators required to respect the order in which they receive the requests ? This is especially important in translators such as performance/io-threads or caching ones. It seems that these translators can reorder requests. If this is the case, is there any way to force some order between requests ? Can inodelk/entrylk be used to force the order ?

2. Are translators required to propagate callback arguments even if the result of the operation is an error ? And if an internal translator error occurs ? When a translator has multiple subvolumes, I've seen that some arguments, such as xdata, are replaced with NULL. This can be understood, but are regular translators (those that only have one subvolume) allowed to do that or must they preserve the value of xdata, even in the case of an internal error ? If this is not a requisite, xdata loses its function of delivering back extra information.
Thank you very much, Xavi From anand.avati at gmail.com Sat May 5 06:02:30 2012 From: anand.avati at gmail.com (Anand Avati) Date: Fri, 4 May 2012 23:02:30 -0700 Subject: [Gluster-devel] Some questions about requisites of translators In-Reply-To: <4FA112A9.1080101@datalab.es> References: <4FA112A9.1080101@datalab.es> Message-ID: On Wed, May 2, 2012 at 3:55 AM, Xavier Hernandez wrote: > Hello, > > I'm wondering if there are any requisites that translators must satisfy to > work correctly inside glusterfs. > > In particular I need to know two things: > > 1. Are translators required to respect the order in which they receive the > requests ? > > This is specially important in translators such as performance/io-threads > or caching ones. It seems that these translators can reorder requests. If > this is the case, is there any way to force some order between requests ? > can inodelk/entrylk be used to force the order ? > > Translators are not expected to maintain ordering of requests. The only translator which takes care of ordering calls is write-behind. After acknowledging back write requests it has to make sure future requests see the true "effect" as though the previous write actually completed. To that end, it queues future "dependent" requests till the write acknowledgement is received from the server. inodelk/entrylk calls help achieve synchronization among clients (by getting into a critical section) - just like a mutex. It is an arbitrator. It does not help for ordering of two calls. If one call must strictly complete after another call from your translator's point of view (i.e, if it has such a requirement), then the latter call's STACK_WIND must happen in the callback of the former's STACK_UNWIND path. There are no guarantees maintained by the system to ensure that a second STACK_WIND issued right after a first STACK_WIND will complete and callback in the same order. Write-behind does all its ordering gimmicks only because it STACK_UNWINDs a write call prematurely and therefore must maintain the causal effects by means of queueing new requests behind the downcall towards the server. > 2. Are translators required to propagate callback arguments even if the > result of the operation is an error ? and if an internal translator error > occurs ? > > Usually no. If op_ret is -1, only op_errno is expected to be a usable value. Rest of the callback parameters are junk. > When a translator has multiple subvolumes, I've seen that some arguments, > such as xdata, are replaced with NULL. This can be understood, but are > regular translators (those that only have one subvolume) allowed to do that > or must they preserve the value of xdata, even in the case of an internal > error ? > > It is best to preserve the arguments unless you know specifically what you are doing. In case of error, all the non-op_{ret,errno} arguments are typically junk, including xdata. > If this is not a requisite, xdata loses it's function of delivering back > extra information. > > Can you explain? Are you seeing a use case for having a valid xdata in the callback even with op_ret == -1? Thanks, Avati -------------- next part -------------- An HTML attachment was scrubbed... 
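A sketch of the pattern described above: the dependent request is wound only from the first request's callback, so lower translators cannot reorder the two. The translator, callback and local-structure names are hypothetical; only the 3.3-era writev fop and callback prototypes and the STACK_WIND/STACK_UNWIND_STRICT macros are real.

    /* Hypothetical translator fragment: the second writev may not overtake
     * the first, so it is issued only once the first one has called back. */

    int32_t
    demo_writev2_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                      int32_t op_ret, int32_t op_errno,
                      struct iatt *prebuf, struct iatt *postbuf, dict_t *xdata)
    {
            STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno,
                                 prebuf, postbuf, xdata);
            return 0;
    }

    int32_t
    demo_writev1_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                      int32_t op_ret, int32_t op_errno,
                      struct iatt *prebuf, struct iatt *postbuf, dict_t *xdata)
    {
            demo_local_t *local = frame->local;  /* hypothetical local struct */

            if (op_ret < 0) {
                    /* on failure only op_errno is meaningful */
                    STACK_UNWIND_STRICT (writev, frame, -1, op_errno,
                                         NULL, NULL, NULL);
                    return 0;
            }

            /* Issued here, in the callback path, so ordering is guaranteed. */
            STACK_WIND (frame, demo_writev2_cbk,
                        FIRST_CHILD (this), FIRST_CHILD (this)->fops->writev,
                        local->fd, local->vector, local->count, local->offset,
                        local->flags, local->iobref, NULL);
            return 0;
    }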
URL: From tsato at valinux.co.jp Mon May 7 04:17:45 2012 From: tsato at valinux.co.jp (Tomoaki Sato) Date: Mon, 07 May 2012 13:17:45 +0900 Subject: [Gluster-devel] showmount reports many entries (Re: glusterfs-3.3.0qa39 released) In-Reply-To: <4F9A98E8.80400@gluster.com> References: <20120427053612.E08671804F5@build.gluster.com> <4F9A6422.3010000@valinux.co.jp> <4F9A98E8.80400@gluster.com> Message-ID: <4FA74CE9.8010805@valinux.co.jp> (2012/04/27 22:02), Vijay Bellur wrote: > On 04/27/2012 02:47 PM, Tomoaki Sato wrote: >> Vijay, >> >> I have been testing gluster-3.3.0qa39 NFS with 4 CentOS 6.2 NFS clients. >> The test set is like following: >> 1) All 4 clients mount 64 directories. (total 192 directories) >> 2) 192 procs runs on the 4 clients. each proc create a new unique file and write 1GB data to the file. (total 192GB) >> 3) All 4 clients umount 64 directories. >> >> The test finished successfully but showmount command reported many entries in spite of there were no NFS clients remain. >> Then I have restarted gluster related daemons. >> After restarting, showmount command reports no entries. >> Any insight into this is much appreciated. > > > http://review.gluster.com/2973 should fix this. Can you please confirm? > > > Thanks, > Vijay Vijay, I have confirmed that following instructions with c3a16c32. # showmount one Hosts on one: # mkdir /tmp/mnt # mount one:/one /tmp/mnt # showmount one Hosts on one: 172.17.200.108 # umount /tmp/mnt # showmount one Hosts on one: # And the test set has started running. It will take a couple of days to finish. by the way, I did following instructions to build RPM packages on a CentOS 5.6 x86_64 host. # yum install python-ctypes ncureses-devel readline-devel libibverbs-devel # git clone -b c3a16c32 ssh://@git.gluster.com/glusterfs.git glusterfs-3git # tar zcf /usr/src/redhat/SOURCES/glusterfs-3bit.tar.gz glusterfs-3git # rpmbuild -bb /usr/src/redhat/SOURCES/glusterfs-3git.tar.gz Thanks, Tomo Sato From manu at netbsd.org Mon May 7 04:39:22 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 7 May 2012 04:39:22 +0000 Subject: [Gluster-devel] Fixing Address family mess Message-ID: <20120507043922.GA10874@homeworld.netbsd.org> Hi Quick summary of the problem: when using transport-type socket with transport.address-family unspecified, glusterfs binds sockets with AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the kernel prefers. At mine it uses AF_INET6, while the machine is not configured to use IPv6. As a result, glusterfs client cannot connect to glusterfs server. A workaround is to use option transport.address-family inet in glusterfsd/glusterd.vol but that option must also be specified in all volume files for all bricks and FUSE client, which is unfortunate because they are automatically generated. I proposed a patch so that glusterd transport.address-family setting is propagated to various places: http://review.gluster.com/3261 That did not meet consensus. Jeff Darcy notes that we should be able to listen both on AF_INET and AF_INET6 sockets at the same time. I had a look at the code, and indeed it could easily be done. The only trouble is how to specify the listeners. For now option transport defaults to socket,rdma. I suggest we add socket families in that specification. 
We would then have this default: option transport socket/inet,socket/inet6,rdma With the following semantics: socket -> AF_UNSPEC socket (backward comaptibility) socket/inet -> AF_INET socket socket/inet6 -> AF_INET6 socket socket/sdp -> AF_SDP socket rdma -> sameas before Any opinion on that plan? Please comment before I writa code, it will save me some time is the proposal is wrong. -- Emmanuel Dreyfus manu at netbsd.org From xhernandez at datalab.es Mon May 7 08:07:52 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 07 May 2012 10:07:52 +0200 Subject: [Gluster-devel] Some questions about requisites of translators In-Reply-To: References: <4FA112A9.1080101@datalab.es> Message-ID: <4FA782D8.2000100@datalab.es> On 05/05/2012 08:02 AM, Anand Avati wrote: > > > On Wed, May 2, 2012 at 3:55 AM, Xavier Hernandez > > wrote: > > Hello, > > I'm wondering if there are any requisites that translators must > satisfy to work correctly inside glusterfs. > > In particular I need to know two things: > > 1. Are translators required to respect the order in which they > receive the requests ? > > This is specially important in translators such as > performance/io-threads or caching ones. It seems that these > translators can reorder requests. If this is the case, is there > any way to force some order between requests ? can inodelk/entrylk > be used to force the order ? > > > Translators are not expected to maintain ordering of requests. The > only translator which takes care of ordering calls is write-behind. > After acknowledging back write requests it has to make sure future > requests see the true "effect" as though the previous write actually > completed. To that end, it queues future "dependent" requests till the > write acknowledgement is received from the server. > > inodelk/entrylk calls help achieve synchronization among clients (by > getting into a critical section) - just like a mutex. It is an > arbitrator. It does not help for ordering of two calls. If one call > must strictly complete after another call from your translator's point > of view (i.e, if it has such a requirement), then the latter call's > STACK_WIND must happen in the callback of the former's STACK_UNWIND > path. There are no guarantees maintained by the system to ensure that > a second STACK_WIND issued right after a first STACK_WIND will > complete and callback in the same order. Write-behind does all its > ordering gimmicks only because it STACK_UNWINDs a write call > prematurely and therefore must maintain the causal effects by means of > queueing new requests behind the downcall towards the server. Good to know > 2. Are translators required to propagate callback arguments even > if the result of the operation is an error ? and if an internal > translator error occurs ? > > > Usually no. If op_ret is -1, only op_errno is expected to be a usable > value. Rest of the callback parameters are junk. > > When a translator has multiple subvolumes, I've seen that some > arguments, such as xdata, are replaced with NULL. This can be > understood, but are regular translators (those that only have one > subvolume) allowed to do that or must they preserve the value of > xdata, even in the case of an internal error ? > > > It is best to preserve the arguments unless you know specifically what > you are doing. In case of error, all the non-op_{ret,errno} arguments > are typically junk, including xdata. > > If this is not a requisite, xdata loses it's function of > delivering back extra information. > > > Can you explain? 
Are you seeing a use case for having a valid xdata in > the callback even with op_ret == -1? > As a part of a translator that I'm developing that works with multiple subvolumes, I need to implement some healing support to mantain data coherency (similar to AFR). After some thought, I decided that it could be advantageous to use a dedicated healing translator located near the bottom of the translators stack on the servers. This translator won't work by itself, it only adds support to be used by a higher level translator, which have to manage the logic of the healing and decide when a node needs to be healed. To do this, sometimes I need to return an error because an operation cannot be completed due to some condition related with healing itself (not with the underlying storage). However I need to send some specific healing information to let the upper translator know how it has to handle the detected condition. I cannot send a success answer because intermediate translators could take the fake data as valid and they could begin to operate incorrectly or even create inconsistencies. The other alternative is to use op_errno to encode the extra data, but this will also be difficult, even impossible in some cases, due to the amount of data and the complexity to combine it with an error code without mislead intermediate translators with strange or invalid error codes. I talked with John Mark about this translator and he suggested me to discuss it over the list. Therefore I'll initiate another thread to expose in more detail how it works and I would appreciate very much your opinion, and that of the other developers, about it. Especially if it can really be faster/safer that other solutions or not, or if you find any problem or have any suggestion to improve it. I think it could also be used by AFR and any future translator that may need some healing capabilities. Thank you very much, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From vijay at build.gluster.com Mon May 7 08:15:50 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Mon, 7 May 2012 01:15:50 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa40 released Message-ID: <20120507081553.5AA00100C5@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa40/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz This release is made off v3.3.0qa40 From vijay at gluster.com Mon May 7 10:31:09 2012 From: vijay at gluster.com (Vijay Bellur) Date: Mon, 07 May 2012 16:01:09 +0530 Subject: [Gluster-devel] glusterfs-3.3.0qa40 released In-Reply-To: <20120507081553.5AA00100C5@build.gluster.com> References: <20120507081553.5AA00100C5@build.gluster.com> Message-ID: <4FA7A46D.2050506@gluster.com> This release is done by reverting commit 7d0397c2144810c8a396e00187a6617873c94002 as replace-brick and quota were not functioning with that commit. Hence the tag for this qa release would not be available in github. If you are interested in creating an equivalent of this qa release from git, it would be c4dadc74fd1d1188f123eae7f2b6d6f5232e2a0f - commit 7d0397c2144810c8a396e00187a6617873c94002. 
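In other words, something along these lines should give an equivalent tree; the git://git.gluster.com/glusterfs.git URL is the public repository printed in backtraces elsewhere in this archive, and the revert step mirrors how the release was produced.

    git clone git://git.gluster.com/glusterfs.git
    cd glusterfs
    git checkout c4dadc74fd1d1188f123eae7f2b6d6f5232e2a0f
    git revert --no-edit 7d0397c2144810c8a396e00187a6617873c94002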
Thanks, Vijay On 05/07/2012 01:45 PM, Vijay Bellur wrote: > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa40/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz > > This release is made off v3.3.0qa40 > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From jdarcy at redhat.com Mon May 7 13:16:38 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 09:16:38 -0400 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: <20120507043922.GA10874@homeworld.netbsd.org> References: <20120507043922.GA10874@homeworld.netbsd.org> Message-ID: <4FA7CB36.6040701@redhat.com> On 05/07/2012 12:39 AM, Emmanuel Dreyfus wrote: > Quick summary of the problem: when using transport-type socket with > transport.address-family unspecified, glusterfs binds sockets with > AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the > kernel prefers. At mine it uses AF_INET6, while the machine is not > configured to use IPv6. As a result, glusterfs client cannot connect > to glusterfs server. > > A workaround is to use option transport.address-family inet in > glusterfsd/glusterd.vol but that option must also be specified in > all volume files for all bricks and FUSE client, which is > unfortunate because they are automatically generated. I proposed a > patch so that glusterd transport.address-family setting is propagated > to various places: http://review.gluster.com/3261 > > That did not meet consensus. Jeff Darcy notes that we should be able > to listen both on AF_INET and AF_INET6 sockets at the same time. I > had a look at the code, and indeed it could easily be done. The only > trouble is how to specify the listeners. For now option transport > defaults to socket,rdma. I suggest we add socket families in that > specification. We would then have this default: > option transport socket/inet,socket/inet6,rdma > > With the following semantics: > socket -> AF_UNSPEC socket (backward comaptibility) > socket/inet -> AF_INET socket > socket/inet6 -> AF_INET6 socket > socket/sdp -> AF_SDP socket > rdma -> sameas before > > Any opinion on that plan? Please comment before I writa code, it will > save me some time is the proposal is wrong. I think it looks like the right solution. I understand that keeping the address-family multiplexing entirely in the socket code would be more complex, since it changes the relationship between transport instances and file descriptors (and threads in the SSL/multi-thread case). That's unfortunate, but far from the most unfortunate thing about our transport code. I do wonder whether we should use '/' as the separator, since it kind of implies the same kind of relationships between names and paths that we use for translator names - e.g. cluster/dht is actually used as part of the actual path for dht.so - and in this case that relationship doesn't actually exist. Another idea, which I don't actually like any better but which I'll suggest for completeness, would be to express the list of address families via an option: option transport.socket.address-family inet6 Now that I think about it, another benefit is that it supports multiple instances of the same address family with different options, e.g. to support segregated networks. Obviously we lack higher-level support for that right now, but if that should ever change then it would be nice to have the right low-level infrastructure in place for it. 
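To make the two syntaxes under discussion concrete, here is roughly how they would sit in glusterd.vol. Only the first option (the existing transport.address-family workaround) works today; the commented lines are the proposals being debated, not implemented behaviour.

    volume management
        type mgmt/glusterd
        option working-directory /var/lib/glusterd
        # today's workaround: pin one address family explicitly
        option transport-type socket
        option transport.address-family inet
        # Emmanuel's proposal: encode the family in the transport list
        #   option transport socket/inet,socket/inet6,rdma
        # Jeff's alternative: scope the family per transport instance
        #   option transport.socket.address-family inet6
    end-volume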
From jdarcy at redhat.com Mon May 7 14:43:47 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 10:43:47 -0400 Subject: [Gluster-devel] ZkFarmer Message-ID: <4FA7DFA3.1030300@redhat.com> I've long felt that our ways of dealing with cluster membership and staging of config changes is not quite as robust and scalable as we might want. Accordingly, I spent a bit of time a couple of weeks ago looking into the possibility of using ZooKeeper to do some of this stuff. Yeah, it brings in a heavy Java dependency, but when I looked at some lighter-weight alternatives they all seemed to be lacking in more important ways. Basically the idea was to do this: * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or point everyone at an existing ZooKeeper cluster. * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" merely updates ZK, and "peer status" merely reads from it). * Store config information in ZK *once* instead of regenerating volfiles etc. on every node (and dealing with the ugly cases where a node was down when the config change happened). * Set watches on ZK nodes to be notified when config changes happen, and respond appropriately. I eventually ran out of time and moved on to other things, but this or something like it (e.g. using Riak Core) still seems like a better approach than what we have. In that context, it looks like ZkFarmer[1] might be a big help. AFAICT someone else was trying to solve almost exactly the same kind of server/config problem that we have, and wrapped their solution into a library. Is this a direction other devs might be interested in pursuing some day, if/when time allows? [1] https://github.com/rs/zkfarmer From johnmark at redhat.com Mon May 7 19:35:54 2012 From: johnmark at redhat.com (John Mark Walker) Date: Mon, 07 May 2012 15:35:54 -0400 (EDT) Subject: [Gluster-devel] Server outage - review.gluster.com - please stand by In-Reply-To: <5299ff98-4714-4702-8f26-0d6f62441fe3@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: Greetings, Our iWeb server, which hosts review.gluster.com, is currently down. I have filed an urgent request to reboot the server in question. If you notice anything else working poorly, aside from review.gluster.com, please let me know ASAP. I'll send a note when services are back to normal. -JM From ian.latter at midnightcode.org Mon May 7 22:17:41 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 08:17:41 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205072217.q47MHfmr003867@singularity.tronunltd.com> Is there anything written up on why you/all want every node to be completely conscious of every other node? I could see a couple of architectures that might work better (be more scalable) if the config minutiae were either not necessary to be shared or shared in only cases where the config minutiae were a dependency. RE ZK, I have an issue with it not being a binary at the linux distribution level. This is the reason I don't currently have Gluster's geo replication module in place .. ----- Original Message ----- >From: "Jeff Darcy" >To: >Subject: [Gluster-devel] ZkFarmer >Date: Mon, 07 May 2012 10:43:47 -0400 > > I've long felt that our ways of dealing with cluster membership and staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. 
Yeah, it brings in a > heavy Java dependency, but when I looked at some lighter-weight alternatives > they all seemed to be lacking in more important ways. Basically the idea was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles etc. > on every node (and dealing with the ugly cases where a node was down when the > config change happened). > > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. In that context, it looks like ZkFarmer[1] might be a big > help. AFAICT someone else was trying to solve almost exactly the same kind of > server/config problem that we have, and wrapped their solution into a library. > Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From ab at gluster.com Mon May 7 22:55:22 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 15:55:22 -0700 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: <4FA7CB36.6040701@redhat.com> References: <20120507043922.GA10874@homeworld.netbsd.org> <4FA7CB36.6040701@redhat.com> Message-ID: On Mon, May 7, 2012 at 6:16 AM, Jeff Darcy wrote: > On 05/07/2012 12:39 AM, Emmanuel Dreyfus wrote: >> Quick summary of the problem: when using transport-type socket with >> transport.address-family unspecified, glusterfs binds sockets with >> AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the >> kernel prefers. At mine it uses AF_INET6, while the machine is not >> configured to use IPv6. As a result, glusterfs client cannot connect >> to glusterfs server. >> >> A workaround is to use option transport.address-family inet in >> glusterfsd/glusterd.vol but that option must also be specified in >> all volume files for all bricks and FUSE client, which is >> unfortunate because they are automatically generated. I proposed a >> patch so that glusterd transport.address-family setting is propagated >> to various places: http://review.gluster.com/3261 >> >> That did not meet consensus. Jeff Darcy notes that we should be able >> to listen both on AF_INET and AF_INET6 sockets at the same time. I >> had a look at the code, and indeed it could easily be done. The only >> trouble is how to specify the listeners. For now option transport >> defaults to socket,rdma. I suggest we add socket families in that >> specification. We would then have this default: >> ? ?option transport socket/inet,socket/inet6,rdma >> >> With the following semantics: >> ? ?socket -> AF_UNSPEC socket (backward comaptibility) >> ? ?socket/inet -> AF_INET socket >> ? ?socket/inet6 -> AF_INET6 socket >> ? ?socket/sdp -> AF_SDP socket >> ? ?rdma -> sameas before >> >> Any opinion on that plan? Please comment before I writa code, it will >> save me some time is the proposal is wrong. 
> > I think it looks like the right solution. I understand that keeping the > address-family multiplexing entirely in the socket code would be more complex, > since it changes the relationship between transport instances and file > descriptors (and threads in the SSL/multi-thread case). ?That's unfortunate, > but far from the most unfortunate thing about our transport code. > > I do wonder whether we should use '/' as the separator, since it kind of > implies the same kind of relationships between names and paths that we use for > translator names - e.g. cluster/dht is actually used as part of the actual path > for dht.so - and in this case that relationship doesn't actually exist. Another > idea, which I don't actually like any better but which I'll suggest for > completeness, would be to express the list of address families via an option: > > ? ? ? ?option transport.socket.address-family inet6 > > Now that I think about it, another benefit is that it supports multiple > instances of the same address family with different options, e.g. to support > segregated networks. ?Obviously we lack higher-level support for that right > now, but if that should ever change then it would be nice to have the right > low-level infrastructure in place for it. > Yes this should be controlled through volume options. "transport.address-family" is the right place to set it. Possible values are "inet, inet6, unix, inet-sdp". I would have named those user facing options as "ipv4, ipv6, sdp, all". If transport.address-family is not set. then if remote-host is set default to AF_INET (ipv4) if if transport.socket.connect-path is set default to AF_UNIX (unix) AF_UNSPEC is should be be taken as IPv4/IPv6. It is named appropriately. Default should be ipv4. I have not tested the patch. It is simply to explain how the changes should look like. I ignored legacy translators. When we implement concurrent support for multiple address-family (likely via mult-process model) we can worry about combinations. I agree. Combinations should look like "inet | inet6 | .." and not "inet / inet6 /.." -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: glusterfs-af-default-ipv4.diff Type: application/octet-stream Size: 9194 bytes Desc: not available URL: From jdarcy at redhat.com Tue May 8 00:43:31 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 20:43:31 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205072217.q47MHfmr003867@singularity.tronunltd.com> References: <201205072217.q47MHfmr003867@singularity.tronunltd.com> Message-ID: <4FA86C33.6020901@redhat.com> On 05/07/2012 06:17 PM, Ian Latter wrote: > Is there anything written up on why you/all want every > node to be completely conscious of every other node? > > I could see a couple of architectures that might work > better (be more scalable) if the config minutiae were > either not necessary to be shared or shared in only > cases where the config minutiae were a dependency. Well, these aren't exactly minutiae. Everything at file or directory level is fully distributed and will remain so. We're talking only about stuff at the volume or server level, which is very little data but very broad in scope. Trying to segregate that only adds complexity and subtracts convenience, compared to having it equally accessible to (or through) any server. 
> RE ZK, I have an issue with it not being a binary at > the linux distribution level. This is the reason I don't > currently have Gluster's geo replication module in > place .. What exactly is your objection to interpreted or JIT compiled languages? Performance? Security? It's an unusual position, to say the least. From glusterdevel at louiszuckerman.com Tue May 8 03:52:02 2012 From: glusterdevel at louiszuckerman.com (Louis Zuckerman) Date: Mon, 7 May 2012 23:52:02 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <4FA7DFA3.1030300@redhat.com> References: <4FA7DFA3.1030300@redhat.com> Message-ID: Here's another ZooKeeper management framework that may be useful. It's called Curator, developed by Netflix, and recently released as open source. It probably has a bit more inertia than ZkFarmer too. http://techblog.netflix.com/2011/11/introducing-curator-netflix-zookeeper.html https://github.com/Netflix/curator HTH -louis On Mon, May 7, 2012 at 10:43 AM, Jeff Darcy wrote: > I've long felt that our ways of dealing with cluster membership and > staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. Yeah, it brings > in a > heavy Java dependency, but when I looked at some lighter-weight > alternatives > they all seemed to be lacking in more important ways. Basically the idea > was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, > or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles > etc. > on every node (and dealing with the ugly cases where a node was down when > the > config change happened). > > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. In that context, it looks like ZkFarmer[1] might be a > big > help. AFAICT someone else was trying to solve almost exactly the same > kind of > server/config problem that we have, and wrapped their solution into a > library. > Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Tue May 8 04:27:24 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 14:27:24 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205080427.q484RO09004857@singularity.tronunltd.com> > > Is there anything written up on why you/all want every > > node to be completely conscious of every other node? > > > > I could see a couple of architectures that might work > > better (be more scalable) if the config minutiae were > > either not necessary to be shared or shared in only > > cases where the config minutiae were a dependency. > > Well, these aren't exactly minutiae. Everything at file or directory level is > fully distributed and will remain so. 
We're talking only about stuff at the > volume or server level, which is very little data but very broad in scope. > Trying to segregate that only adds complexity and subtracts convenience, > compared to having it equally accessible to (or through) any server. Sorry, I didn't have time this morning to add more detail. Note that my concern isn't bandwidth, its flexibility; the less knowledge needed the more I can do crazy things in user land, like running boxes in different data centres and randomly power things up and down, randomly re- address, randomly replace in-box hardware, load balance, NAT, etc. It makes a dynamic environment difficult to construct, for example, when Gluster rejects the same volume-id being presented to an existing cluster from a new GFID. But there's no need to go even that complicated, let me pull out an example of where shared knowledge may be unnecessary; The work that I was doing in Gluster (pre glusterd) drove out one primary "server" which fronted a Replicate volume of both its own Distribute volume and that of another server or two - themselves serving a single Distribute volume. So the client connected to one server for one volume and the rest was black box / magic (from the client's perspective - big fast storage in many locations); in that case it could be said that servers needed some shared knowledge, while the clients didn't. The equivalent configuration in a glusterd world (from my experiments) pushed all of the distribute knowledge out to the client and I haven't had a response as to how to add a replicate on distributed volumes in this model, so I've lost replicate. But in this world, the client must know about everything and the server is simply a set of served/presented disks (as volumes). In this glusterd world, then, why does any server need to know of any other server, if the clients are doing all of the heavy lifting? The additional consideration is where the server both consumes and presents, but this would be captured in the client side view. i.e. given where glusterd seems to be driving, this knowledge seems to be needed on the client side (within glusterfs, not glusterfsd). To my mind this breaks the gluster architecture that I read about 2009, but I need to stress that I didn't get a reply to the glusterd architecture question that I posted about a month ago; so I don't know if glusterd is currently limiting deployment options because; - there is an intention to drive the heavy lifting to the client (for example for performance reasons in big deployments), or; - there are known limitations in the existing bricks/ modules (for example moving files thru distribute), or; - there is ultimately (long term) more flexibility seen in this model (and we're at a midway point between pre glusterd and post so it doesn't feel that way yet), or; - there is an intent to drive out a particular market outcome or match an existing storage model (the gluster presentation was driving towards cloud, and maybe those vendors don't use server side implementations), etc. As I don't have a clear/big picture in my mind; if I'm not considering all of the impacts, then my apologies. > > RE ZK, I have an issue with it not being a binary at > > the linux distribution level. This is the reason I don't > > currently have Gluster's geo replication module in > > place .. > > What exactly is your objection to interpreted or JIT compiled languages? > Performance? Security? It's an unusual position, to say the least. > Specifically, primarily, space. 
Saturn builds GlusterFS capacity from a 48 Megabyte Linux distribution and adding many Megabytes of Perl and/or Python and/or PHP and/or Java for a single script is impractical. My secondary concern is licensing (specifically in the Java run-time environment case). Hadoop forced my hand; GNU's JRE/compiler wasn't up to the task of running Hadoop when I last looked at it (about 2 or 3 years ago now) - well, it could run a 2007 or so version but not current ones at that time - so now I work with Gluster .. Going back to ZkFarmer; Considering other architectures; it depends on how you slice and dice the problem as to how much external support you need; > I've long felt that our ways of dealing with cluster > membership and staging of config changes is not > quite as robust and scalable as we might want. By way of example; The openMosix kernel extensions maintained their own information exchange between cluster nodes; if a node (ip) was added via the /proc interface, it was "in" the cluster. Therefore cluster membership was the hand-off/interface. It could be as simple as a text list on each node, or it could be left to a user space daemon which could then gate cluster membership - this suited everyone with a small cluster. The native daemon (omdiscd) used multicast packets to find nodes and then stuff those IP's into the /proc interface - this suited everyone with a private/dedicated cluster. A colleague and I wrote a TCP variation to allow multi-site discovery with SSH public key exchanges and IPSEC tunnel establishment as part of the gating process - this suited those with a distributed/ part-time cluster. To ZooKeeper's point (http://zookeeper.apache.org/), the discovery protocol that we created was weak and I've since found a model/algorithm that allows for far more robust discovery. The point being that, depending on the final cluster architecture for gluster (i.e. all are nodes are peers and thus all are cluster members, nodes are client or server and both are cluster members, nodes are client or server and only clients [or servers] are cluster members, etc) there may be simpler cluster management options .. Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From ab at gluster.com Tue May 8 04:33:50 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 21:33:50 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <4FA7DFA3.1030300@redhat.com> References: <4FA7DFA3.1030300@redhat.com> Message-ID: On Mon, May 7, 2012 at 7:43 AM, Jeff Darcy wrote: > I've long felt that our ways of dealing with cluster membership and staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. ?Yeah, it brings in a > heavy Java dependency, but when I looked at some lighter-weight alternatives > they all seemed to be lacking in more important ways. ?Basically the idea was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles etc. > on every node (and dealing with the ugly cases where a node was down when the > config change happened). 
> > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. ?In that context, it looks like ZkFarmer[1] might be a big > help. ?AFAICT someone else was trying to solve almost exactly the same kind of > server/config problem that we have, and wrapped their solution into a library. > ?Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer Real issue is here is: GlusterFS is a fully distributed system. It is OK for config files to be in one place (centralized). It is easier to manage and backup. Avati still claims that making distributed copies are not a problem (volume operations are fast, versioned and checksumed). Also the code base for replicating 3 way or all-node is same. We all need to come to agreement on the demerits of replicating the volume spec on every node. If we are convinced to keep the config info in one place, ZK is certainly one a good idea. I personally hate Java dependency. I still struggle with Java dependencies for browser and clojure. I can digest that if we are going to adopt Java over Python for future external modules. Alternatively we can also look at creating a replicated meta system volume. What ever we adopt, we should keep dependencies and installation steps to the bare minimum and simple. -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein From ab at gluster.com Tue May 8 04:56:10 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 21:56:10 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205080427.q484RO09004857@singularity.tronunltd.com> References: <201205080427.q484RO09004857@singularity.tronunltd.com> Message-ID: On Mon, May 7, 2012 at 9:27 PM, Ian Latter wrote: > >> > Is there anything written up on why you/all want every >> > node to be completely conscious of every other node? >> > >> > I could see a couple of architectures that might work >> > better (be more scalable) if the config minutiae were >> > either not necessary to be shared or shared in only >> > cases where the config minutiae were a dependency. >> >> Well, these aren't exactly minutiae. ?Everything at file > or directory level is >> fully distributed and will remain so. ?We're talking only > about stuff at the >> volume or server level, which is very little data but very > broad in scope. >> Trying to segregate that only adds complexity and > subtracts convenience, >> compared to having it equally accessible to (or through) > any server. > > Sorry, I didn't have time this morning to add more detail. > > Note that my concern isn't bandwidth, its flexibility; the > less knowledge needed the more I can do crazy things > in user land, like running boxes in different data centres > and randomly power things up and down, randomly re- > address, randomly replace in-box hardware, load > balance, NAT, etc. ?It makes a dynamic environment > difficult to construct, for example, when Gluster rejects > the same volume-id being presented to an existing > cluster from a new GFID. 
> > But there's no need to go even that complicated, let > me pull out an example of where shared knowledge > may be unnecessary; > > The work that I was doing in Gluster (pre glusterd) drove > out one primary "server" which fronted a Replicate > volume of both its own Distribute volume and that of > another server or two - themselves serving a single > Distribute volume. ?So the client connected to one > server for one volume and the rest was black box / > magic (from the client's perspective - big fast storage > in many locations); in that case it could be said that > servers needed some shared knowledge, while the > clients didn't. > > The equivalent configuration in a glusterd world (from > my experiments) pushed all of the distribute knowledge > out to the client and I haven't had a response as to how > to add a replicate on distributed volumes in this model, > so I've lost replicate. ?But in this world, the client must > know about everything and the server is simply a set > of served/presented disks (as volumes). ?In this > glusterd world, then, why does any server need to > know of any other server, if the clients are doing all of > the heavy lifting? > > The additional consideration is where the server both > consumes and presents, but this would be captured in > the client side view. ?i.e. given where glusterd seems > to be driving, this knowledge seems to be needed on > the client side (within glusterfs, not glusterfsd). > > To my mind this breaks the gluster architecture that I > read about 2009, but I need to stress that I didn't get > a reply to the glusterd architecture question that I > posted about a month ago; ?so I don't know if glusterd > is currently limiting deployment options because; > ?- there is an intention to drive the heavy lifting to the > ? ?client (for example for performance reasons in big > ? ?deployments), or; > ?- there are known limitations in the existing bricks/ > ? ?modules (for example moving files thru distribute), > ? ?or; > ?- there is ultimately (long term) more flexibility seen > ? ?in this model (and we're at a midway point between > ? ?pre glusterd and post so it doesn't feel that way > ? ?yet), or; > ?- there is an intent to drive out a particular market > ? ?outcome or match an existing storage model (the > ? ?gluster presentation was driving towards cloud, > ? ?and maybe those vendors don't use server side > ? ?implementations), etc. > > As I don't have a clear/big picture in my mind; if I'm > not considering all of the impacts, then my apologies. > > >> > RE ZK, I have an issue with it not being a binary at >> > the linux distribution level. ?This is the reason I don't >> > currently have Gluster's geo replication module in >> > place .. >> >> What exactly is your objection to interpreted or JIT > compiled languages? >> Performance? ?Security? ?It's an unusual position, to say > the least. >> > > Specifically, primarily, space. ?Saturn builds GlusterFS > capacity from a 48 Megabyte Linux distribution and > adding many Megabytes of Perl and/or Python and/or > PHP and/or Java for a single script is impractical. > > My secondary concern is licensing (specifically in the > Java run-time environment case). ?Hadoop forced my > hand; GNU's JRE/compiler wasn't up to the task of > running Hadoop when I last looked at it (about 2 or 3 > years ago now) - well, it could run a 2007 or so > version but not current ones at that time - so now I > work with Gluster .. 
> > > > Going back to ZkFarmer; > > Considering other architectures; it depends on how > you slice and dice the problem as to how much > external support you need; > ?> I've long felt that our ways of dealing with cluster > ?> membership and staging of config changes is not > ?> quite as robust and scalable as we might want. > > By way of example; > ?The openMosix kernel extensions maintained their > own information exchange between cluster nodes; if > a node (ip) was added via the /proc interface, it was > "in" the cluster. ?Therefore cluster membership was > the hand-off/interface. > ?It could be as simple as a text list on each node, or > it could be left to a user space daemon which could > then gate cluster membership - this suited everyone > with a small cluster. > ?The native daemon (omdiscd) used multicast > packets to find nodes and then stuff those IP's into > the /proc interface - this suited everyone with a > private/dedicated cluster. > ?A colleague and I wrote a TCP variation to allow > multi-site discovery with SSH public key exchanges > and IPSEC tunnel establishment as part of the > gating process - this suited those with a distributed/ > part-time cluster. ?To ZooKeeper's point > (http://zookeeper.apache.org/), the discovery > protocol that we created was weak and I've since > found a model/algorithm that allows for far more > robust discovery. > > ?The point being that, depending on the final cluster > architecture for gluster (i.e. all are nodes are peers > and thus all are cluster members, nodes are client > or server and both are cluster members, nodes are > client or server and only clients [or servers] are > cluster members, etc) there may be simpler cluster > management options .. > > > Cheers, > Reason to keep the volume spec files on all servers is simply to be fully distributed. No one node or set of nodes should hold the cluster hostage. Code to keep them in sync over 2 nodes or 20 nodes is essentially the same. We are revisiting this situation now because we want to scale to 1000s of nodes potentially. Gluster CLI operations should not time out or slow down. If ZK requires proprietary JRE for stability, Java will be NO NO!. We may not need ZK at all. If we simply decide to centralize the config, GlusterFS has enough code to handle them. Again Avati will argue that it is exactly the same code as now. My point is to keep things simple as we scale. Even if the code base is same, we should still restrict it to N selected nodes. It is matter of adding config option. -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein From ian.latter at midnightcode.org Tue May 8 05:21:37 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 15:21:37 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205080521.q485Lb9d005117@singularity.tronunltd.com> > No one node or set of nodes should hold the > cluster hostage. Agreed - this is fundamental. > We are revisiting this situation now because we > want to scale to 1000s of nodes potentially. Good, I hate upper bounds on architectures :) Though I haven't tested my own implementation, I understand that one implementation of the discovery protocol that I've used, scaled to 20,000 hosts across three sites in two countries; this is the the type of robust outcome that can be manipulated at the macro scale - i.e. without manipulating per-node details. 
> Gluster CLI operations should not time out or > slow down. This is critical - not just the CLI but also the storage interface (in a redundant environment); infrastructure wears and fails, thus failing infrastructure should be regarded as the norm/ default. > If ZK requires proprietary JRE for stability, > Java will be NO NO!. *Fantastic* > My point is to keep things simple as we scale. I couldn't agree more. In that principle I ask that each dependency on cluster knowledge be considered carefully with a minimalist approach. -- Ian Latter Late night coder .. http://midnightcode.org/ From vbellur at redhat.com Tue May 8 09:15:13 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 08 May 2012 14:45:13 +0530 Subject: [Gluster-devel] Server outage - review.gluster.com - please stand by In-Reply-To: References: Message-ID: <4FA8E421.3090108@redhat.com> On 05/08/2012 01:05 AM, John Mark Walker wrote: > Greetings, > > Our iWeb server, which hosts review.gluster.com, is currently down. I have filed an urgent request to reboot the server in question. > > If you notice anything else working poorly, aside from review.gluster.com, please let me know ASAP. > > I'll send a note when services are back to normal. All services are back to normal. Please let us know if you notice any issue. Thanks, Vijay From xhernandez at datalab.es Tue May 8 09:34:35 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 08 May 2012 11:34:35 +0200 Subject: [Gluster-devel] A healing translator Message-ID: <4FA8E8AB.2040604@datalab.es> Hello developers, I would like to expose some ideas we are working on to create a new kind of translator that should be able to unify and simplify to some extent the healing procedures of complex translators. Currently, the only translator with complex healing capabilities that we are aware of is AFR. We are developing another translator that will also need healing capabilities, so we thought that it would be interesting to create a new translator able to handle the common part of the healing process and hence to simplify and avoid duplicated code in other translators. The basic idea of the new translator is to handle healing tasks nearer the storage translator on the server nodes instead to control everything from a translator on the client nodes. Of course the heal translator is not able to handle healing entirely by itself, it needs a client translator which will coordinate all tasks. The heal translator is intended to be used by translators that work with multiple subvolumes. I will try to explain how it works without entering into too much details. There is an important requisite for all client translators that use healing: they must have exactly the same list of subvolumes and in the same order. Currently, I think this is not a problem. The heal translator treats each file as an independent entity, and each one can be in 3 modes: 1. Normal mode This is the normal mode for a copy or fragment of a file when it is synchronized and consistent with the same file on other nodes (for example with other replicas. It is the client translator who decides if it is synchronized or not). 2. Healing mode This is the mode used when a client detects an inconsistency in the copy or fragment of the file stored on this node and initiates the healing procedures. 3. 
Provider mode (I don't like very much this name, though) This is the mode used by client translators when an inconsistency is detected in this file, but the copy or fragment stored in this node is considered good and it will be used as a source to repair the contents of this file on other nodes. Initially, when a file is created, it is set in normal mode. Client translators that make changes must guarantee that they send the modification requests in the same order to all the servers. This should be done using inodelk/entrylk. When a change is sent to a server, the client must include a bitmap mask of the clients to which the request is being sent. Normally this is a bitmap containing all the clients, however, when a server fails for some reason some bits will be cleared. The heal translator uses this bitmap to early detect failures on other nodes from the point of view of each client. When this condition is detected, the request is aborted with an error and the client is notified with the remaining list of valid nodes. If the client considers the request can be successfully server with the remaining list of nodes, it can resend the request with the updated bitmap. The heal translator also updates two file attributes for each change request to mantain the "version" of the data and metadata contents of the file. A similar task is currently made by AFR using xattrop. This would not be needed anymore, speeding write requests. The version of data and metadata is returned to the client for each read request, allowing it to detect inconsistent data. When a client detects an inconsistency, it initiates healing. First of all, it must lock the entry and inode (when necessary). Then, from the data collected from each node, it must decide which nodes have good data and which ones have bad data and hence need to be healed. There are two possible cases: 1. File is not a regular file In this case the reconstruction is very fast and requires few requests, so it is done while the file is locked. In this case, the heal translator does nothing relevant. 2. File is a regular file For regular files, the first step is to synchronize the metadata to the bad nodes, including the version information. Once this is done, the file is set in healing mode on bad nodes, and provider mode on good nodes. Then the entry and inode are unlocked. When a file is in provider mode, it works as in normal mode, but refuses to start another healing. Only one client can be healing a file. When a file is in healing mode, each normal write request from any client are handled as if the file were in normal mode, updating the version information and detecting possible inconsistencies with the bitmap. Additionally, the healing translator marks the written region of the file as "good". Each write request from the healing client intended to repair the file must be marked with a special flag. In this case, the area that wants to be written is filtered by the list of "good" ranges (if there are any intersection with a good range, it is removed from the request). The resulting set of ranges are propagated to the lower translator and added to the list of "good" ranges but the version information is not updated. Read requests are only served if the range requested is entirely contained into the "good" regions list. There are some additional details, but I think this is enough to have a general idea of its purpose and how it works. The main advantages of this translator are: 1. Avoid duplicated code in client translators 2. 
Simplify and unify healing methods in client translators 3. xattrop is not needed anymore in client translators to keep track of changes 4. Full file contents are repaired without locking the file 5. Better detection and prevention of some split brain situations as soon as possible I think it would be very useful. It seems to me that it works correctly in all situations, however I don't have all the experience that other developers have with the healing functions of AFR, so I will be happy to answer any question or suggestion to solve problems it may have or to improve it. What do you think about it ? Thank you, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From jdarcy at redhat.com Tue May 8 12:57:31 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Tue, 08 May 2012 08:57:31 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: <4FA9183B.5080708@redhat.com> On 05/08/2012 12:33 AM, Anand Babu Periasamy wrote: > Real issue is here is: GlusterFS is a fully distributed system. It is > OK for config files to be in one place (centralized). It is easier to > manage and backup. Avati still claims that making distributed copies > are not a problem (volume operations are fast, versioned and > checksumed). It's also grossly inefficient at 100-node scale. I'll also need some convincing before I believe that nodes which are down during a config change will catch up automatically and reliably in all cases. I think this is even more of an issue with membership than with config data. All-to-all pings are just not acceptable at 100-node or greater scale. We need something better, and more importantly designing cluster membership protocols is just not a business we should even be in. We shouldn't be devoting our own time to that when we can just use something designed by people who have that as their focus. > Also the code base for replicating 3 way or all-node is > same. We all need to come to agreement on the demerits of replicating > the volume spec on every node. It's somewhat similar to how we replicate data - we need enough copies to survive a certain number of anticipated failures. > If we are convinced to keep the config info in one place, ZK is > certainly one a good idea. I personally hate Java dependency. I still > struggle with Java dependencies for browser and clojure. I can digest > that if we are going to adopt Java over Python for future external > modules. Alternatively we can also look at creating a replicated meta > system volume. What ever we adopt, we should keep dependencies and > installation steps to the bare minimum and simple. I personally hate the Java dependency too. I'd much rather have something in C/Go/Python/Erlang but couldn't find anything that had the same (useful) feature set. I also considered the idea of storing config in a hand-crafted GlusterFS volume, using our own mechanisms for distributing/finding and replicating data. That's at least an area where we can claim some expertise. Such layering does create a few interesting issues, but nothing intractable. The big drawback is that it only solves the config-data problem; a solution which combines that with cluster membership is IMO preferable. The development drag of having to maintain that functionality ourselves, and hook every new feature into the not-very-convenient APIs that have predictably resulted, is considerable. 
From jdarcy at redhat.com Tue May 8 12:42:19 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Tue, 08 May 2012 08:42:19 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205080427.q484RO09004857@singularity.tronunltd.com> References: <201205080427.q484RO09004857@singularity.tronunltd.com> Message-ID: <4FA914AB.8030209@redhat.com> On 05/08/2012 12:27 AM, Ian Latter wrote: > The equivalent configuration in a glusterd world (from > my experiments) pushed all of the distribute knowledge > out to the client and I haven't had a response as to how > to add a replicate on distributed volumes in this model, > so I've lost replicate. This doesn't seem to be a problem with replicate-first vs. distribute-first, but with client-side vs. server-side deployment of those translators. You *can* construct your own volfiles that do these things on the servers. It will work, but you won't get a lot of support for it. The issue here is that we have only a finite number of developers, and a near-infinite number of configurations. We can't properly qualify everything. One way we've tried to limit that space is by preferring distribute over replicate, because replicate does a better job of shielding distribute from brick failures than vice versa. Another is to deploy both on the clients, following the scalability rule of pushing effort to the most numerous components. The code can support other arrangements, but the people might not. BTW, a similar concern exists with respect to replication (i.e. AFR) across data centers. Performance is going to be bad, and there's not going to be much we can do about it. > But in this world, the client must > know about everything and the server is simply a set > of served/presented disks (as volumes). In this > glusterd world, then, why does any server need to > know of any other server, if the clients are doing all of > the heavy lifting? First, because config changes have to apply across servers. Second, because server machines often spin up client processes for things like repair or rebalance. From ian.latter at midnightcode.org Tue May 8 23:08:32 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 09:08:32 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205082308.q48N8WQg008425@singularity.tronunltd.com> > On 05/08/2012 12:27 AM, Ian Latter wrote: > > The equivalent configuration in a glusterd world (from > > my experiments) pushed all of the distribute knowledge > > out to the client and I haven't had a response as to how > > to add a replicate on distributed volumes in this model, > > so I've lost replicate. > > This doesn't seem to be a problem with replicate-first vs. distribute-first, > but with client-side vs. server-side deployment of those translators. You > *can* construct your own volfiles that do these things on the servers. It will > work, but you won't get a lot of support for it. The issue here is that we > have only a finite number of developers, and a near-infinite number of > configurations. We can't properly qualify everything. One way we've tried to > limit that space is by preferring distribute over replicate, because replicate > does a better job of shielding distribute from brick failures than vice versa. > Another is to deploy both on the clients, following the scalability rule of > pushing effort to the most numerous components. The code can support other > arrangements, but the people might not. 
Sure, I have my own vol files that do (did) what I wanted and I was supporting myself (and users); the question (and the point) is what is the GlusterFS *intent*? I'll write an rsyncd wrapper myself, to run on top of Gluster, if the intent is not allow the configuration I'm after (arbitrary number of disks in one multi-host environment replicated to an arbitrary number of disks in another multi-host environment, where ideally each environment need not sum to the same data capacity, presented in a single contiguous consumable storage layer to an arbitrary number of unintelligent clients, that is as fault tolerant as I choose it to be including the ability to add and offline/online and remove storage as I so choose) .. or switch out the whole solution if Gluster is heading away from my needs. I just need to know what the direction is .. I may even be able to help get you there if you tell me :) > BTW, a similar concern exists with respect to replication (i.e. AFR) across > data centers. Performance is going to be bad, and there's not going to be much > we can do about it. Hmm .. that depends .. these sorts of statements need context/qualification (in bandwidth and latency terms). For example the last multi-site environment that I did architecture for was two DCs set 32kms apart with a redundant 20Gbps layer-2 (ethernet) stretch between them - latency was 1ms average, 2ms max (the fiber actually took a 70km path). Didn't run Gluster on it, but we did stretch a number things that "couldn't" be stretched. > > But in this world, the client must > > know about everything and the server is simply a set > > of served/presented disks (as volumes). In this > > glusterd world, then, why does any server need to > > know of any other server, if the clients are doing all of > > the heavy lifting? > > First, because config changes have to apply across servers. Second, because > server machines often spin up client processes for things like repair or > rebalance. Yep, but my reading is that the config's that the servers need are local - to make a disk a share (volume), and that as you've described the rest are "client processes" (even when on something built as a "server"), so if you catered for all clients then you'd be set? I.e. AFR now runs in the client? And I am sick of the word-wrap on this client .. I think you've finally convinced me to fix it ... what's normal these days - still 80 chars? -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Wed May 9 00:57:49 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 17:57:49 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205082308.q48N8WQg008425@singularity.tronunltd.com> References: <201205082308.q48N8WQg008425@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 4:08 PM, Ian Latter wrote: > > On 05/08/2012 12:27 AM, Ian Latter wrote: > > > The equivalent configuration in a glusterd world (from > > > my experiments) pushed all of the distribute knowledge > > > out to the client and I haven't had a response as to how > > > to add a replicate on distributed volumes in this model, > > > so I've lost replicate. > > > > This doesn't seem to be a problem with replicate-first vs. > distribute-first, > > but with client-side vs. server-side deployment of those > translators. You > > *can* construct your own volfiles that do these things on > the servers. It will > > work, but you won't get a lot of support for it. 
The > issue here is that we > > have only a finite number of developers, and a > near-infinite number of > > configurations. We can't properly qualify everything. > One way we've tried to > > limit that space is by preferring distribute over > replicate, because replicate > > does a better job of shielding distribute from brick > failures than vice versa. > > Another is to deploy both on the clients, following the > scalability rule of > > pushing effort to the most numerous components. The code > can support other > > arrangements, but the people might not. > > Sure, I have my own vol files that do (did) what I wanted > and I was supporting myself (and users); the question > (and the point) is what is the GlusterFS *intent*? The "intent" (more or less - I hate to use the word as it can imply a commitment to what I am about to say, but there isn't one) is to keep the bricks (server process) dumb and have the intelligence on the client side. This is a "rough goal". There are cases where replication on the server side is inevitable (in the case of NFS access) but we keep the software architecture undisturbed by running a client process on the server machine to achieve it. We do plan to support "replication on the server" in the future while still retaining the existing software architecture as much as possible. This is particularly useful in Hadoop environment where the jobs expect write performance of a single copy and expect copy to happen in the background. We have the proactive self-heal daemon running on the server machines now (which again is a client process which happens to be physically placed on the server) which gives us many interesting possibilities - i.e, with simple changes where we fool the client side replicate translator at the time of transaction initiation that only the closest server is up at that point of time and write to it alone, and have the proactive self-heal daemon perform the extra copies in the background. This would be consistent with other readers as they get directed to the "right" version of the file by inspecting the changelogs while the background replication is in progress. The intention of the above example is to give a general sense of how we want to evolve the architecture (i.e, the "intention" you were referring to) - keep the clients intelligent and servers dumb. If some intelligence needs to be built on the physical server, tackle it by loading a client process there (there are also "pathinfo xattr" kind of internal techniques to figure out locality of the clients in a generic way without bringing "server sidedness" into them in a harsh way) I'll > write an rsyncd wrapper myself, to run on top of Gluster, > if the intent is not allow the configuration I'm after > (arbitrary number of disks in one multi-host environment > replicated to an arbitrary number of disks in another > multi-host environment, where ideally each environment > need not sum to the same data capacity, presented in a > single contiguous consumable storage layer to an > arbitrary number of unintelligent clients, that is as fault > tolerant as I choose it to be including the ability to add > and offline/online and remove storage as I so choose) .. > or switch out the whole solution if Gluster is heading > away from my needs. I just need to know what the > direction is .. I may even be able to help get you there if > you tell me :) > > There are good and bad in both styles (distribute on top v/s replicate on top). Replicate on top gives you much better flexibility of configuration. 
Distribute on top is easier for us developers. As a user I would like replicate on top as well. But the problem today is that replicate (and self-heal) does not understand "partial failure" of its subvolumes. If one of the subvolume of replicate is a distribute, then today's replicate only understands complete failure of the distribute set or it assumes everything is completely fine. An example is self-healing of directory entries. If a file is "missing" in one subvolume because a distribute node is temporarily down, replicate has no clue why it is missing (or that it should keep away from attempting to self-heal). Along the same lines, it does not know that once a server is taken off from its distribute subvolume for good that it needs to start recreating missing files. The effort to fix this seems to be big enough to disturb the inertia of status quo. If this is fixed, we can definitely adopt a replicate-on-top mode in glusterd. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From abperiasamy at gmail.com Wed May 9 01:05:37 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Tue, 8 May 2012 18:05:37 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205082308.q48N8WQg008425@singularity.tronunltd.com> References: <201205082308.q48N8WQg008425@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 4:08 PM, Ian Latter wrote: >> On 05/08/2012 12:27 AM, Ian Latter wrote: >> > The equivalent configuration in a glusterd world (from >> > my experiments) pushed all of the distribute knowledge >> > out to the client and I haven't had a response as to how >> > to add a replicate on distributed volumes in this model, >> > so I've lost replicate. >> >> This doesn't seem to be a problem with replicate-first vs. > distribute-first, >> but with client-side vs. server-side deployment of those > translators. ?You >> *can* construct your own volfiles that do these things on > the servers. ?It will >> work, but you won't get a lot of support for it. ?The > issue here is that we >> have only a finite number of developers, and a > near-infinite number of >> configurations. ?We can't properly qualify everything. > One way we've tried to >> limit that space is by preferring distribute over > replicate, because replicate >> does a better job of shielding distribute from brick > failures than vice versa. >> Another is to deploy both on the clients, following the > scalability rule of >> pushing effort to the most numerous components. ?The code > can support other >> arrangements, but the people might not. > > Sure, I have my own vol files that do (did) what I wanted > and I was supporting myself (and users); the question > (and the point) is what is the GlusterFS *intent*? ?I'll > write an rsyncd wrapper myself, to run on top of Gluster, > if the intent is not allow the configuration I'm after > (arbitrary number of disks in one multi-host environment > replicated to an arbitrary number of disks in another > multi-host environment, where ideally each environment > need not sum to the same data capacity, presented in a > single contiguous consumable storage layer to an > arbitrary number of unintelligent clients, that is as fault > tolerant as I choose it to be including the ability to add > and offline/online and remove storage as I so choose) .. > or switch out the whole solution if Gluster is heading > away from my ?needs. ?I just need to know what the > direction is .. 
I may even be able to help get you there if > you tell me :) Rsync'ing the vol spec files is the simplest and elegant approach. It is how glusterfs originally handled config files. How ever elastic volume management (online volume management operations) requires synchronized online changes to volume spec files. This requires GlusterFS to manage volume specification files internally. That is why we brought glusterd in 3.1. Real question is: do we want to keep the volume spec files on all nodes (fully distributed) or few selected nodes. > >> BTW, a similar concern exists with respect to replication > (i.e. AFR) across >> data centers. ?Performance is going to be bad, and there's > not going to be much >> we can do about it. > > Hmm .. that depends .. these sorts of statements need > context/qualification (in bandwidth and latency terms). ?For > example the last multi-site environment that I did > architecture for was two DCs set 32kms apart with a > redundant 20Gbps layer-2 (ethernet) stretch between > them - latency was 1ms average, 2ms max (the fiber > actually took a 70km path). ?Didn't run Gluster on it, but > we did stretch a number things that "couldn't" be stretched. > > >> > But in this world, the client must >> > know about everything and the server is simply a set >> > of served/presented disks (as volumes). ?In this >> > glusterd world, then, why does any server need to >> > know of any other server, if the clients are doing all of >> > the heavy lifting? >> >> First, because config changes have to apply across > servers. ?Second, because >> server machines often spin up client processes for things > like repair or >> rebalance. > > Yep, but my reading is that the config's that the servers > need are local - to make a disk a share (volume), and > that as you've described the rest are "client processes" > (even when on something built as a "server"), so if you > catered for all clients then you'd be set? ?I.e. AFR now > runs in the client? > > > And I am sick of the word-wrap on this client .. I think > you've finally convinced me to fix it ... what's normal > these days - still 80 chars? I used to line-wrap (gnus and cool emacs extensions). It doesn't make sense to line wrap any more. Let the email client handle it depending on the screen size of the device (mobile / tablet / desktop). -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Wed May 9 01:33:50 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 18:33:50 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: On Mon, May 7, 2012 at 9:33 PM, Anand Babu Periasamy wrote: > On Mon, May 7, 2012 at 7:43 AM, Jeff Darcy wrote: > > I've long felt that our ways of dealing with cluster membership and > staging of > > config changes is not quite as robust and scalable as we might want. > > Accordingly, I spent a bit of time a couple of weeks ago looking into the > > possibility of using ZooKeeper to do some of this stuff. Yeah, it > brings in a > > heavy Java dependency, but when I looked at some lighter-weight > alternatives > > they all seemed to be lacking in more important ways. Basically the > idea was > > to do this: > > > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper > servers, or > > point everyone at an existing ZooKeeper cluster. 
> > > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer > probe" > > merely updates ZK, and "peer status" merely reads from it). > > > > * Store config information in ZK *once* instead of regenerating volfiles > etc. > > on every node (and dealing with the ugly cases where a node was down > when the > > config change happened). > > > > * Set watches on ZK nodes to be notified when config changes happen, and > > respond appropriately. > > > > I eventually ran out of time and moved on to other things, but this or > > something like it (e.g. using Riak Core) still seems like a better > approach > > than what we have. In that context, it looks like ZkFarmer[1] might be > a big > > help. AFAICT someone else was trying to solve almost exactly the same > kind of > > server/config problem that we have, and wrapped their solution into a > library. > > Is this a direction other devs might be interested in pursuing some day, > > if/when time allows? > > > > > > [1] https://github.com/rs/zkfarmer > > Real issue is here is: GlusterFS is a fully distributed system. It is > OK for config files to be in one place (centralized). It is easier to > manage and backup. Avati still claims that making distributed copies > are not a problem (volume operations are fast, versioned and > checksumed). Also the code base for replicating 3 way or all-node is > same. We all need to come to agreement on the demerits of replicating > the volume spec on every node. > My claim is somewhat similar to what you said literally, but slightly different in meaning. What I mean is, while it is true keeping multiple copies of the volfile is more expensive/resource consuming in theory, what is the breaking point in terms of number of servers where it begins to matter? There are trivial (low lying) enhancements which are possible (for e.g, store volfiles of a volume only on participating servers instead of all servers) which could address a class of concerns. There are clear advantages in having volfiles in all the participating nodes at least - it takes away dependency on order of booting of servers in your data centre. If volfiles are available locally you dont have to wait/retry for the "central servers" to come up first. Whether this is volfiles managed by glusterd, or "storage servers" of ZK, it is a big advantage to have the startup of a given server decoupled from the others (of course the coupling comes in at an operational level at the time of volume modifications, but that is much more acceptable). If the storage of volfiles on all servers really seems unnecessary, we should first come up with real hard numbers - number of servers v/s latency of volume operations and then figure out at what point it starts becoming unacceptably slow. Maybe a good solution is to just propagate the volfiles in the background while still retaining version info than introducing a more intrusive change? But we really need the numbers first. > > If we are convinced to keep the config info in one place, ZK is > certainly one a good idea. I personally hate Java dependency. I still > struggle with Java dependencies for browser and clojure. I can digest > that if we are going to adopt Java over Python for future external > modules. Alternatively we can also look at creating a replicated meta > system volume. What ever we adopt, we should keep dependencies and > installation steps to the bare minimum and simple. 
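To make the ephemeral-node idea above concrete, here is a minimal sketch using the ZooKeeper C client (libzookeeper, so no Java in the daemon itself). The znode paths, the server list and the crude connection wait are illustrative assumptions only, not a glusterd proposal:

/* Sketch only: register this peer under an ephemeral znode so that
 * a "peer status" becomes a read of /gluster/peers. Paths are made up. */
#include <stdio.h>
#include <unistd.h>
#include <zookeeper/zookeeper.h>

static void
watcher (zhandle_t *zh, int type, int state, const char *path, void *ctx)
{
        /* a real daemon would re-register on session expiry and re-read
         * the peer list when the children of /gluster/peers change */
}

int
main (int argc, char *argv[])
{
        zhandle_t *zh = NULL;
        char       path[256];
        int        ret = -1;

        if (argc < 2)
                return 1;

        zh = zookeeper_init ("zk1:2181,zk2:2181,zk3:2181", watcher,
                             30000, NULL, NULL, 0);
        if (!zh)
                return 1;

        while (zoo_state (zh) != ZOO_CONNECTED_STATE)
                sleep (1);   /* crude; waiting for the session event is better */

        /* persistent parents; "node exists" errors are ignored here */
        zoo_create (zh, "/gluster", NULL, -1, &ZOO_OPEN_ACL_UNSAFE, 0, NULL, 0);
        zoo_create (zh, "/gluster/peers", NULL, -1, &ZOO_OPEN_ACL_UNSAFE,
                    0, NULL, 0);

        /* ephemeral child: disappears automatically when this peer dies */
        snprintf (path, sizeof (path), "/gluster/peers/%s", argv[1]);
        ret = zoo_create (zh, path, "up", 2, &ZOO_OPEN_ACL_UNSAFE,
                          ZOO_EPHEMERAL, NULL, 0);

        printf ("registered %s: %s\n", path, zerror (ret));
        pause ();            /* hold the session (and the znode) open */
        zookeeper_close (zh);
        return 0;
}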
> > It is true other projects have figured out the problem of membership and configuration management and specialize at doing that. That is very good for the entire computing community as a whole. If there are components we can incorporate and build upon their work, that is very desirable. At the same time we also need to check what other baggage we inherit along with the specialized expertise we take on. One of the biggest strengths of Gluster has been its "lightweight"edness and lack of dependencies - which in turn has driven our adoption significantly which in turn results in higher feedback and bug reports etc. (i.e, it is not an isolated strength in itself). Enforcing a Java dependency down the throat of users who want a simple distributed filesystem (yes, the moment we stop thinking of gluster as a "simple" distributed filesystem - even though it may be an oxymoron technically, but I guess you know what I mean :) it's a slippery slope towards it becoming "yet another" distributed filesystem.) The simplicity is what "makes" gluster to a large extent what it is. This makes the developer's life miserable to a fair degree, but it anyways always is, one way or another ;) I am not against adopting external projects. There are good reasons many times to do so. If there are external projects which are "compatible in personality" with gluster and helps us avoid reinventing the wheel, we must definitely do so. If they are not compatible, I'm sure there are lessons and ideas we can adopt, if not code. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Wed May 9 04:18:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 9 May 2012 04:18:46 +0000 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: <20120509041846.GB18684@homeworld.netbsd.org> On Mon, May 07, 2012 at 09:33:50PM -0700, Anand Babu Periasamy wrote: > I personally hate Java dependency. Me too. I know Java programs are supposed to have decent performances, but my experiences had always been terrible. Please do not add a dependency on Java. -- Emmanuel Dreyfus manu at netbsd.org From manu at netbsd.org Wed May 9 04:41:47 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 9 May 2012 04:41:47 +0000 Subject: [Gluster-devel] glusterfs-3.3.0qa40 released In-Reply-To: <20120507081553.5AA00100C5@build.gluster.com> References: <20120507081553.5AA00100C5@build.gluster.com> Message-ID: <20120509044147.GC18684@homeworld.netbsd.org> On Mon, May 07, 2012 at 01:15:50AM -0700, Vijay Bellur wrote: > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz Hi There is a small issue with python: the machine that runs autoconf only has python 2.5 installed, and as a result, the generated configure script fails to detect an installed python 2.6 or higher. Here is an example at mine, where python 2.7 is installed: checking for a Python interpreter with version >= 2.4... none configure: error: no suitable Python interpreter found That can be fixed by patching configure, but it would be nice if gluster builds could contain the check with latest python. -- Emmanuel Dreyfus manu at netbsd.org From renqiang at 360buy.com Wed May 9 04:46:08 2012 From: renqiang at 360buy.com (=?gb2312?B?yM7Hvw==?=) Date: Wed, 9 May 2012 12:46:08 +0800 Subject: [Gluster-devel] How to repair a 1TB disk in 30 mins Message-ID: <000301cd2d9e$a6b07fc0$f4117f40$@com> Dear All: I have a question. 
When I have a large cluster, maybe more than 10PB data, if a file have 3 copies and each disk have 1TB capacity, So we need about 30,000 disks. All disks are very cheap and are easily damaged. We must repair a 1TB disk in 30 mins?As far as I know?in gluster architecture?all data in the damaged disk will be repaired to the new disk which is used to replace the damaged disk. As a result of the writing speed of disk, when we repair 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 mins? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Wed May 9 05:35:40 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 15:35:40 +1000 Subject: [Gluster-devel] automake Message-ID: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Hello, I have built a new module and I can't seem to get the changed makefiles to be built. I have not used "configure" in any of my projects and I'm not seeing an answer from my google searches. The error that I get is during the "make" where glusterfs-3.2.6/missing errors at line 52 "automake-1.9: command not found". This is a newer RedHat environment and it has automake 1.11 .. if I cp 1.11 to 1.9 I get other errors ... libtool is reporting that the automake version is 1.11.1. I believe that it is getting the 1.9 version from Gluster ... How do I get a new Makefile.am and Makefile.in to work in this structure? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From harsha at gluster.com Wed May 9 06:03:00 2012 From: harsha at gluster.com (Harshavardhana) Date: Tue, 8 May 2012 23:03:00 -0700 Subject: [Gluster-devel] automake In-Reply-To: <201205090535.q495Ze5E009996@singularity.tronunltd.com> References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: Ian, Please re-run the ./autogen.sh and use again. Make sure you have added entries in 'configure.ac' and 'Makefile.am' for the respective module name and directory. -Harsha On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > Hello, > > > ?I have built a new module and I can't seem to > get the changed makefiles to be built. ?I have not > used "configure" in any of my projects and I'm > not seeing an answer from my google searches. > > ?The error that I get is during the "make" where > glusterfs-3.2.6/missing errors at line 52 > "automake-1.9: command not found". > > ?This is a newer RedHat environment and it has > automake 1.11 .. if I cp 1.11 to 1.9 I get other > errors ... libtool is reporting that the automake > version is 1.11.1. ?I believe that it is getting the > 1.9 version from Gluster ... > > ?How do I get a new Makefile.am and Makefile.in > to work in this structure? > > > > Cheers, > > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Wed May 9 06:05:54 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 16:05:54 +1000 Subject: [Gluster-devel] automake Message-ID: <201205090605.q4965sPn010223@singularity.tronunltd.com> You're a champion. Thanks Harsha. ----- Original Message ----- >From: "Harshavardhana" >To: "Ian Latter" >Subject: Re: [Gluster-devel] automake >Date: Tue, 08 May 2012 23:03:00 -0700 > > Ian, > > Please re-run the ./autogen.sh and use again. 
> > Make sure you have added entries in 'configure.ac' and 'Makefile.am' > for the respective module name and directory. > > -Harsha > > On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > > Hello, > > > > > > ?I have built a new module and I can't seem to > > get the changed makefiles to be built. ?I have not > > used "configure" in any of my projects and I'm > > not seeing an answer from my google searches. > > > > ?The error that I get is during the "make" where > > glusterfs-3.2.6/missing errors at line 52 > > "automake-1.9: command not found". > > > > ?This is a newer RedHat environment and it has > > automake 1.11 .. if I cp 1.11 to 1.9 I get other > > errors ... libtool is reporting that the automake > > version is 1.11.1. ?I believe that it is getting the > > 1.9 version from Gluster ... > > > > ?How do I get a new Makefile.am and Makefile.in > > to work in this structure? > > > > > > > > Cheers, > > > > > > > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Wed May 9 06:08:41 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 23:08:41 -0700 Subject: [Gluster-devel] automake In-Reply-To: <201205090535.q495Ze5E009996@singularity.tronunltd.com> References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: You might want to read autobook for the general theory behind autotools. Here's a quick summary - aclocal prepares the running of autotools. autoheader prepares autotools to generate a config.h to be consumed by C code configure.ac is the "source" to discover the build system and accept user parameters autoconf converts configure.ac to configure Makefile.am is the "source" to define what is to be built and how. automake converts Makefile.am to Makefile.in till here everything is scripted in ./autogen.sh running configure creates Makefile out of Makefile.in now run make :) Avati On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > Hello, > > > I have built a new module and I can't seem to > get the changed makefiles to be built. I have not > used "configure" in any of my projects and I'm > not seeing an answer from my google searches. > > The error that I get is during the "make" where > glusterfs-3.2.6/missing errors at line 52 > "automake-1.9: command not found". > > This is a newer RedHat environment and it has > automake 1.11 .. if I cp 1.11 to 1.9 I get other > errors ... libtool is reporting that the automake > version is 1.11.1. I believe that it is getting the > 1.9 version from Gluster ... > > How do I get a new Makefile.am and Makefile.in > to work in this structure? > > > > Cheers, > > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From abperiasamy at gmail.com Wed May 9 07:21:35 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Wed, 9 May 2012 00:21:35 -0700 Subject: [Gluster-devel] automake In-Reply-To: References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 11:08 PM, Anand Avati wrote: > You might want to read autobook for the general theory behind autotools. > Here's a quick summary - > > aclocal prepares the running of autotools. > autoheader prepares autotools to generate a config.h to be consumed by C > code > configure.ac is the "source" to discover the build system and accept user > parameters > autoconf converts configure.ac to configure > Makefile.am is the "source" to define what is to be built and how. > automake converts Makefile.am to Makefile.in > > till here everything is scripted in ./autogen.sh > > running configure creates Makefile out of Makefile.in > > now run make :) > Best way to learn autotools is copy-paste-customize. In general, if you are starting a new project, Debian has a nice little tool called "autoproject". It will auto generate autoconf and automake files. Then you start customizing it. GNU project should really merge all these tools in to one simple coherent system. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From abperiasamy at gmail.com Wed May 9 07:54:43 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Wed, 9 May 2012 00:54:43 -0700 Subject: [Gluster-devel] How to repair a 1TB disk in 30 mins In-Reply-To: <000301cd2d9e$a6b07fc0$f4117f40$@com> References: <000301cd2d9e$a6b07fc0$f4117f40$@com> Message-ID: On Tue, May 8, 2012 at 9:46 PM, ?? wrote: > Dear All: > > ? I have a question. When I have a large cluster, maybe more than 10PB data, > if a file have 3 copies and each disk have 1TB capacity, So we need about > 30,000 disks. All disks are very cheap and are easily damaged. We must > ?repair a 1TB disk in 30 mins?As far as I know?in gluster architecture?all > data in the damaged disk will be repaired to the new disk which is used to > replace the damaged disk. As a result of the writing speed of disk, when we > repair 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 > mins? 5 hours is based on SATA 1TB disk copying at ~50MB/s across small and large files + folders. This means, you literally attached the disk to the system and manually transferring the data. I can't think of any other faster way to transfer data on 1TB 7200RPM SATA/SAS disks without bending space-time ;). Larger disks and RAID arrays only makes this worse. This is exactly why we implemented passive self-heal in the first place. GlusterFS heals files on demand (as they are accessed), so applications have least down time or disruption. There is plenty of time to heal the cold data in background. All we should care is minimal down time. Self-heal in 3.3 has some major improvements. It got significantly faster, because healing is performed on the server side entirely (server to server). It can perform granular healing on large files (previously checksum operations used to pause or timeout the VMs). Active-healing (Replicate now remembers pending files and heals them when the failed node comes back. Previously you have to perform name-space wide recursive directory listing). Most importantly self-healing is no longer a blackbox. heal-info can show pending and currently-healing files. 
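For scale, 1 TB at ~50 MB/s is roughly 20,000 seconds, which is where the "more than 5 hours" figure above comes from. The heal visibility mentioned here is driven from the CLI in 3.3; assuming the 3.3 command set (check "gluster volume help" on your build), usage looks like:

  # heal whatever is pending, or force a full crawl
  gluster volume heal myvol
  gluster volume heal myvol full

  # inspect what the self-heal daemon is doing
  gluster volume heal myvol info
  gluster volume heal myvol info healed
  gluster volume heal myvol info heal-failed
  gluster volume heal myvol info split-brain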
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein
From renqiang at 360buy.com Wed May 9 09:29:34 2012 From: renqiang at 360buy.com (Ren Qiang) Date: Wed, 9 May 2012 17:29:34 +0800 Subject: [Gluster-devel] Re: How to repair a 1TB disk in 30 mins In-Reply-To: References: <000301cd2d9e$a6b07fc0$f4117f40$@com> Message-ID: <002601cd2dc6$3f68f4f0$be3aded0$@com> Thank you very much! And I have some questions: 1. What's the capacity of the largest cluster online? And how many nodes are in it? And what is it used for? 2. When we execute 'ls' in a directory, it's very slow if the cluster has too many bricks and too many nodes. Can we do it well? -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: May 9, 2012 15:55 To: renqiang Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] How to repair a 1TB disk in 30 mins On Tue, May 8, 2012 at 9:46 PM, ?? wrote: > Dear All: > > I have a question. When I have a large cluster, maybe more than 10PB data, > if a file have 3 copies and each disk have 1TB capacity, So we need about > 30,000 disks. All disks are very cheap and are easily damaged. We must > repair a 1TB disk in 30 mins?As far as I know?in gluster architecture?all > data in the damaged disk will be repaired to the new disk which is used to > replace the damaged disk. As a result of the writing speed of disk, when we > repair 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 > mins? 5 hours is based on SATA 1TB disk copying at ~50MB/s across small and large files + folders. This means, you literally attached the disk to the system and manually transferring the data. I can't think of any other faster way to transfer data on 1TB 7200RPM SATA/SAS disks without bending space-time ;). Larger disks and RAID arrays only makes this worse. This is exactly why we implemented passive self-heal in the first place. GlusterFS heals files on demand (as they are accessed), so applications have least down time or disruption. There is plenty of time to heal the cold data in background. All we should care is minimal down time. Self-heal in 3.3 has some major improvements. It got significantly faster, because healing is performed on the server side entirely (server to server). It can perform granular healing on large files (previously checksum operations used to pause or timeout the VMs). Active-healing (Replicate now remembers pending files and heals them when the failed node comes back. Previously you have to perform name-space wide recursive directory listing). Most importantly self-healing is no longer a blackbox. heal-info can show pending and currently-healing files. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein
From ian.latter at midnightcode.org Thu May 10 05:47:06 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Thu, 10 May 2012 15:47:06 +1000 Subject: [Gluster-devel] Hide Feature Message-ID: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> Hello, I have published an untested "hide" module (compiled against glusterfs-3.2.6); A simple method for hiding an underlying directory structure from parent/up-stream bricks within GlusterFS. In 2012 this code was spawned from my incomplete 2009 dedupe brick code which used this method to protect its internal hash database from the user, above.
http://midnightcode.org/projects/saturn/code/hide-0.5.tgz I am serious when I mean untested - I've not even loaded the module under Gluster, it simply compiles. Let me know if there are tweaks that should be made or considered. Enjoy. -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 05:55:55 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Thu, 10 May 2012 15:55:55 +1000 Subject: [Gluster-devel] Fuse operations Message-ID: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> Hello, I published the Hide module in order to open a discussion around Fuse operations; http://fuse.sourceforge.net/doxygen/structfuse__operations.html In the dedupe module I want to secure the hash database from direct parent/use manipulation. The approach that I took was to find every GlusterFS file operation (fop) that took a loc_t parameter (as discovered via every xlator that is included in the tarball), in order to do path matching and then pass-through the call or return an error. The problem is that I can't find GlusterFS examples for all of the Fuse operators and, when I stray from the examples (like getattr and utiments), gluster tells me that there are no such xlator fops (at compile time - from the wind and unwind macros). So, I guess; 1) Are all Fuse/FS ops handled by Gluster? 2) Where can I find a complete list of the Gluster fops, and not just those that have been used in existing modules? 3) Is it safe to path match on loc_t? (i.e. is it fully resolved such that I won't find /etc/././././passwd)? This I could test .. Thanks, -- Ian Latter Late night coder .. http://midnightcode.org/ From jdarcy at redhat.com Thu May 10 13:39:21 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 10 May 2012 09:39:21 -0400 Subject: [Gluster-devel] Hide Feature In-Reply-To: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> References: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> Message-ID: <20120510093921.4a9f581a@jdarcy-dt.usersys.redhat.com> On Thu, 10 May 2012 15:47:06 +1000 "Ian Latter" wrote: > I have published an untested "hide" module (compiled > against glusterfs-3.2.6); > > A simple method for hiding an underlying directory > structure from parent/up-stream bricks within > GlusterFS. In 2012 this code was spawned from > my incomplete 2009 dedupe brick code which used > this method to protect its internal hash database > from the user, above. > > http://midnightcode.org/projects/saturn/code/hide-0.5.tgz > > > I am serious when I mean untested - I've not even > loaded the module under Gluster, it simply compiles. > > > Let me know if there are tweaks that should be made > or considered. A couple of comments: * It should be sufficient to fail lookup for paths that match your pattern. If that fails, the caller will never get to any others. You can use the quota translator as an example for something like this. * If you want to continue supporting this yourself, then you can just leave the code as it is, though in that case you'll want to consider building it "out of tree" as I describe in my "Translator 101" post[1] or do for some of my own translators[2]. Otherwise you'll need to submit it as a patch through Gerrit according to our standard workflow[3]. You'll also need to fix some of the idiosyncratic indentation. I don't remember the current policy wrt copyright assignment, but that might be required too. 
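Since the quota translator is a fairly involved example, here is a bare sketch of the "fail lookup" approach described above, written against the 3.2.x tree this module is built for (3.3 renames the last argument to xdata). The hardcoded prefix is an illustrative stand-in for whatever option the real module would read in init():

#include <string.h>
#include <errno.h>

#include "xlator.h"
#include "defaults.h"

/* illustrative only: the real module would take this from a volume option */
static const char *hide_prefix = "/.dedupe-db";

int32_t
hide_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc,
             dict_t *xattr_req)
{
        if (loc && loc->path &&
            !strncmp (loc->path, hide_prefix, strlen (hide_prefix))) {
                /* refuse the name; once lookup fails the caller never
                 * reaches any other fop on this path */
                STACK_UNWIND_STRICT (lookup, frame, -1, ENOENT,
                                     NULL, NULL, NULL, NULL);
                return 0;
        }

        /* pass everything else straight through, reusing the stock
         * callback instead of writing a trivial one */
        STACK_WIND (frame, default_lookup_cbk,
                    FIRST_CHILD (this),
                    FIRST_CHILD (this)->fops->lookup,
                    loc, xattr_req);
        return 0;
}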
[1] http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/ [2] https://github.com/jdarcy/negative-lookup [3] http://www.gluster.org/community/documentation/index.php/Development_Work_Flow From jdarcy at redhat.com Thu May 10 13:58:51 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 10 May 2012 09:58:51 -0400 Subject: [Gluster-devel] Fuse operations In-Reply-To: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> References: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> Message-ID: <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> On Thu, 10 May 2012 15:55:55 +1000 "Ian Latter" wrote: > So, I guess; > 1) Are all Fuse/FS ops handled by Gluster? > 2) Where can I find a complete list of the > Gluster fops, and not just those that have > been used in existing modules? GlusterFS operations for a translator are all defined in an xlator_fops structure. When building translators, it can also be convenient to look at the default_xxx and default_xxx_cbk functions for each fop you implement. Also, I forgot to mention in my comments on your "hide" translator that you can often use the default_xxx_cbk callback when you call STACK_WIND, instead of having to define your own trivial one. FUSE operations are listed by the fuse_opcode enum. You can check for yourself how closely this matches our list. They do have a few ops of their own, we have a few of their own, and a few of theirs actually map to our xlator_cbks instead of xlator_fops. The points of non-correspondence seem to be interrupt, bmap, poll and ioctl. Maybe Csaba can elaborate on what we do (or plan to do) about these. > 3) Is it safe to path match on loc_t? (i.e. is > it fully resolved such that I won't find > /etc/././././passwd)? This I could test .. Name/path resolution is an area that has changed pretty recently, so I'll let Avati or Amar field that one. From anand.avati at gmail.com Thu May 10 19:36:26 2012 From: anand.avati at gmail.com (Anand Avati) Date: Thu, 10 May 2012 12:36:26 -0700 Subject: [Gluster-devel] Fuse operations In-Reply-To: <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> References: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> Message-ID: On Thu, May 10, 2012 at 6:58 AM, Jeff Darcy wrote: > On Thu, 10 May 2012 15:55:55 +1000 > "Ian Latter" wrote: > > > So, I guess; > > 1) Are all Fuse/FS ops handled by Gluster? > > 2) Where can I find a complete list of the > > Gluster fops, and not just those that have > > been used in existing modules? > > GlusterFS operations for a translator are all defined in an xlator_fops > structure. When building translators, it can also be convenient to > look at the default_xxx and default_xxx_cbk functions for each fop you > implement. Also, I forgot to mention in my comments on your "hide" > translator that you can often use the default_xxx_cbk callback when you > call STACK_WIND, instead of having to define your own trivial one. > > FUSE operations are listed by the fuse_opcode enum. You can check for > yourself how closely this matches our list. They do have a few ops of > their own, we have a few of their own, and a few of theirs actually map > to our xlator_cbks instead of xlator_fops. The points of > non-correspondence seem to be interrupt, bmap, poll and ioctl. Maybe > Csaba can elaborate on what we do (or plan to do) about these. > > We might support interrupt sometime. Bmap - probably never. Poll, maybe. Ioctl - depeneds on what type of ioctl and requirement. 
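To connect that to an actual translator source file: the xlator_fops table mentioned above is just a pair of globals plus init/fini, roughly as in this sketch (option handling and private state omitted). Any fop left out of the table is filled in with the matching default_* pass-through when the graph is loaded.

int32_t
init (xlator_t *this)
{
        if (!this->children || this->children->next) {
                gf_log (this->name, GF_LOG_ERROR,
                        "translator needs exactly one subvolume");
                return -1;
        }
        return 0;
}

void
fini (xlator_t *this)
{
        return;
}

struct xlator_fops fops = {
        .lookup = hide_lookup,          /* from the earlier sketch */
};

struct xlator_cbks cbks = {
};

struct volume_options options[] = {
        { .key  = {"prefix"},
          .type = GF_OPTION_TYPE_PATH },
        { .key  = {NULL} },
};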
> > 3) Is it safe to path match on loc_t? (i.e. is > > it fully resolved such that I won't find > > /etc/././././passwd)? This I could test .. > > Name/path resolution is an area that has changed pretty recently, so > I'll let Avati or Amar field that one. > The ".." interpretation is done by the client side VFS. Internal path construction does not use ".." and are always normalized. There are new situations where we now support non-absolute paths, but those are for GFID based addressing and ".." does not come into picture there. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From johnmark at redhat.com Thu May 10 21:41:08 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 10 May 2012 17:41:08 -0400 (EDT) Subject: [Gluster-devel] Bugzilla upgrade & planned outage - May 22 In-Reply-To: Message-ID: Pasting an email from bugzilla-announce: Red Hat Bugzilla (bugzilla.redhat.com) will be unavailable on May 22nd starting at 6 p.m. EDT [2200 UTC] to perform an upgrade from Bugzilla 3.6 to Bugzilla 4.2. We are hoping to be complete in no more than 3 hours barring any problems. Any services relying on bugzilla.redhat.com may not work properly during this time. Please be aware in case you need use of those services during the outage. Also *PLEASE* make sure any scripts or other external applications that rely on bugzilla.redhat.com are tested against our test server before the upgrade if you have not done so already. Let the Bugzilla Team know immediately of any issues found by reporting the bug in bugzilla.redhat.com against the Bugzilla product, version 4.2. A summary of the RPC changes is also included below. RPC changes from upstream Bugzilla 4.2: - Bug.* returns arrays for components, versions and aliases - Bug.* returns target_release array - Bug.* returns flag information (from Bugzilla 4.4) - Bug.search supports searching on keywords, dependancies, blocks - Bug.search supports quick searches, saved searches and advanced searches - Group.get has been added - Component.* and Flag.* have been added - Product.get has a component_names option to return just the component names. RPC changes from Red Hat Bugzilla 3.6: - This list may be incomplete. - This list excludes upstream changes from 3.6 that we inherited - Bug.update calls may use different column names. For example, in 3.6 you updated the 'short_desc' key if you wanted to change the summary. Now you must use the 'summary' key. This may be an inconeniance, but will make it much more maintainable in the long run. - Bug.search_new new becomes Bug.search. The 3.6 version of Bug.search is no longer available. - Product.* has been changed to match upstream code - Group.create has been added - RedHat.* and bugzilla.* calls that mirror official RPC calls are officially depreciated, and will be removed approximately two months after Red Hat Bugzilla 4.2 is released. To test against the new beta Bugzilla server, go to https://partner-bugzilla.redhat.com/ Thanks, JM From ian.latter at midnightcode.org Thu May 10 22:25:02 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:25:02 +1000 Subject: [Gluster-devel] automake Message-ID: <201205102225.q4AMP2X2018428@singularity.tronunltd.com> Thanks Avati, Yes, when I said that I hadn't use "configure" I meant "autotools" (though I didn't know it :) I think almost every project I download and build from scratch uses configure .. the last time I looked at the autotools was a few years ago now, maybe its time for a re-look .. 
my libraries are getting big enough to warrant it I suppose. Hadn't seen autogen before .. thanks for your help. Cheers, ----- Original Message ----- >From: "Anand Avati" >To: "Ian Latter" >Subject: Re: [Gluster-devel] automake >Date: Tue, 08 May 2012 23:08:41 -0700 > > You might want to read autobook for the general theory behind autotools. > Here's a quick summary - > > aclocal prepares the running of autotools. > autoheader prepares autotools to generate a config.h to be consumed by C > code > configure.ac is the "source" to discover the build system and accept user > parameters > autoconf converts configure.ac to configure > Makefile.am is the "source" to define what is to be built and how. > automake converts Makefile.am to Makefile.in > > till here everything is scripted in ./autogen.sh > > running configure creates Makefile out of Makefile.in > > now run make :) > > Avati > > On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > > > Hello, > > > > > > I have built a new module and I can't seem to > > get the changed makefiles to be built. I have not > > used "configure" in any of my projects and I'm > > not seeing an answer from my google searches. > > > > The error that I get is during the "make" where > > glusterfs-3.2.6/missing errors at line 52 > > "automake-1.9: command not found". > > > > This is a newer RedHat environment and it has > > automake 1.11 .. if I cp 1.11 to 1.9 I get other > > errors ... libtool is reporting that the automake > > version is 1.11.1. I believe that it is getting the > > 1.9 version from Gluster ... > > > > How do I get a new Makefile.am and Makefile.in > > to work in this structure? > > > > > > > > Cheers, > > > > > > > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > > -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 22:26:22 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:26:22 +1000 Subject: [Gluster-devel] automake Message-ID: <201205102226.q4AMQMEC018461@singularity.tronunltd.com> > > You might want to read autobook for the general theory behind autotools. > > Here's a quick summary - > > > > aclocal prepares the running of autotools. > > autoheader prepares autotools to generate a config.h to be consumed by C > > code > > configure.ac is the "source" to discover the build system and accept user > > parameters > > autoconf converts configure.ac to configure > > Makefile.am is the "source" to define what is to be built and how. > > automake converts Makefile.am to Makefile.in > > > > till here everything is scripted in ./autogen.sh > > > > running configure creates Makefile out of Makefile.in > > > > now run make :) > > > > Best way to learn autotools is copy-paste-customize. In general, if > you are starting a new project, Debian has a nice little tool called > "autoproject". It will auto generate autoconf and automake files. Then > you start customizing it. > > GNU project should really merge all these tools in to one simple > coherent system. My build environment is Fedora but I'm assuming its there too .. if I get some time I'll have a poke around .. Thanks for the info, appreciate it. -- Ian Latter Late night coder .. 
http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 22:44:32 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:44:32 +1000 Subject: [Gluster-devel] Hide Feature Message-ID: <201205102244.q4AMiW2Z018543@singularity.tronunltd.com> Sorry for the re-send Jeff, I managed to screw up the CC so the list didn't get it; > > Let me know if there are tweaks that should be made > > or considered. > > A couple of comments: > > * It should be sufficient to fail lookup for paths that > match your pattern. If that fails, the caller will > never get to any others. You can use the quota > translator as an example for something like this. Ok, this is interesting. So if someone calls another fop .. say "open" ... against my brick/module, something (Fuse?) will make another, dependent, call to lookup first? If that's true then I can cut this all down to size. > * If you want to continue supporting this yourself, > then you can just leave the code as it is, though in > that case you'll want to consider building it "out of > tree" as I describe in my "Translator 101" post[1] > or do for some of my own translators[2]. > Otherwise you'll need to submit it as a patch > through Gerrit according to our standard workflow[3]. Thanks for the Translator articles/posts, I hadn't seen those. Per my previous patches, I'll publish code on my site under the GPL and you guys (Gluster/RedHat) can run them through whatever processes you choose. If it gets included in the GlusterFS package, then that's fine. If it gets ignored by the GlusterFS package, then that's fine also. > You'll also need to fix some of the idiosyncratic > indentation. I don't remember the current policy wrt > copyright assignment, but that might be required too. The weird indentation style used is not mine .. its what I gathered from the Gluster code that I read through. > [1] > http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/ > > [2] https://github.com/jdarcy/negative-lookup > > [3] > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 23:39:58 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 09:39:58 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205102339.q4ANdwg8018739@singularity.tronunltd.com> > > Sure, I have my own vol files that do (did) what I wanted > > and I was supporting myself (and users); the question > > (and the point) is what is the GlusterFS *intent*? > > > The "intent" (more or less - I hate to use the word as it can imply a > commitment to what I am about to say, but there isn't one) is to keep the > bricks (server process) dumb and have the intelligence on the client side. > This is a "rough goal". There are cases where replication on the server > side is inevitable (in the case of NFS access) but we keep the software > architecture undisturbed by running a client process on the server machine > to achieve it. [There's a difference between intent and plan/roadmap] Okay. Unfortunately I am unable to leverage this - I tried to serve a Fuse->GlusterFS client mount point (of a Distribute volume) as a GlusterFS posix brick (for a Replicate volume) and it wouldn't play ball .. > We do plan to support "replication on the server" in the future while still > retaining the existing software architecture as much as possible. 
This is > particularly useful in Hadoop environment where the jobs expect write > performance of a single copy and expect copy to happen in the background. > We have the proactive self-heal daemon running on the server machines now > (which again is a client process which happens to be physically placed on > the server) which gives us many interesting possibilities - i.e, with > simple changes where we fool the client side replicate translator at the > time of transaction initiation that only the closest server is up at that > point of time and write to it alone, and have the proactive self-heal > daemon perform the extra copies in the background. This would be consistent > with other readers as they get directed to the "right" version of the file > by inspecting the changelogs while the background replication is in > progress. > > The intention of the above example is to give a general sense of how we > want to evolve the architecture (i.e, the "intention" you were referring > to) - keep the clients intelligent and servers dumb. If some intelligence > needs to be built on the physical server, tackle it by loading a client > process there (there are also "pathinfo xattr" kind of internal techniques > to figure out locality of the clients in a generic way without bringing > "server sidedness" into them in a harsh way) Okay .. But what happened to the "brick" architecture of stacking anything on anything? I think you point that out here ... > I'll > > write an rsyncd wrapper myself, to run on top of Gluster, > > if the intent is not allow the configuration I'm after > > (arbitrary number of disks in one multi-host environment > > replicated to an arbitrary number of disks in another > > multi-host environment, where ideally each environment > > need not sum to the same data capacity, presented in a > > single contiguous consumable storage layer to an > > arbitrary number of unintelligent clients, that is as fault > > tolerant as I choose it to be including the ability to add > > and offline/online and remove storage as I so choose) .. > > or switch out the whole solution if Gluster is heading > > away from my needs. I just need to know what the > > direction is .. I may even be able to help get you there if > > you tell me :) > > > > > There are good and bad in both styles (distribute on top v/s replicate on > top). Replicate on top gives you much better flexibility of configuration. > Distribute on top is easier for us developers. As a user I would like > replicate on top as well. But the problem today is that replicate (and > self-heal) does not understand "partial failure" of its subvolumes. If one > of the subvolume of replicate is a distribute, then today's replicate only > understands complete failure of the distribute set or it assumes everything > is completely fine. An example is self-healing of directory entries. If a > file is "missing" in one subvolume because a distribute node is temporarily > down, replicate has no clue why it is missing (or that it should keep away > from attempting to self-heal). Along the same lines, it does not know that > once a server is taken off from its distribute subvolume for good that it > needs to start recreating missing files. Hmm. I loved the brick idea. I don't like perverting it by trying to "see through" layers. 
In that context I can see two or three expected outcomes from someone building this type of stack (heh: a quick trick brick stack) - when a distribute child disappears; At the Distribute layer; 1) The distribute name space / stat space remains in tact, though the content is obviously not avail. 2) The distribute presentation is pure and true of its constituents, showing only the names / stats that are online/avail. In its standalone case, 2 is probably preferable as it allows clean add/start/stop/ remove capacity. At the Replicate layer; 3) replication occurs only where the name / stat space shows a gap 4) the replication occurs at any delta I don't think there's a real choice here, even if 3 were sensible, what would replicate do if there was a local name and even just a remote file size change, when there's no local content to update; it must be 4. In which case, I would expect that a replicate on top of a distribute with a missing child would suddenly see a delta that it would immediately set about repairing. > The effort to fix this seems to be big enough to disturb the inertia of > status quo. If this is fixed, we can definitely adopt a replicate-on-top > mode in glusterd. I'm not sure why there needs to be a "fix" .. wasn't the previous behaviour sensible? Or, if there is something to "change", then bolstering the distribute module might be enough - a combination of 1 and 2 above. Try this out: what if the Distribute layer maintained a full name space on each child, and didn't allow "recreation"? Say 3 children, one is broken/offline, so that /path/to/child/3/file is missing but is known to be missing (internally to Distribute). Then the Distribute brick can both not show the name space to the parent layers, but can also actively prevent manipulation of those files (the parent can neither stat /path/to/child/3/file nor unlink, nor create/write to it). If this change is meant to be permanent, then the administrative act of removing the child from distribute will then truncate the locked name space, allowing parents (be they users or other bricks, like Replicate) to act as they please (such as recreating the missing files). If you adhere to the principles that I thought I understood from 2009 or so then you should be able to let the users create unforeseen Gluster architectures without fear or impact. I.e. i) each brick is fully self contained * ii) physical bricks are the bread of a brick stack sandwich ** iii) any logical brick can appear above/below any other logical brick in a brick stack * Not mandating a 1:1 file mapping from layer to layer ** Eg: the Posix (bottom), Client (bottom), Server (top) and NFS (top) are all regarded as physical bricks. Thus it was my expectation that a dedupe brick (being logical) could either go above or below a distribute brick (also logical), for example. Or that an encryption brick could go on top of replicate which was on top of encryption which was on top of distribute which was on top of encryption on top of posix, for example. Or .. am I over simplifying the problem space? -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 23:52:43 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 09:52:43 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205102352.q4ANqhc6018790@singularity.tronunltd.com> Actually, I want to clarify this point; > But the problem today is that replicate (and > self-heal) does not understand "partial failure" > of its subvolumes. 
If one of the subvolume of > replicate is a distribute, then today's replicate > only understands complete failure of the > distribute set or it assumes everything is > completely fine. I haven't seen this in practice .. I have seen replicate attempt to repair anything that was "missing" and that both the replicate and the underlying bricks were still viable storage layers in that process ... ----- Original Message ----- >From: "Ian Latter" >To: "Anand Avati" >Subject: Re: [Gluster-devel] ZkFarmer >Date: Fri, 11 May 2012 09:39:58 +1000 > > > > Sure, I have my own vol files that do (did) what I wanted > > > and I was supporting myself (and users); the question > > > (and the point) is what is the GlusterFS *intent*? > > > > > > The "intent" (more or less - I hate to use the word as it > can imply a > > commitment to what I am about to say, but there isn't one) > is to keep the > > bricks (server process) dumb and have the intelligence on > the client side. > > This is a "rough goal". There are cases where replication > on the server > > side is inevitable (in the case of NFS access) but we keep > the software > > architecture undisturbed by running a client process on > the server machine > > to achieve it. > > [There's a difference between intent and plan/roadmap] > > Okay. Unfortunately I am unable to leverage this - I tried > to serve a Fuse->GlusterFS client mount point (of a > Distribute volume) as a GlusterFS posix brick (for a > Replicate volume) and it wouldn't play ball .. > > > We do plan to support "replication on the server" in the > future while still > > retaining the existing software architecture as much as > possible. This is > > particularly useful in Hadoop environment where the jobs > expect write > > performance of a single copy and expect copy to happen in > the background. > > We have the proactive self-heal daemon running on the > server machines now > > (which again is a client process which happens to be > physically placed on > > the server) which gives us many interesting possibilities > - i.e, with > > simple changes where we fool the client side replicate > translator at the > > time of transaction initiation that only the closest > server is up at that > > point of time and write to it alone, and have the > proactive self-heal > > daemon perform the extra copies in the background. This > would be consistent > > with other readers as they get directed to the "right" > version of the file > > by inspecting the changelogs while the background > replication is in > > progress. > > > > The intention of the above example is to give a general > sense of how we > > want to evolve the architecture (i.e, the "intention" you > were referring > > to) - keep the clients intelligent and servers dumb. If > some intelligence > > needs to be built on the physical server, tackle it by > loading a client > > process there (there are also "pathinfo xattr" kind of > internal techniques > > to figure out locality of the clients in a generic way > without bringing > > "server sidedness" into them in a harsh way) > > Okay .. But what happened to the "brick" architecture > of stacking anything on anything? I think you point > that out here ... 
> > > > I'll > > > write an rsyncd wrapper myself, to run on top of Gluster, > > > if the intent is not allow the configuration I'm after > > > (arbitrary number of disks in one multi-host environment > > > replicated to an arbitrary number of disks in another > > > multi-host environment, where ideally each environment > > > need not sum to the same data capacity, presented in a > > > single contiguous consumable storage layer to an > > > arbitrary number of unintelligent clients, that is as fault > > > tolerant as I choose it to be including the ability to add > > > and offline/online and remove storage as I so choose) .. > > > or switch out the whole solution if Gluster is heading > > > away from my needs. I just need to know what the > > > direction is .. I may even be able to help get you there if > > > you tell me :) > > > > > > > > There are good and bad in both styles (distribute on top > v/s replicate on > > top). Replicate on top gives you much better flexibility > of configuration. > > Distribute on top is easier for us developers. As a user I > would like > > replicate on top as well. But the problem today is that > replicate (and > > self-heal) does not understand "partial failure" of its > subvolumes. If one > > of the subvolume of replicate is a distribute, then > today's replicate only > > understands complete failure of the distribute set or it > assumes everything > > is completely fine. An example is self-healing of > directory entries. If a > > file is "missing" in one subvolume because a distribute > node is temporarily > > down, replicate has no clue why it is missing (or that it > should keep away > > from attempting to self-heal). Along the same lines, it > does not know that > > once a server is taken off from its distribute subvolume > for good that it > > needs to start recreating missing files. > > Hmm. I loved the brick idea. I don't like perverting it by > trying to "see through" layers. In that context I can see > two or three expected outcomes from someone building > this type of stack (heh: a quick trick brick stack) - when > a distribute child disappears; > > At the Distribute layer; > 1) The distribute name space / stat space > remains in tact, though the content is > obviously not avail. > 2) The distribute presentation is pure and true > of its constituents, showing only the names > / stats that are online/avail. > > In its standalone case, 2 is probably > preferable as it allows clean add/start/stop/ > remove capacity. > > At the Replicate layer; > 3) replication occurs only where the name / > stat space shows a gap > 4) the replication occurs at any delta > > I don't think there's a real choice here, even > if 3 were sensible, what would replicate do if > there was a local name and even just a remote > file size change, when there's no local content > to update; it must be 4. > > In which case, I would expect that a replicate > on top of a distribute with a missing child would > suddenly see a delta that it would immediately > set about repairing. > > > > The effort to fix this seems to be big enough to disturb > the inertia of > > status quo. If this is fixed, we can definitely adopt a > replicate-on-top > > mode in glusterd. > > I'm not sure why there needs to be a "fix" .. wasn't > the previous behaviour sensible? > > Or, if there is something to "change", then > bolstering the distribute module might be enough - > a combination of 1 and 2 above. 
> > Try this out: what if the Distribute layer maintained > a full name space on each child, and didn't allow > "recreation"? Say 3 children, one is broken/offline, > so that /path/to/child/3/file is missing but is known > to be missing (internally to Distribute). Then the > Distribute brick can both not show the name > space to the parent layers, but can also actively > prevent manipulation of those files (the parent > can neither stat /path/to/child/3/file nor unlink, nor > create/write to it). If this change is meant to be > permanent, then the administrative act of > removing the child from distribute will then > truncate the locked name space, allowing parents > (be they users or other bricks, like Replicate) to > act as they please (such as recreating the > missing files). > > If you adhere to the principles that I thought I > understood from 2009 or so then you should be > able to let the users create unforeseen Gluster > architectures without fear or impact. I.e. > > i) each brick is fully self contained * > ii) physical bricks are the bread of a brick > stack sandwich ** > iii) any logical brick can appear above/below > any other logical brick in a brick stack > > * Not mandating a 1:1 file mapping from layer > to layer > > ** Eg: the Posix (bottom), Client (bottom), > Server (top) and NFS (top) are all > regarded as physical bricks. > > Thus it was my expectation that a dedupe brick > (being logical) could either go above or below > a distribute brick (also logical), for example. > > Or that an encryption brick could go on top > of replicate which was on top of encryption > which was on top of distribute which was on > top of encryption on top of posix, for example. > > > Or .. am I over simplifying the problem space? > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From vbellur at redhat.com Fri May 11 07:06:38 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Fri, 11 May 2012 12:36:38 +0530 Subject: [Gluster-devel] release-3.3 branched out Message-ID: <4FACBA7E.6090801@redhat.com> A new branch release-3.3 has been created. You can checkout the branch via: $git checkout -b release-3.3 origin/release-3.3 rfc.sh has been updated to send patches to the appropriate branch. The plan is to have all 3.3.x releases happen off this branch. If you need any fix to be part of a 3.3.x release, please send out a backport of the same from master to release-3.3 after it has been accepted in master. Thanks, Vijay From manu at netbsd.org Fri May 11 07:29:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 11 May 2012 07:29:20 +0000 Subject: [Gluster-devel] release-3.3 branched out In-Reply-To: <4FACBA7E.6090801@redhat.com> References: <4FACBA7E.6090801@redhat.com> Message-ID: <20120511072920.GG18684@homeworld.netbsd.org> On Fri, May 11, 2012 at 12:36:38PM +0530, Vijay Bellur wrote: > A new branch release-3.3 has been created. You can checkout the branch via: Any chance someone merge my build fixes so that I can pullup to the new branch? 
http://review.gluster.com/3238 -- Emmanuel Dreyfus manu at netbsd.org From vbellur at redhat.com Fri May 11 07:43:13 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Fri, 11 May 2012 13:13:13 +0530 Subject: [Gluster-devel] release-3.3 branched out In-Reply-To: <20120511072920.GG18684@homeworld.netbsd.org> References: <4FACBA7E.6090801@redhat.com> <20120511072920.GG18684@homeworld.netbsd.org> Message-ID: <4FACC311.5020708@redhat.com> On 05/11/2012 12:59 PM, Emmanuel Dreyfus wrote: > On Fri, May 11, 2012 at 12:36:38PM +0530, Vijay Bellur wrote: >> A new branch release-3.3 has been created. You can checkout the branch via: > Any chance someone merge my build fixes so that I can pullup to the > new branch? > http://review.gluster.com/3238 Merged to master. Vijay From vijay at build.gluster.com Fri May 11 10:35:24 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Fri, 11 May 2012 03:35:24 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa41 released Message-ID: <20120511103527.5809B18009D@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa41/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa41.tar.gz This release is made off v3.3.0qa41 From 7220022 at gmail.com Sat May 12 15:22:57 2012 From: 7220022 at gmail.com (7220022) Date: Sat, 12 May 2012 19:22:57 +0400 Subject: [Gluster-devel] Gluster VSA for VMware ESX Message-ID: <012701cd3053$1d2e6110$578b2330$@gmail.com> Would love to test performance of Gluster Virtual Storage Appliance for VMware, but cannot get the demo. Emails and calls to Red Hat went unanswered. We've built a nice test system for the cluster at our lab, 8 modern servers running ESX4.1 and connected via 40gb InfiniBand fabric. Each server has 24 2.5" drives, SLC SSD and 10K SAS HDD-s connected to 6 LSI controllers with CacheCade (Pro 2.0 with write cache enabled,) 4 drives per controller. The plan is to test performance using bricks made of HDD-s cached with SSD-s, as well as HDD-s and SSD-s separately. Can anyone help getting the demo version of VSA? It's fine if it's a beta version, we just wanted to check the performance and scalability. -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Sun May 13 08:27:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 13 May 2012 10:27:20 +0200 Subject: [Gluster-devel] buffer corruption in io-stats Message-ID: <1kk12tm.1awqq7kf1joseM%manu@netbsd.org> I get a reproductible SIGSEGV with sources from latest git. iosfd is overwritten by the file path, it seems there is a confusion somewhere between iosfd->filename pointer value and pointed buffer (gdb) bt #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb37a7 in __gf_free (free_ptr=0x74656e2f) at mem-pool.c:258 #3 0xb9a85378 in io_stats_release (this=0xba3e3000, fd=0xb8f9d098) at io-stats.c:2420 #4 0xbbbafcc0 in fd_destroy (fd=0xb8f9d098) at fd.c:507 #5 0xbbbafdf8 in fd_unref (fd=0xb8f9d098) at fd.c:543 #6 0xbbbaf7cf in gf_fdptr_put (fdtable=0xbb77d070, fd=0xb8f9d098) at fd.c:393 #7 0xbb821147 in fuse_release () from /usr/local/lib/glusterfs/3git/xlator/mount/fuse.so #8 0xbb82a2e1 in fuse_thread_proc () from /usr/local/lib/glusterfs/3git/xlator/mount/fuse.so (gdb) frame 3 #3 0xb9a85378 in io_stats_release (this=0xba3e3000, fd=0xb8f9d098) at io-stats.c:2420 2420 GF_FREE (iosfd->filename); (gdb) print *iosfd $2 = {filename = 0x74656e2f
, data_written = 3418922014271107938, data_read = 7813586423313035891, block_count_write = {4788563690262784356, 3330756270057407571, 7074933154630937908, 28265, 0 }, block_count_read = { 0 }, opened_at = {tv_sec = 1336897011, tv_usec = 145734}} (gdb) x/10s iosfd 0xbb70f800: "/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin" -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 13 14:42:45 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 13 May 2012 16:42:45 +0200 Subject: [Gluster-devel] python version Message-ID: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> Hi There is a problem with python version detection in the configure script. The machine on which autotools is ran prior releasing glusterfs expands AM_PATH_PYTHON into a script that fails to accept python > 2.4. As I understand, a solution is to concatenate latest automake-1.12/m4/python.m4 into glusterfs' aclocal.m4. That way python up to 3.1 shoul be accepted. Opinions? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From renqiang at 360buy.com Mon May 14 01:20:32 2012 From: renqiang at 360buy.com (=?gb2312?B?yM7Hvw==?=) Date: Mon, 14 May 2012 09:20:32 +0800 Subject: [Gluster-devel] balance stoped Message-ID: <018001cd316f$c25a6f90$470f4eb0$@com> Hi,All! May I ask you a question? When we do balance on a volume, it stopped when moving the 505th?s file 0f 1006 files. Now we cannot restart it and also cannot cancel it. How can I do, please? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Mon May 14 01:22:43 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 11:22:43 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> Hello, I'm looking for a seek (lseek) implementation in one of the modules and I can't see one. Do I need to care about seeking if my module changes the file size (i.e. compresses) in Gluster? I would have thought that I did except that I believe that what I'm reading is that Gluster returns a NONSEEKABLE flag on file open (fuse_kernel.h at line 149). Does this mitigate the need to correct the user seeks? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 07:48:17 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 09:48:17 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> References: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> Message-ID: <4FB0B8C1.4020908@datalab.es> Hello Ian, there is no such thing as an explicit seek in glusterfs. Each readv, writev, (f)truncate and rchecksum have an offset parameter that tells you the position where the operation must be performed. If you make something that changes the size of the file you must make it in a way that it is transparent to upper translators. This means that all offsets you will receive are "real" (in your case, offsets in the uncompressed version of the file). You should calculate in some way the equivalent offset in the compressed version of the file and send it to the correspoding fop of the lower translators. In the same way, you must return in all iatt structures the real size of the file (not the compressed size). I'm not sure what is the intended use of NONSEEKABLE, but I think it is for special file types, like devices or similar that are sequential in nature. 
Anyway, this is a fuse flag that you can't return from a regular translator open fop. Xavi On 05/14/2012 03:22 AM, Ian Latter wrote: > Hello, > > > I'm looking for a seek (lseek) implementation in > one of the modules and I can't see one. > > Do I need to care about seeking if my module > changes the file size (i.e. compresses) in Gluster? > I would have thought that I did except that I believe > that what I'm reading is that Gluster returns a > NONSEEKABLE flag on file open (fuse_kernel.h at > line 149). Does this mitigate the need to correct > the user seeks? > > > Cheers, > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Mon May 14 09:51:59 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 19:51:59 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205140951.q4E9px5H001754@singularity.tronunltd.com> Hello Xavi, Ok - thanks. I was hoping that this was how read and write were working (i.e. with absolute offsets and not just getting relative offsets from the current seek point), however what of the raw seek command? len = lseek(fd, 0, SEEK_END); Upon successful completion, lseek() returns the resulting offset location as measured in bytes from the beginning of the file. Any idea on where the return value comes from? I will need to fake up a file size for this command .. ----- Original Message ----- >From: "Xavier Hernandez" >To: >Subject: Re: [Gluster-devel] lseek >Date: Mon, 14 May 2012 09:48:17 +0200 > > Hello Ian, > > there is no such thing as an explicit seek in glusterfs. Each readv, > writev, (f)truncate and rchecksum have an offset parameter that tells > you the position where the operation must be performed. > > If you make something that changes the size of the file you must make it > in a way that it is transparent to upper translators. This means that > all offsets you will receive are "real" (in your case, offsets in the > uncompressed version of the file). You should calculate in some way the > equivalent offset in the compressed version of the file and send it to > the correspoding fop of the lower translators. > > In the same way, you must return in all iatt structures the real size of > the file (not the compressed size). > > I'm not sure what is the intended use of NONSEEKABLE, but I think it is > for special file types, like devices or similar that are sequential in > nature. Anyway, this is a fuse flag that you can't return from a regular > translator open fop. > > Xavi > > On 05/14/2012 03:22 AM, Ian Latter wrote: > > Hello, > > > > > > I'm looking for a seek (lseek) implementation in > > one of the modules and I can't see one. > > > > Do I need to care about seeking if my module > > changes the file size (i.e. compresses) in Gluster? > > I would have thought that I did except that I believe > > that what I'm reading is that Gluster returns a > > NONSEEKABLE flag on file open (fuse_kernel.h at > > line 149). Does this mitigate the need to correct > > the user seeks? > > > > > > Cheers, > > > > > > > > -- > > Ian Latter > > Late night coder .. 
> > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 10:29:54 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 12:29:54 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205140951.q4E9px5H001754@singularity.tronunltd.com> References: <201205140951.q4E9px5H001754@singularity.tronunltd.com> Message-ID: <4FB0DEA2.3030805@datalab.es> Hello Ian, lseek calls are handled internally by the kernel and they never reach the user land for fuse calls. lseek only updates the current file offset that is stored inside the kernel file's structure. This value is what is passed to read/write fuse calls as an absolute offset. There isn't any problem in this behavior as long as you hide all size manipulations from fuse. If you write a translator that compresses a file, you should do so in a transparent manner. This means, basically, that: 1. Whenever you are asked to return the file size, you must return the size of the uncompressed file 2. Whenever you receive an offset, you must translate that offset to the corresponding offset in the compressed file and work with that 3. Whenever you are asked to read or write data, you must return the number of uncompressed bytes read or written (even if you have compressed the chunk of data to a smaller size and you have physically written less bytes). 4. All read requests must return uncompressed data (this seems obvious though) This guarantees that your manipulations are not seen in any way by any upper translator or even fuse, thus everything should work smoothly. If you respect these rules, lseek (and your translator) will work as expected. In particular, when a user calls lseek with SEEK_END, the kernel takes the size of the file from the internal kernel inode's structure. This size is obtained through a previous call to lookup or updated using the result of write operations. If you respect points 1 and 3, this value will be correct. In gluster there are a lot of fops that return a iatt structure. You must guarantee that all these functions return the correct size of the file in the field ia_size to be sure that everything works as expected. Xavi On 05/14/2012 11:51 AM, Ian Latter wrote: > Hello Xavi, > > > Ok - thanks. I was hoping that this was how read > and write were working (i.e. with absolute offsets > and not just getting relative offsets from the current > seek point), however what of the raw seek > command? > > len = lseek(fd, 0, SEEK_END); > > Upon successful completion, lseek() returns > the resulting offset location as measured in > bytes from the beginning of the file. > > Any idea on where the return value comes from? > I will need to fake up a file size for this command .. > > > > ----- Original Message ----- >> From: "Xavier Hernandez" >> To: >> Subject: Re: [Gluster-devel] lseek >> Date: Mon, 14 May 2012 09:48:17 +0200 >> >> Hello Ian, >> >> there is no such thing as an explicit seek in glusterfs. > Each readv, >> writev, (f)truncate and rchecksum have an offset parameter > that tells >> you the position where the operation must be performed. 
>> >> If you make something that changes the size of the file > you must make it >> in a way that it is transparent to upper translators. This > means that >> all offsets you will receive are "real" (in your case, > offsets in the >> uncompressed version of the file). You should calculate in > some way the >> equivalent offset in the compressed version of the file > and send it to >> the correspoding fop of the lower translators. >> >> In the same way, you must return in all iatt structures > the real size of >> the file (not the compressed size). >> >> I'm not sure what is the intended use of NONSEEKABLE, but > I think it is >> for special file types, like devices or similar that are > sequential in >> nature. Anyway, this is a fuse flag that you can't return > from a regular >> translator open fop. >> >> Xavi >> >> On 05/14/2012 03:22 AM, Ian Latter wrote: >>> Hello, >>> >>> >>> I'm looking for a seek (lseek) implementation in >>> one of the modules and I can't see one. >>> >>> Do I need to care about seeking if my module >>> changes the file size (i.e. compresses) in Gluster? >>> I would have thought that I did except that I believe >>> that what I'm reading is that Gluster returns a >>> NONSEEKABLE flag on file open (fuse_kernel.h at >>> line 149). Does this mitigate the need to correct >>> the user seeks? >>> >>> >>> Cheers, >>> >>> >>> >>> -- >>> Ian Latter >>> Late night coder .. >>> http://midnightcode.org/ >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at nongnu.org >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at nongnu.org >> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Mon May 14 11:18:22 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 21:18:22 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205141118.q4EBIMku002113@singularity.tronunltd.com> Hello Xavier, I don't have a problem with the principles, these were effectively how I was traveling (the notable difference is statfs which I want to pass-through unaffected, reporting the true file system capacity such that a du [stat] may sum to a greater value than a df [statfs]). In 2009 I had a mostly- functional hashing write function and a dubious read function (I stumbled when I had to open a file from within a fop). But I think what you're telling/showing me is that I have no deep understanding of the mapping of the system calls to their Fuse->Gluster fops - which is expected :) And, this is a better outcome than learning that Gluster has gaps in its framework with regard to my objective. I.e. I didn't know that lseek mapped to lookup. And the examples aren't comprehensive enough (rot-13 is the only one that really manipulates content, and it only plays with read and write, obviously because it has a 1:1 relationship with the data). This is the key, and not something that I was expecting; > In gluster there are a lot of fops that return a iatt > structure. You must guarantee that all these > functions return the correct size of the file in > the field ia_size to be sure that everything works > as expected. 
I'll do my best to build a comprehensive list of iatt returning fops from the examples ... but I'd say it'll take a solid peer review to get this hammered out properly. Thanks for steering me straight Xavi, appreciate it. ----- Original Message ----- >From: "Xavier Hernandez" >To: "Ian Latter" >Subject: Re: [Gluster-devel] lseek >Date: Mon, 14 May 2012 12:29:54 +0200 > > Hello Ian, > > lseek calls are handled internally by the kernel and they never reach > the user land for fuse calls. lseek only updates the current file offset > that is stored inside the kernel file's structure. This value is what is > passed to read/write fuse calls as an absolute offset. > > There isn't any problem in this behavior as long as you hide all size > manipulations from fuse. If you write a translator that compresses a > file, you should do so in a transparent manner. This means, basically, that: > > 1. Whenever you are asked to return the file size, you must return the > size of the uncompressed file > 2. Whenever you receive an offset, you must translate that offset to the > corresponding offset in the compressed file and work with that > 3. Whenever you are asked to read or write data, you must return the > number of uncompressed bytes read or written (even if you have > compressed the chunk of data to a smaller size and you have physically > written less bytes). > 4. All read requests must return uncompressed data (this seems obvious > though) > > This guarantees that your manipulations are not seen in any way by any > upper translator or even fuse, thus everything should work smoothly. > > If you respect these rules, lseek (and your translator) will work as > expected. > > In particular, when a user calls lseek with SEEK_END, the kernel takes > the size of the file from the internal kernel inode's structure. This > size is obtained through a previous call to lookup or updated using the > result of write operations. If you respect points 1 and 3, this value > will be correct. > > In gluster there are a lot of fops that return a iatt structure. You > must guarantee that all these functions return the correct size of the > file in the field ia_size to be sure that everything works as expected. > > Xavi > > On 05/14/2012 11:51 AM, Ian Latter wrote: > > Hello Xavi, > > > > > > Ok - thanks. I was hoping that this was how read > > and write were working (i.e. with absolute offsets > > and not just getting relative offsets from the current > > seek point), however what of the raw seek > > command? > > > > len = lseek(fd, 0, SEEK_END); > > > > Upon successful completion, lseek() returns > > the resulting offset location as measured in > > bytes from the beginning of the file. > > > > Any idea on where the return value comes from? > > I will need to fake up a file size for this command .. > > > > > > > > ----- Original Message ----- > >> From: "Xavier Hernandez" > >> To: > >> Subject: Re: [Gluster-devel] lseek > >> Date: Mon, 14 May 2012 09:48:17 +0200 > >> > >> Hello Ian, > >> > >> there is no such thing as an explicit seek in glusterfs. > > Each readv, > >> writev, (f)truncate and rchecksum have an offset parameter > > that tells > >> you the position where the operation must be performed. > >> > >> If you make something that changes the size of the file > > you must make it > >> in a way that it is transparent to upper translators. This > > means that > >> all offsets you will receive are "real" (in your case, > > offsets in the > >> uncompressed version of the file). 
You should calculate in > > some way the > >> equivalent offset in the compressed version of the file > > and send it to > >> the correspoding fop of the lower translators. > >> > >> In the same way, you must return in all iatt structures > > the real size of > >> the file (not the compressed size). > >> > >> I'm not sure what is the intended use of NONSEEKABLE, but > > I think it is > >> for special file types, like devices or similar that are > > sequential in > >> nature. Anyway, this is a fuse flag that you can't return > > from a regular > >> translator open fop. > >> > >> Xavi > >> > >> On 05/14/2012 03:22 AM, Ian Latter wrote: > >>> Hello, > >>> > >>> > >>> I'm looking for a seek (lseek) implementation in > >>> one of the modules and I can't see one. > >>> > >>> Do I need to care about seeking if my module > >>> changes the file size (i.e. compresses) in Gluster? > >>> I would have thought that I did except that I believe > >>> that what I'm reading is that Gluster returns a > >>> NONSEEKABLE flag on file open (fuse_kernel.h at > >>> line 149). Does this mitigate the need to correct > >>> the user seeks? > >>> > >>> > >>> Cheers, > >>> > >>> > >>> > >>> -- > >>> Ian Latter > >>> Late night coder .. > >>> http://midnightcode.org/ > >>> > >>> _______________________________________________ > >>> Gluster-devel mailing list > >>> Gluster-devel at nongnu.org > >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel > >> _______________________________________________ > >> Gluster-devel mailing list > >> Gluster-devel at nongnu.org > >> https://lists.nongnu.org/mailman/listinfo/gluster-devel > >> > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 11:47:10 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 13:47:10 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205141118.q4EBIMku002113@singularity.tronunltd.com> References: <201205141118.q4EBIMku002113@singularity.tronunltd.com> Message-ID: <4FB0F0BE.9030009@datalab.es> Hello Ian, I didn't thought in statfs. In this special case things are a bit harder for a compression translator. I think it's impossible to return accurate data without a considerable amount of work. Maybe some estimation of the available space based on the current achieved mean compression ratio would be sufficient, but never accurate. With more work you could even be able to say exactly how much space have been used, but the best you can do with the remaining space is an estimation. Regarding lseek, there isn't a map with lookup. Probably I haven't explained it as well as I wanted. There are basically two kinds of user mode calls. Those that use a string containing a filename to operate with (stat, unlink, open, creat, ...), and those that use a file descriptor (fstat, read, write, ...). The kernel does not work with names to handle files, so it has to translate the names to inodes to work with them. This means that any call that uses a string will need to make a "lookup" to get the associated inode (the only exception is creat, that creates a new inode without using lookup). This means that every filename based operation can generate a lookup request (although some caching mechanism may reduce the number of calls). 
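To make the rules above a little more concrete, here is a minimal, hypothetical sketch of how a size-changing (for example, compressing) translator could apply them: report the uncompressed size in every iatt it passes up, and translate logical offsets into on-disk offsets before winding a read down. This is an editorial illustration rather than code from this thread; compr_logical_size() and compr_map_offset() are invented helper names standing in for whatever block index such a translator would keep, and the fop signatures follow the 3.3-era headers, so it only builds inside the glusterfs source tree.

/*
 * Illustrative sketch only, not code from this thread. The helpers
 * compr_logical_size() and compr_map_offset() are hypothetical and
 * would consult the translator's own index of compressed blocks.
 */
#include "glusterfs.h"
#include "xlator.h"
#include "defaults.h"

uint64_t compr_logical_size (xlator_t *this, inode_t *inode);
off_t    compr_map_offset (xlator_t *this, inode_t *inode, off_t logical);

int32_t
compr_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                  int32_t op_ret, int32_t op_errno, inode_t *inode,
                  struct iatt *buf, dict_t *xdata, struct iatt *postparent)
{
        /* Rule 1: every iatt passed upwards carries the uncompressed size. */
        if (op_ret == 0)
                buf->ia_size = compr_logical_size (this, inode);

        STACK_UNWIND_STRICT (lookup, frame, op_ret, op_errno,
                             inode, buf, xdata, postparent);
        return 0;
}

int32_t
compr_readv_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 int32_t op_ret, int32_t op_errno, struct iovec *vector,
                 int32_t count, struct iatt *stbuf, struct iobref *iobref,
                 dict_t *xdata)
{
        /* Rules 3 and 4: a real translator would decompress 'vector'
         * here, set op_ret to the number of uncompressed bytes and fix
         * stbuf->ia_size as in lookup_cbk. All of that is elided. */
        STACK_UNWIND_STRICT (readv, frame, op_ret, op_errno,
                             vector, count, stbuf, iobref, xdata);
        return 0;
}

int32_t
compr_readv (call_frame_t *frame, xlator_t *this, fd_t *fd,
             size_t size, off_t offset, uint32_t flags, dict_t *xdata)
{
        /* Rule 2: translate the logical (uncompressed) offset received
         * from FUSE into the physical offset in the compressed file. */
        off_t phys = compr_map_offset (this, fd->inode, offset);

        STACK_WIND (frame, compr_readv_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->readv,
                    fd, size, phys, flags, xdata);
        return 0;
}

The same ia_size fix-up applies to every callback that carries a struct iatt ((f)stat, (f)setattr, writev and so on); statfs is the one place where, as noted above, only an estimate of the remaining space is possible.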
All operations that work with a file descriptor do not generate a lookup request, because the file descriptor is already bound to an inode. In your particular case, to do an lseek you must have made a previous call to open (that would have generated a lookup request) or creat. Hope this better explains how kernel and gluster are bound... Xavi On 05/14/2012 01:18 PM, Ian Latter wrote: > Hello Xavier, > > > I don't have a problem with the principles, these > were effectively how I was traveling (the notable > difference is statfs which I want to pass-through > unaffected, reporting the true file system capacity > such that a du [stat] may sum to a greater value > than a df [statfs]). In 2009 I had a mostly- > functional hashing write function and a dubious > read function (I stumbled when I had to open a > file from within a fop). > > But I think what you're telling/showing me is that > I have no deep understanding of the mapping of > the system calls to their Fuse->Gluster fops - > which is expected :) And, this is a better outcome > than learning that Gluster has gaps in its > framework with regard to my objective. I.e. I > didn't know that lseek mapped to lookup. And > the examples aren't comprehensive enough > (rot-13 is the only one that really manipulates > content, and it only plays with read and write, > obviously because it has a 1:1 relationship with > the data). > > This is the key, and not something that I was > expecting; > >> In gluster there are a lot of fops that return a iatt >> structure. You must guarantee that all these >> functions return the correct size of the file in >> the field ia_size to be sure that everything works >> as expected. > I'll do my best to build a comprehensive list of iatt > returning fops from the examples ... but I'd say it'll > take a solid peer review to get this hammered out > properly. > > Thanks for steering me straight Xavi, appreciate > it. > > > > ----- Original Message ----- >> From: "Xavier Hernandez" >> To: "Ian Latter" >> Subject: Re: [Gluster-devel] lseek >> Date: Mon, 14 May 2012 12:29:54 +0200 >> >> Hello Ian, >> >> lseek calls are handled internally by the kernel and they > never reach >> the user land for fuse calls. lseek only updates the > current file offset >> that is stored inside the kernel file's structure. This > value is what is >> passed to read/write fuse calls as an absolute offset. >> >> There isn't any problem in this behavior as long as you > hide all size >> manipulations from fuse. If you write a translator that > compresses a >> file, you should do so in a transparent manner. This > means, basically, that: >> 1. Whenever you are asked to return the file size, you > must return the >> size of the uncompressed file >> 2. Whenever you receive an offset, you must translate that > offset to the >> corresponding offset in the compressed file and work with that >> 3. Whenever you are asked to read or write data, you must > return the >> number of uncompressed bytes read or written (even if you > have >> compressed the chunk of data to a smaller size and you > have physically >> written less bytes). >> 4. All read requests must return uncompressed data (this > seems obvious >> though) >> >> This guarantees that your manipulations are not seen in > any way by any >> upper translator or even fuse, thus everything should work > smoothly. >> If you respect these rules, lseek (and your translator) > will work as >> expected. 
>> >> In particular, when a user calls lseek with SEEK_END, the > kernel takes >> the size of the file from the internal kernel inode's > structure. This >> size is obtained through a previous call to lookup or > updated using the >> result of write operations. If you respect points 1 and 3, > this value >> will be correct. >> >> In gluster there are a lot of fops that return a iatt > structure. You >> must guarantee that all these functions return the correct > size of the >> file in the field ia_size to be sure that everything works > as expected. >> Xavi >> >> On 05/14/2012 11:51 AM, Ian Latter wrote: >>> Hello Xavi, >>> >>> >>> Ok - thanks. I was hoping that this was how read >>> and write were working (i.e. with absolute offsets >>> and not just getting relative offsets from the current >>> seek point), however what of the raw seek >>> command? >>> >>> len = lseek(fd, 0, SEEK_END); >>> >>> Upon successful completion, lseek() returns >>> the resulting offset location as measured in >>> bytes from the beginning of the file. >>> >>> Any idea on where the return value comes from? >>> I will need to fake up a file size for this command .. >>> >>> >>> >>> ----- Original Message ----- >>>> From: "Xavier Hernandez" >>>> To: >>>> Subject: Re: [Gluster-devel] lseek >>>> Date: Mon, 14 May 2012 09:48:17 +0200 >>>> >>>> Hello Ian, >>>> >>>> there is no such thing as an explicit seek in glusterfs. >>> Each readv, >>>> writev, (f)truncate and rchecksum have an offset parameter >>> that tells >>>> you the position where the operation must be performed. >>>> >>>> If you make something that changes the size of the file >>> you must make it >>>> in a way that it is transparent to upper translators. This >>> means that >>>> all offsets you will receive are "real" (in your case, >>> offsets in the >>>> uncompressed version of the file). You should calculate in >>> some way the >>>> equivalent offset in the compressed version of the file >>> and send it to >>>> the correspoding fop of the lower translators. >>>> >>>> In the same way, you must return in all iatt structures >>> the real size of >>>> the file (not the compressed size). >>>> >>>> I'm not sure what is the intended use of NONSEEKABLE, but >>> I think it is >>>> for special file types, like devices or similar that are >>> sequential in >>>> nature. Anyway, this is a fuse flag that you can't return >>> from a regular >>>> translator open fop. >>>> >>>> Xavi >>>> >>>> On 05/14/2012 03:22 AM, Ian Latter wrote: >>>>> Hello, >>>>> >>>>> >>>>> I'm looking for a seek (lseek) implementation in >>>>> one of the modules and I can't see one. >>>>> >>>>> Do I need to care about seeking if my module >>>>> changes the file size (i.e. compresses) in Gluster? >>>>> I would have thought that I did except that I believe >>>>> that what I'm reading is that Gluster returns a >>>>> NONSEEKABLE flag on file open (fuse_kernel.h at >>>>> line 149). Does this mitigate the need to correct >>>>> the user seeks? >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> >>>>> >>>>> -- >>>>> Ian Latter >>>>> Late night coder .. >>>>> http://midnightcode.org/ >>>>> >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at nongnu.org >>>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at nongnu.org >>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >>>> >>> -- >>> Ian Latter >>> Late night coder .. 
>>> http://midnightcode.org/ >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at nongnu.org >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From kkeithle at redhat.com Mon May 14 14:17:12 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 10:17:12 -0400 Subject: [Gluster-devel] python version In-Reply-To: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> Message-ID: <4FB113E8.0@redhat.com> On 05/13/2012 10:42 AM, Emmanuel Dreyfus wrote: > Hi > > There is a problem with python version detection in the configure > script. The machine on which autotools is ran prior releasing glusterfs > expands AM_PATH_PYTHON into a script that fails to accept python> 2.4. > > As I understand, a solution is to concatenate latest > automake-1.12/m4/python.m4 into glusterfs' aclocal.m4. That way python > up to 3.1 should be accepted. Opinions? The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked by ./autogen.sh file in preparation for building gluster. (You have to run autogen.sh to produce the ./configure file.) aclocal uses whatever python.m4 file you have on your system, e.g. /usr/share/aclocal-1.11/python.m4, which is also from the automake package. I presume whoever packages automake for a particular system is taking into consideration what other packages and versions are standard for the system and picks right version of automake. IOW picks the version of automake that has all the (hard-coded) versions of python to match the python they have on their system. If someone has installed a later version of python and not also updated to a compatible version of automake, that's not a problem that gluster should have to solve, or even try to solve. I don't believe we want to require our build process to download the latest-and-greatest version of automake. As a side note, I sampled a few currently shipping systems and see that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the appearances of supporting python 2.5 (and 3.0). Finally, after all that, note that the configure.ac file appears to be hard-coded to require python 2.x, so if anyone is trying to use python 3.x, that's doomed to fail until configure.ac is "fixed." Do we even know why python 2.x is required and why python 3.x can't be used? -- Kaleb From manu at netbsd.org Mon May 14 14:23:47 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 14:23:47 +0000 Subject: [Gluster-devel] python version In-Reply-To: <4FB113E8.0@redhat.com> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: <20120514142347.GA3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 10:17:12AM -0400, Kaleb S. KEITHLEY wrote: > The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked > by ./autogen.sh file in preparation for building gluster. (You have > to run autogen.sh to produce the ./configure file.) Right, then my plan will not work, and the only way to fix the problem is to upgrade automake on the machine that produces the gluterfs releases. 
> As a side note, I sampled a few currently shipping systems and see > that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and > 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the > appearances of supporting python 2.5 (and 3.0). You seem to take for granted that people building a glusterfs release will run autotools before running configure. This is not the way it should work: a released tarball should contain a configure script that works anywhere. The tarballs released up to at least 3.3.0qa40 have a configure script that cannot detect python > 2.4 -- Emmanuel Dreyfus manu at netbsd.org From kkeithle at redhat.com Mon May 14 14:31:32 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 10:31:32 -0400 Subject: [Gluster-devel] python version In-Reply-To: <20120514142347.GA3985@homeworld.netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> <20120514142347.GA3985@homeworld.netbsd.org> Message-ID: <4FB11744.1040907@redhat.com> On 05/14/2012 10:23 AM, Emmanuel Dreyfus wrote: > On Mon, May 14, 2012 at 10:17:12AM -0400, Kaleb S. KEITHLEY wrote: >> The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked >> by ./autogen.sh file in preparation for building gluster. (You have >> to run autogen.sh to produce the ./configure file.) > > Right, then my plan will not work, and the only way to fix the problem > is to upgrade automake on the machine that produces the glusterfs > releases. > >> As a side note, I sampled a few currently shipping systems and see >> that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and >> 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the >> appearances of supporting python 2.5 (and 3.0). > > You seem to take for granted that people building a glusterfs > release will run autotools before running configure. This is not > the way it should work: a released tarball should contain a > configure script that works anywhere. The tarballs released up to > at least 3.3.0qa40 have a configure script that cannot detect python> 2.4 > I looked at what I get when I checkout the source from the git repo and what I have to do to build from a freshly checked out source tree. And yes, we need to upgrade the build machines were we package the release tarballs. Right now is not a good time to do that. -- Kaleb From yknev.shankar at gmail.com Mon May 14 15:31:56 2012 From: yknev.shankar at gmail.com (Venky Shankar) Date: Mon, 14 May 2012 21:01:56 +0530 Subject: [Gluster-devel] python version In-Reply-To: <4FB113E8.0@redhat.com> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: [snip] > Finally, after all that, note that the configure.ac file appears to be > hard-coded to require python 2.x, so if anyone is trying to use python 3.x, > that's doomed to fail until configure.ac is "fixed." Do we even know why > python 2.x is required and why python 3.x can't be used? > python 2.x is required by geo-replication. Although geo-replication is code ready for python 3.x, it's not functionally tested with it. That's the reason configure.ac has 2.x hard-coded. > > -- > > Kaleb > > > ______________________________**_________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/**mailman/listinfo/gluster-devel > Thanks, -Venky -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From manu at netbsd.org Mon May 14 15:45:48 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 15:45:48 +0000 Subject: [Gluster-devel] python version In-Reply-To: References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: <20120514154548.GB3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 09:01:56PM +0530, Venky Shankar wrote: > python 2.x is required by geo-replication. Although geo-replication is code > ready for python 3.x, it's not functionally tested with it. That's the > reason configure.ac has 2.x hard-coded. Well, my problem is that python 2.5, python 2.6 and python 2.7 are not detected by configure. One need to patch configure in order to build with python 2.x (x > 4) installed. -- Emmanuel Dreyfus manu at netbsd.org From kkeithle at redhat.com Mon May 14 16:30:12 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 12:30:12 -0400 Subject: [Gluster-devel] python version In-Reply-To: <20120514154548.GB3985@homeworld.netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> <20120514154548.GB3985@homeworld.netbsd.org> Message-ID: <4FB13314.3060708@redhat.com> On 05/14/2012 11:45 AM, Emmanuel Dreyfus wrote: > On Mon, May 14, 2012 at 09:01:56PM +0530, Venky Shankar wrote: >> python 2.x is required by geo-replication. Although geo-replication is code >> ready for python 3.x, it's not functionally tested with it. That's the >> reason configure.ac has 2.x hard-coded. > > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > detected by configure. One need to patch configure in order to build > with python 2.x (x> 4) installed. > Seems like it would be easier to get autoconf and automake from the NetBSD packages and just run `./autogen.sh && ./configure` (Which, FWIW, is how glusterfs RPMs are built for the Fedora distributions. I'd wager for much the same reason.) -- Kaleb From manu at netbsd.org Mon May 14 18:46:07 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 20:46:07 +0200 Subject: [Gluster-devel] python version In-Reply-To: <4FB13314.3060708@redhat.com> Message-ID: <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Kaleb S. KEITHLEY wrote: > > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > > detected by configure. One need to patch configure in order to build > > with python 2.x (x> 4) installed. > > Seems like it would be easier to get autoconf and automake from the > NetBSD packages and just run `./autogen.sh && ./configure` I prefer patching the configure script. Running autogen introduce build dependencies on perl just to substitute a string on a single line: that's overkill. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From abperiasamy at gmail.com Mon May 14 19:25:20 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Mon, 14 May 2012 12:25:20 -0700 Subject: [Gluster-devel] python version In-Reply-To: <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: On Mon, May 14, 2012 at 11:46 AM, Emmanuel Dreyfus wrote: > Kaleb S. KEITHLEY wrote: > >> > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not >> > detected by configure. One need to patch configure in order to build >> > with python 2.x (x> ?4) installed. 
>> >> Seems like it would be easier to get autoconf and automake from the >> NetBSD packages and just run `./autogen.sh && ./configure` > > I prefer patching the configure script. Running autogen introduce build > dependencies on perl just to substitute a string on a single line: > that's overkill. > Who ever builds from source is required to run autogen.sh to produce env specific configure and build files. "configure" script should not be checked into git repository. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Mon May 14 23:58:18 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 14 May 2012 16:58:18 -0700 Subject: [Gluster-devel] python version In-Reply-To: References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: On Mon, May 14, 2012 at 12:25 PM, Anand Babu Periasamy < abperiasamy at gmail.com> wrote: > On Mon, May 14, 2012 at 11:46 AM, Emmanuel Dreyfus > wrote: > > Kaleb S. KEITHLEY wrote: > > > >> > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > >> > detected by configure. One need to patch configure in order to build > >> > with python 2.x (x> 4) installed. > >> > >> Seems like it would be easier to get autoconf and automake from the > >> NetBSD packages and just run `./autogen.sh && ./configure` > > > > I prefer patching the configure script. Running autogen introduce build > > dependencies on perl just to substitute a string on a single line: > > that's overkill. > > > > Who ever builds from source is required to run autogen.sh to produce > env specific configure and build files. Not quite. That's the whole point of having a configure script in the first place - to detect the environment at build time. One who builds from source should not require to run autogen.sh, just configure should be sufficient. Since configure itself is a generated script, and can possibly have mistakes and requirements change (like the one being discussed), that's when autogen.sh must be used to re-generate configure script. In this case however, the simplest approach would actually be to run autogen.sh till either: a) we upgrade the release build machine to use newer aclocal macros b) qualify geo-replication to work on python 3 and remove the check. Emmanuel, since the problem is not going to be a long lasting one (either of the two should fix your problem), I suggest you find a solution local to you in the interim. Even better, if someone can actually test and qualify geo-replication to work on python 3 it would ease solution "b" sooner. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 15 01:30:21 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 03:30:21 +0200 Subject: [Gluster-devel] python version In-Reply-To: Message-ID: <1kk4971.wh86xo1gypeoiM%manu@netbsd.org> Anand Avati wrote: > a) we upgrade the release build machine to use newer aclocal macros > > b) qualify geo-replication to work on python 3 and remove the check. Solution b is not enough: even if the configure script does not claim a specific version of python, it will still be unable to detect an installed python > 2.4 because it contains that: for am_cv_pathless_PYTHON in python python2 python2.4 python2.3 python2.2 python2.1 python2.0 none; do What about solution c? 
c) Tweak autogen.sh so that it patches generated configure and add the checks for python > 2.4 if they are missing: --- autogen.sh.orig 2012-05-15 03:22:48.000000000 +0200 +++ autogen.sh 2012-05-15 03:24:28.000000000 +0200 @@ -5,4 +5,6 @@ (libtoolize --automake --copy --force || glibtoolize --automake --copy --force) autoconf automake --add-missing --copy --foreign cd argp-standalone;./autogen.sh + +sed 's/for am_cv_pathless_PYTHON in python python2 python2.4/for am_cv_pathless_PYTHON in python python2 python3 python3.2 python3.1 python3.0 python2.7 2.6 python2.5 python2.4/' configure > configure.new && mv configure.new configure -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 15 04:20:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 06:20:29 +0200 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: Message-ID: <1kk4hl3.1qjswd01knbbvqM%manu@netbsd.org> Anand Babu Periasamy wrote: > AF_UNSPEC is should be be taken as IPv4/IPv6. It is named > appropriately. Default should be ipv4. > > I have not tested the patch. I did test it and it fixed the problem at mine. Here it is in gerrit: http://review.gluster.com/#change,3319 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 15 04:27:26 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 06:27:26 +0200 Subject: [Gluster-devel] NetBSD support in 3.3? Message-ID: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> Hi I still have a few pending submissions for NetBSD support in latest sources: http://review.gluster.com/3319 Use inet as default transport http://review.gluster.com/3320 Add missing (base|dir)name_r http://review.gluster.com/3321 NetBSD build fixes I would like to have 3.3 building without too many unintegrated patches on NetBSD. Is it worth working on pushing the changes above or is release-3.3 too close to release to expect such changes to get into it now? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From amarts at redhat.com Tue May 15 05:51:55 2012 From: amarts at redhat.com (Amar Tumballi) Date: Tue, 15 May 2012 11:21:55 +0530 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> Message-ID: <4FB1EEFB.2020509@redhat.com> On 05/15/2012 09:57 AM, Emmanuel Dreyfus wrote: > Hi > > I still have a few pending submissions for NetBSD support in latest > sources: > http://review.gluster.com/3319 Use inet as default transport > http://review.gluster.com/3320 Add missing (base|dir)name_r > http://review.gluster.com/3321 NetBSD build fixes > > I would like to have 3.3 building without too many unintegrated patches > on NetBSD. Is it worth working on pushing the changes above or is > release-3.3 too close to release to expect such changes to get into it > now? > Emmanuel, I understand your concerns, but I suspect we are very close to 3.3.0 release at this point of time, and hence it may be tight for taking these patches in. What we are planing is for a quicker 3.3.1 depending on the community feedback of 3.3.0 release, which should surely have your patches included. Hope that makes sense. Regards, Amar From manu at netbsd.org Tue May 15 10:13:07 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 10:13:07 +0000 Subject: [Gluster-devel] NetBSD support in 3.3? 
In-Reply-To: <4FB1EEFB.2020509@redhat.com> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> <4FB1EEFB.2020509@redhat.com> Message-ID: <20120515101307.GD3985@homeworld.netbsd.org> On Tue, May 15, 2012 at 11:21:55AM +0530, Amar Tumballi wrote: > I understand your concerns, but I suspect we are very close to 3.3.0 > release at this point of time, and hence it may be tight for taking > these patches in. Riht, I will therefore not request pullups to release-3.3 for theses changes, but I would appreciate if people could review them so that they have a chance to go in master. Will 3.3.1 be based on release-3.3, or will a new branch be forked? -- Emmanuel Dreyfus manu at netbsd.org From vbellur at redhat.com Tue May 15 10:14:38 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 15 May 2012 15:44:38 +0530 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <20120515101307.GD3985@homeworld.netbsd.org> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> <4FB1EEFB.2020509@redhat.com> <20120515101307.GD3985@homeworld.netbsd.org> Message-ID: <4FB22C8E.1@redhat.com> On 05/15/2012 03:43 PM, Emmanuel Dreyfus wrote: > Riht, I will therefore not request pullups to release-3.3 for theses > changes, but I would appreciate if people could review them so that they > have a chance to go in master. > > Will 3.3.1 be based on release-3.3, or will a new branch be forked? All 3.3.x releases will be based on release-3.3. It might be a good idea to rebase these changes to release-3.3 after they have been accepted in master. Vijay From manu at netbsd.org Tue May 15 11:51:36 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 13:51:36 +0200 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <4FB22C8E.1@redhat.com> Message-ID: <1kk51xf.8p0t3l1viyp1mM%manu@netbsd.org> Vijay Bellur wrote: > All 3.3.x releases will be based on release-3.3. It might be a good idea > to rebase these changes to release-3.3 after they have been accepted in > master. But after 3.3 release, as I understand. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ej1515.park at samsung.com Wed May 16 12:23:12 2012 From: ej1515.park at samsung.com (=?euc-kr?B?udrAusHY?=) Date: Wed, 16 May 2012 12:23:12 +0000 (GMT) Subject: [Gluster-devel] Asking about Gluster Performance Factors Message-ID: <0M44007MX7QO1Z40@mailout1.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201205162123598_1LI1H0JV.jpg Type: image/jpeg Size: 72722 bytes Desc: not available URL: From johnmark at redhat.com Wed May 16 14:38:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 10:38:50 -0400 (EDT) Subject: [Gluster-devel] Asking about Gluster Performance Factors In-Reply-To: <0M44007MX7QO1Z40@mailout1.samsung.com> Message-ID: <931185f2-f1b7-431f-96a0-1e7cb476b7d7@zmail01.collab.prod.int.phx2.redhat.com> Hi Ethan, ----- Original Message ----- > Dear Gluster Dev Team : > I'm Ethan, Assistant engineer in Samsung electronics. Reviewing your > paper, I have some questions of performance factors in gluster. Which paper? Can you provide a link? Also, please note that this is a community mailing list, and we cannot guarantee quick response times here - if you need a fast response, I'm happy to put you through to the right people. Thanks, John Mark Walker Gluster Community Guy > First, what does it mean the option "performance.cache-*"? Does it > mean read cache? 
If does, what's difference between the options > "prformance.cache-max-file-size" and "performance.cache-size" ? > I read your another paper("performance in a gluster system, versions > 3.1.x") and it says as below on Page 12, > (Gluster Native protocol does not implement write caching, as we > believe that the modest performance improvements from rite caching > do not justify the risk of cache coherency issues.) > Second, how much is the read throughput improved as configuring 2-way > replication? we need any statistics or something like that. > ("performance in a gluster system, versions 3.1.x") and it says as > below on Page 12, > (However, read throughput is generally improved by replication, as > reads can be delivered from either storage node) > I would ask you to return ASAP. From johnmark at redhat.com Wed May 16 15:56:32 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 11:56:32 -0400 (EDT) Subject: [Gluster-devel] Reminder: community.gluster.org In-Reply-To: <4b117086-34aa-4d8b-aede-ffae2e3abfbd@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1bb98699-b028-4f92-b8fd-603056aef57c@zmail01.collab.prod.int.phx2.redhat.com> Greetings all, Just a friendly reminder that we could use your help on community.gluster.org (hereafter 'c.g.o'). Someday in the near future, we will have 2-way synchronization between our mailing lists and c.g.o, but as of now, there are 2 places to ask and answer questions. I ask that for things with definite answers, even if they start out here on the mailing lists, please provide the question and answer on c.g.o. For lengthy conversations about using or developing GlusterFS, including ideas for new ideas, roadmaps, etc., the mailing lists are ideal for that. Why do we prefer c.g.o? Because it's Google-friendly :) So, if you see any existing questions over there that you are qualified to answer, please do weigh in with an answer. And as always, for quick "real-time" help, you're best served by visiting #gluster on the freenode IRC network. This has been a public service announcement from your friendly community guy. -JM From ndevos at redhat.com Wed May 16 19:56:04 2012 From: ndevos at redhat.com (Niels de Vos) Date: Wed, 16 May 2012 21:56:04 +0200 Subject: [Gluster-devel] Updated Wireshark packages for RHEL-6 and Fedora-17 available for testing Message-ID: <4FB40654.60703@redhat.com> Hi all, today I have merged support for GlusterFS 3.2 and 3.3 into one Wireshark 'dissector'. The packages with date 20120516 in the version support both the current stable 3.2.x version, and the latest 3.3.0qa41. Older 3.3.0 versions will likely have issues due to some changes in the RPC-AUTH protocol used. Updating to the latest qa41 release (or newer) is recommended anyway. I do not expect that we'll add support for earlier 3.3.0 releases. My repository with packages for RHEL-6 and Fedora-17 contains a .repo file for yum (save it in /etc/yum.repos.d): - http://repos.fedorapeople.org/repos/devos/wireshark-gluster/ RPMs for other Fedora or RHEL versions can be provided on request. Let me know if you need an other version (or architecture). Single patches for some different Wireshark versions are available from https://github.com/nixpanic/gluster-wireshark. A full history of commits can be found here: - https://github.com/nixpanic/gluster-wireshark-1.4/commits/master/ (Support for GlusterFS 3.3 was added by Akhila and Shree, thanks!) 
Please test and report success and problems, file a issues on github: https://github.com/nixpanic/gluster-wireshark-1.4/issues Some functionality is still missing, but with the current status, it should be good for most analysing already. With more issues filed, it makes it easier to track what items are important. Of course, you can also respond to this email and give feedback :-) After some more cleanup of the code, this dissector will be passed on for review and inclusion in the upstream Wireshark project. Some more testing results is therefore much appreciated. Thanks, Niels From johnmark at redhat.com Wed May 16 21:12:41 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 17:12:41 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! For GlusterFS 3.3 Beta 4 In-Reply-To: Message-ID: <5456de9c-6c8b-4995-ad1e-720c9c52c74f@zmail01.collab.prod.int.phx2.redhat.com> Greetings, We are planning to have one more beta release tomorrow. If all goes as planned, this will be the release candidate. In conjunction with the beta, I thought we should have a 24-hour GlusterFest, starting tomorrow at 8pm - http://www.gluster.org/community/documentation/index.php/GlusterFest 'What's a GlusterFest?' you may be asking. Well, it's all of the below: - Testing the software. Install the new beta (when it's released tomorrow) and put it through its paces. We will put some basic testing procedures on the GlusterFest page here - http://www.gluster.org/community/documentation/index.php/GlusterFest - Feel free to create your own testing procedures and link to it from the GlusterFest page - Finding bugs. See the current list of bugs targeted for this release: http://bit.ly/beta4bugs - Fixing bugs. If you're the kind of person who wants to submit patches, see our development workflow doc: http://www.gluster.org/community/documentation/index.php/Development_Work_Flow - and then get to know Gerritt: http://review.gluster.com/ The GlusterFest page will be updated with some basic testing procedures tomorrow, and GlusterFest will officially begin at 8pm PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. If you need assistance, see #gluster on Freenode for "real-time" questions, gluster-users and community.gluster.org for general usage questions, and gluster-devel for anything related to building, patching, and bug-fixing. To keep up with GlusterFest activity, I'll be sending updates from the @glusterorg account on Twitter, and I'm sure there will be traffic on the mailing lists, as well. Happy testing and bug-hunting! -JM From ej1515.park at samsung.com Thu May 17 01:08:50 2012 From: ej1515.park at samsung.com (=?euc-kr?B?udrAusHY?=) Date: Thu, 17 May 2012 01:08:50 +0000 (GMT) Subject: [Gluster-devel] Asking about Gluster Performance Factors Message-ID: <0M4500FX676Q1150@mailout4.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201205171008201_QKNMBDIF.jpg Type: image/jpeg Size: 72722 bytes Desc: not available URL: From johnmark at redhat.com Thu May 17 04:28:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 00:28:50 -0400 (EDT) Subject: [Gluster-devel] Fwd: Asking about Gluster Performance Factors In-Reply-To: Message-ID: <153525d7-fe8c-4f5c-aa06-097fcb4b0980@zmail01.collab.prod.int.phx2.redhat.com> See response below from Ben England. Also, note that this question should probably go in gluster-users. 
-JM ----- Forwarded Message ----- From: "Ben England" To: "John Mark Walker" Sent: Wednesday, May 16, 2012 8:23:30 AM Subject: Re: [Gluster-devel] Asking about Gluster Performance Factors JM, see comments marked with ben>>> below. ----- Original Message ----- From: "???" To: gluster-devel at nongnu.org Sent: Wednesday, May 16, 2012 5:23:12 AM Subject: [Gluster-devel] Asking about Gluster Performance Factors Samsung Enterprise Portal mySingle May 16, 2012 Dear Gluster Dev Team : I'm Ethan, Assistant engineer in Samsung electronics. Reviewing your paper, I have some questions of performance factors in gluster. First, what does it mean the option "performance.cache-*"? Does it mean read cache? If does, what's difference between the options "prformance.cache-max-file-size" and "performance.cache-size" ? I read your another paper("performance in a gluster system, versions 3.1.x") and it says as below on Page 12, (Gluster Native protocol does not implement write caching, as we believe that the modest performance improvements from rite caching do not justify the risk of cache coherency issues.) ben>>> While gluster processes do not implement write caching internally, there are at least 3 ways to improve write performance in a Gluster system. - If you use a RAID controller with a non-volatile writeback cache, the RAID controller can buffer writes on behalf of the Gluster server and thereby reduce latency. - XFS or any other local filesystem used within the server "bricks" can do "write-thru" caching, meaning that the writes can be aggregated and can be kept in the Linux buffer cache so that subsequent read requests can be satisfied from this cache, transparent to Gluster processes. - there is a "write-behind" translator in the native client that will aggregate small sequential write requests at the FUSE layer into larger network-level write requests. If the smallest possible application I/O size is a requirement, sequential writes can also be efficiently aggregated by an NFS client. Second, how much is the read throughput improved as configuring 2-way replication? we need any statistics or something like that. ("performance in a gluster system, versions 3.1.x") and it says as below on Page 12, (However, read throughput is generally improved by replication, as reads can be delivered from either storage node) ben>>> Yes, reads can be satisfied by either server in a replication pair. Since the gluster native client only reads one of the two replicas, read performance should be approximately the same for 2-replica file system as it would be for a 1-replica file system. The difference in performance is with writes, as you would expect. Sincerely yours, Ethan Eunjun Park Assistant Engineer, Solution Development Team, Media Solution Center 416, Maetan 3-dong, Yeongtong-gu, Suwon-si, Gyeonggi-do 443-742, Korea Mobile : 010-8609-9532 E-mail : ej1515.park at samsung.com http://www.samsung.com/sec _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From johnmark at redhat.com Thu May 17 06:35:10 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 02:35:10 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: Message-ID: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? -JM ----- Forwarded Message ----- From: "Kaushal M (Code Review)" Sent: Wednesday, May 16, 2012 11:32:26 PM Subject: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() Kaushal M has uploaded a new change for review. Change subject: client/protocol : Changes in client3_1_getxattr() ...................................................................... client/protocol : Changes in client3_1_getxattr() Backporting change 1d02db63ae from master. Copy args->loc to local->loc in client3_1_getxattr(). This prevents logs with "(null) (--)" in client3_1_getxattr_cbk(). Also save args->name in local->name and print it in the log as well. BUG: 812199 Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Signed-off-by: Kaushal M --- M xlators/protocol/client/src/client-helpers.c M xlators/protocol/client/src/client.h M xlators/protocol/client/src/client3_1-fops.c 3 files changed, 11 insertions(+), 2 deletions(-) git pull ssh://*/glusterfs refs/changes/50/3350/1 -- To view, visit http://review.gluster.com/3350 To unsubscribe, visit http://review.gluster.com/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Gerrit-PatchSet: 1 Gerrit-Project: glusterfs Gerrit-Branch: release-3.3 Gerrit-Owner: Kaushal M From rajesh at redhat.com Thu May 17 06:42:56 2012 From: rajesh at redhat.com (Rajesh Amaravathi) Date: Thu, 17 May 2012 02:42:56 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: +1 Regards, Rajesh Amaravathi, Software Engineer, GlusterFS RedHat Inc. ----- Original Message ----- From: "John Mark Walker" To: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 12:05:10 PM Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? -JM ----- Forwarded Message ----- From: "Kaushal M (Code Review)" Sent: Wednesday, May 16, 2012 11:32:26 PM Subject: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() Kaushal M has uploaded a new change for review. Change subject: client/protocol : Changes in client3_1_getxattr() ...................................................................... client/protocol : Changes in client3_1_getxattr() Backporting change 1d02db63ae from master. Copy args->loc to local->loc in client3_1_getxattr(). This prevents logs with "(null) (--)" in client3_1_getxattr_cbk(). Also save args->name in local->name and print it in the log as well. 
BUG: 812199 Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Signed-off-by: Kaushal M --- M xlators/protocol/client/src/client-helpers.c M xlators/protocol/client/src/client.h M xlators/protocol/client/src/client3_1-fops.c 3 files changed, 11 insertions(+), 2 deletions(-) git pull ssh://*/glusterfs refs/changes/50/3350/1 -- To view, visit http://review.gluster.com/3350 To unsubscribe, visit http://review.gluster.com/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Gerrit-PatchSet: 1 Gerrit-Project: glusterfs Gerrit-Branch: release-3.3 Gerrit-Owner: Kaushal M _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From vijay at gluster.com Thu May 17 06:55:42 2012 From: vijay at gluster.com (Vijay Bellur) Date: Thu, 17 May 2012 12:25:42 +0530 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> References: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FB4A0EE.40102@gluster.com> On 05/17/2012 12:05 PM, John Mark Walker wrote: > I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? Gerrit automatically sends out a notification to all registered users who are watching the project. Do we need an additional notification to gluster-devel if there's a considerable overlap between registered users of gluster-devel and gerrit? -Vijay From johnmark at redhat.com Thu May 17 07:26:23 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 03:26:23 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4A0EE.40102@gluster.com> Message-ID: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> There are close to 600 people now subscribed to gluster-devel - how many of them actually have an account on Gerritt? I honestly have no idea. Another thing this would do is send a subtle message to subscribers that this is not the place to discuss user issues, but perhaps there are better ways to do that. I've seen many projects do this - as well as send all bugzilla and github notifications, but I could also see some people getting annoyed. -JM ----- Original Message ----- > On 05/17/2012 12:05 PM, John Mark Walker wrote: > > I was thinking about sending these gerritt notifications to > > gluster-devel by default - what do y'all think? > > Gerrit automatically sends out a notification to all registered users > who are watching the project. Do we need an additional notification > to > gluster-devel if there's a considerable overlap between registered > users > of gluster-devel and gerrit? 
> > > -Vijay > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From ashetty at redhat.com Thu May 17 07:35:27 2012 From: ashetty at redhat.com (Anush Shetty) Date: Thu, 17 May 2012 13:05:27 +0530 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> References: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FB4AA3F.1090700@redhat.com> On 05/17/2012 12:56 PM, John Mark Walker wrote: > There are close to 600 people now subscribed to gluster-devel - how many of them actually have an account on Gerritt? I honestly have no idea. Another thing this would do is send a subtle message to subscribers that this is not the place to discuss user issues, but perhaps there are better ways to do that. > > I've seen many projects do this - as well as send all bugzilla and github notifications, but I could also see some people getting annoyed. > How about a weekly digest of the same. - Anush From manu at netbsd.org Thu May 17 09:02:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 17 May 2012 11:02:32 +0200 Subject: [Gluster-devel] Crashes with latest git code Message-ID: <1kk8js0.b6kp732ejixeM%manu@netbsd.org> Hi I get a lot of crashes on NetBSD with latest git code. looking at core dumps, it is obvious I get memory corruption, as I find various structure overwritten by texts (file path or content). Linking with electric fence produces a much earlier crash, always at the same place. Here is how it looks: Program terminated with signal 11, Segmentation fault. #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 (gdb) bt #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 #1 0xbaa5ec1e in gf_strdup (src=0x0) at ../../../../libglusterfs/src/mem-pool.h:119 #2 0xbaa76dbf in client3_1_getxattr (frame=0xbb77f5c0, this=0xba3cd000, data=0xbfbfe18c) at client3_1-fops.c:4641 #3 0xbaa59ab8 in client_getxattr (frame=0xbb77f5c0, this=0xba3cd000, loc=0xb9402dd0, name=0x0, xdata=0x0) at client.c:1452 #4 0xb9ac3c7d in afr_sh_metadata_sync_prepare (frame=0xba8026bc, this=0xba3ce000) at afr-self-heal-metadata.c:419 #5 0xb9ac428b in afr_sh_metadata_fix (frame=0xba8026bc, this=0xba3ce000, op_ret=0, op_errno=0) at afr-self-heal-metadata.c:522 #6 0xb9abeb2b in afr_sh_common_lookup_cbk (frame=0xba8026bc, cookie=0x1, this=0xba3ce000, op_ret=0, op_errno=0, inode=0xb8b001a0, buf=0xbfbfe424, xattr=0xba401394, postparent=0xbfbfe3bc) at afr-self-heal-common.c:1311 #7 0xbaa6dc10 in client3_1_lookup_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77f550) at client3_1-fops.c:2636 Frame 4 is this: STACK_WIND (frame, afr_sh_metadata_getxattr_cbk, priv->children[source], priv->children[source]->fops->getxattr, &local->loc, NULL, NULL); Then in frame 3, I get args.name = NULL client_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc, const char *name, dict_t *xdata) (...) args.name = name; (...) ret = proc->fn (frame, this, &args); In frame 2, args->name = NULL client3_1_getxattr (call_frame_t *frame, xlator_t *this, void *data) (...) args = data; (...) local->name = gf_strdup (args->name); And there we will crash in gf_strdup(). The root cause is afr_sh_metadata_sync_prepare() calling client_getxattr with NULL arguments. 
The fix is beyond my knowledge of glusterfs internals, but I am sure that some folks here will be able to comment. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From rabhat at redhat.com Thu May 17 09:11:55 2012 From: rabhat at redhat.com (Raghavendra Bhat) Date: Thu, 17 May 2012 05:11:55 -0400 (EDT) Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <1kk8js0.b6kp732ejixeM%manu@netbsd.org> Message-ID: <3044c1d3-8b15-4a9d-9d18-7343cf8a33f4@zmail01.collab.prod.int.phx2.redhat.com> Hi Emmanuel, A bug has already been filed for this (822385) and patch has been sent for the review (http://review.gluster.com/#change,3353). Regards, Raghavendra Bhat ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 2:32:32 PM Subject: [Gluster-devel] Crashes with latest git code Hi I get a lot of crashes on NetBSD with latest git code. looking at core dumps, it is obvious I get memory corruption, as I find various structure overwritten by texts (file path or content). Linking with electric fence produces a much earlier crash, always at the same place. Here is how it looks: Program terminated with signal 11, Segmentation fault. #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 (gdb) bt #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 #1 0xbaa5ec1e in gf_strdup (src=0x0) at ../../../../libglusterfs/src/mem-pool.h:119 #2 0xbaa76dbf in client3_1_getxattr (frame=0xbb77f5c0, this=0xba3cd000, data=0xbfbfe18c) at client3_1-fops.c:4641 #3 0xbaa59ab8 in client_getxattr (frame=0xbb77f5c0, this=0xba3cd000, loc=0xb9402dd0, name=0x0, xdata=0x0) at client.c:1452 #4 0xb9ac3c7d in afr_sh_metadata_sync_prepare (frame=0xba8026bc, this=0xba3ce000) at afr-self-heal-metadata.c:419 #5 0xb9ac428b in afr_sh_metadata_fix (frame=0xba8026bc, this=0xba3ce000, op_ret=0, op_errno=0) at afr-self-heal-metadata.c:522 #6 0xb9abeb2b in afr_sh_common_lookup_cbk (frame=0xba8026bc, cookie=0x1, this=0xba3ce000, op_ret=0, op_errno=0, inode=0xb8b001a0, buf=0xbfbfe424, xattr=0xba401394, postparent=0xbfbfe3bc) at afr-self-heal-common.c:1311 #7 0xbaa6dc10 in client3_1_lookup_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77f550) at client3_1-fops.c:2636 Frame 4 is this: STACK_WIND (frame, afr_sh_metadata_getxattr_cbk, priv->children[source], priv->children[source]->fops->getxattr, &local->loc, NULL, NULL); Then in frame 3, I get args.name = NULL client_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc, const char *name, dict_t *xdata) (...) args.name = name; (...) ret = proc->fn (frame, this, &args); In frame 2, args->name = NULL client3_1_getxattr (call_frame_t *frame, xlator_t *this, void *data) (...) args = data; (...) local->name = gf_strdup (args->name); And there we will crash in gf_strdup(). The root cause is afr_sh_metadata_sync_prepare() calling client_getxattr with NULL arguments. The fix is beyond my knowledge of glusterfs internals, but I am sure that some folks here will be able to comment. 
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From manu at netbsd.org Thu May 17 09:18:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 17 May 2012 11:18:29 +0200 Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <3044c1d3-8b15-4a9d-9d18-7343cf8a33f4@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kk8kq6.1m2qnzf1so7qgfM%manu@netbsd.org> Raghavendra Bhat wrote: > A bug has already been filed for this (822385) and patch has been sent for > the review (http://review.gluster.com/#change,3353). I looked at the patch, it does not fix the problem I reported: args->name is still NULL. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From rabhat at redhat.com Thu May 17 09:46:20 2012 From: rabhat at redhat.com (Raghavendra Bhat) Date: Thu, 17 May 2012 05:46:20 -0400 (EDT) Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <1kk8kq6.1m2qnzf1so7qgfM%manu@netbsd.org> Message-ID: In getxattr, a NULL name means it is equivalent to listxattr, so args->name being NULL is ok. The process was crashing because it tried to strdup (actually strlen, inside gf_strdup) a NULL string pointer. On the wire we send it as an empty string with namelen set to 0, and protocol/server understands that. On client side:
req.name = (char *)args->name;
if (!req.name) {
        req.name = "";
        req.namelen = 0;
}
On server side:
if (args.namelen)
        state->name = gf_strdup (args.name);
----- Original Message ----- From: "Emmanuel Dreyfus" To: "Raghavendra Bhat" Cc: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 2:48:29 PM Subject: Re: [Gluster-devel] Crashes with latest git code Raghavendra Bhat wrote: > A bug has already been filed for this (822385) and patch has been sent for > the review (http://review.gluster.com/#change,3353). I looked at the patch, it does not fix the problem I reported: args->name is still NULL. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From jdarcy at redhat.com Thu May 17 11:47:52 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 17 May 2012 07:47:52 -0400 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4AA3F.1090700@redhat.com> References: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> <4FB4AA3F.1090700@redhat.com> Message-ID: <4FB4E568.8050601@redhat.com> On 05/17/2012 03:35 AM, Anush Shetty wrote: > > On 05/17/2012 12:56 PM, John Mark Walker wrote: >> There are close to 600 people now subscribed to gluster-devel - how many >> of them actually have an account on Gerritt? I honestly have no idea. >> Another thing this would do is send a subtle message to subscribers that >> this is not the place to discuss user issues, but perhaps there are better >> ways to do that. >> >> I've seen many projects do this - as well as send all bugzilla and github >> notifications, but I could also see some people getting annoyed. > > How about a weekly digest of the same. Excellent idea. 
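A short self-contained illustration of the convention Raghavendra describes in the getxattr thread above may help readers following along: a NULL name means "list all xattrs", it is encoded on the wire as an empty string with namelen 0, and the server only duplicates the name when namelen is non-zero. The structure and function names below are invented for the example; this is not the actual GlusterFS RPC code.

/* Sketch of the "NULL name == listxattr" convention; illustrative types only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct getxattr_req {
        const char *name;       /* "" on the wire when listing all xattrs */
        int         namelen;
};

/* Client side: a NULL name is sent as an empty string with namelen 0,
 * so nothing ever calls strlen()/strdup() on a NULL pointer. */
static void encode_request (struct getxattr_req *req, const char *name)
{
        req->name = name;
        if (!req->name) {
                req->name = "";
                req->namelen = 0;
        } else {
                req->namelen = (int) strlen (req->name);
        }
}

/* Server side: only duplicate the name when namelen is non-zero. */
static char *decode_request (const struct getxattr_req *req)
{
        if (req->namelen)
                return strdup (req->name);
        return NULL;    /* listxattr-style request: no single name */
}

int main (void)
{
        struct getxattr_req req;
        char *name;

        encode_request (&req, NULL);    /* the "no name" case from the crash */
        name = decode_request (&req);
        printf ("decoded name: %s\n", name ? name : "(none: list all xattrs)");
        free (name);
        return 0;
}

Compiled and run, this prints the "no single name" case, which is the same NULL name that the backtrace above showed reaching gf_strdup() before the wire encoding was applied.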
From johnmark at redhat.com Thu May 17 16:15:59 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 12:15:59 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4E568.8050601@redhat.com> Message-ID: ----- Original Message ----- > On 05/17/2012 03:35 AM, Anush Shetty wrote: > > > > How about a weekly digest of the same. Sounds reasonable. Now we just have to figure out how to implement :) -JM From vijay at build.gluster.com Thu May 17 16:51:43 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Thu, 17 May 2012 09:51:43 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0beta4 released Message-ID: <20120517165144.1BB041803EB@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0beta4/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0beta4.tar.gz This release is made off From johnmark at redhat.com Thu May 17 18:08:01 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 14:08:01 -0400 (EDT) Subject: [Gluster-devel] glusterfs-3.3.0beta4 released In-Reply-To: <20120517165144.1BB041803EB@build.gluster.com> Message-ID: <864fe250-bfd3-49ca-9310-2fc601411b83@zmail01.collab.prod.int.phx2.redhat.com> Reminder: GlusterFS 3.3 has been branched on GitHub, so you can pull the latest code from this branch if you want to test new fixes after the beta was released: https://github.com/gluster/glusterfs/tree/release-3.3 Also, note that this release features a license change in some files. We noted that some developers could not contribute code to the project because of compatibility issues around GPLv3. So, as a compromise, we changed the licensing in files that we deemed client-specific to allow for more contributors and a stronger developer community. Those files are now dual-licensed under the LGPLv3 and the GPLv2. For text of both of these license, see these URLs: http://www.gnu.org/licenses/lgpl.html http://www.gnu.org/licenses/old-licenses/gpl-2.0.html To see the list of files we modified with the new licensing, see this patchset from Kaleb: http://review.gluster.com/#change,3304 If you have questions or comments about this change, please do reach out to me. Thanks, John Mark ----- Original Message ----- > > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0beta4/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0beta4.tar.gz > > This release is made off > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From johnmark at redhat.com Thu May 17 20:34:56 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 16:34:56 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! For GlusterFS 3.3 Beta 4 In-Reply-To: <5456de9c-6c8b-4995-ad1e-720c9c52c74f@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <88ecc073-688f-4edc-8ff3-ccba3b6142a3@zmail01.collab.prod.int.phx2.redhat.com> An update: Kaleb was kind enough to port his HekaFS testing page for Fedora to GlusterFS. If you're looking for a series of things to test, see this URL: http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests By tonight, I'll have a handy form for reporting your results. We are at T-6:30 hours and counting until GlusterFest begins in earnest. 
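For list readers who want a quick sanity check of the beta while waiting for the detailed procedures, a minimal single-node run could look roughly like the sketch below. The host name, brick path and volume name are placeholders, and the steps published on the test page take precedence over this outline.

# Build and install the beta (names and paths are examples).
tar xzf glusterfs-3.3.0beta4.tar.gz
cd glusterfs-3.3.0beta4
./configure && make && sudo make install

# Start the management daemon, then create and start a single-brick volume.
sudo glusterd
sudo mkdir -p /export/brick1
sudo gluster volume create testvol myhost:/export/brick1
sudo gluster volume start testvol
sudo gluster volume info testvol

# Mount it with the native client and do a trivial read/write check.
sudo mkdir -p /mnt/testvol
sudo mount -t glusterfs myhost:/testvol /mnt/testvol
echo "glusterfest smoke test" | sudo tee /mnt/testvol/smoke.txt
cat /mnt/testvol/smoke.txt
sudo umount /mnt/testvol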
For all updates related to GlusterFest, see this page: http://www.gluster.org/community/documentation/index.php/GlusterFest Please do post any series of tests that you would like to run. In particular, we're looking to test some of the new features of GlusterFS 3.3: - Object storage - HDFS compatibility library - Granular locking - More proactive self-heal Happy hacking, JM ----- Original Message ----- > Greetings, > > We are planning to have one more beta release tomorrow. If all goes > as planned, this will be the release candidate. In conjunction with > the beta, I thought we should have a 24-hour GlusterFest, starting > tomorrow at 8pm - > http://www.gluster.org/community/documentation/index.php/GlusterFest > > 'What's a GlusterFest?' you may be asking. Well, it's all of the > below: > > > - Testing the software. Install the new beta (when it's released > tomorrow) and put it through its paces. We will put some basic > testing procedures on the GlusterFest page here - > http://www.gluster.org/community/documentation/index.php/GlusterFest > > - Feel free to create your own testing procedures and link to it > from the GlusterFest page > > > - Finding bugs. See the current list of bugs targeted for this > release: http://bit.ly/beta4bugs > > > - Fixing bugs. If you're the kind of person who wants to submit > patches, see our development workflow doc: > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > > - and then get to know Gerritt: http://review.gluster.com/ > > > The GlusterFest page will be updated with some basic testing > procedures tomorrow, and GlusterFest will officially begin at 8pm > PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup > tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. > > > If you need assistance, see #gluster on Freenode for "real-time" > questions, gluster-users and community.gluster.org for general usage > questions, and gluster-devel for anything related to building, > patching, and bug-fixing. > > > To keep up with GlusterFest activity, I'll be sending updates from > the @glusterorg account on Twitter, and I'm sure there will be > traffic on the mailing lists, as well. > > > Happy testing and bug-hunting! > > -JM > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From manu at netbsd.org Fri May 18 07:49:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 18 May 2012 07:49:29 +0000 Subject: [Gluster-devel] python version In-Reply-To: References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: <20120518074929.GJ3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 04:58:18PM -0700, Anand Avati wrote: > Emmanuel, since the problem is not going to be a long lasting one (either > of the two should fix your problem), I suggest you find a solution local to > you in the interim. I submitted a tiny hack that solves the problem for everyone until automake is upgraded on glusterfs build system: http://review.gluster.com/3360 -- Emmanuel Dreyfus manu at netbsd.org From johnmark at redhat.com Fri May 18 15:02:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Fri, 18 May 2012 11:02:50 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! 
For GlusterFS 3.3 Beta 4 In-Reply-To: <88ecc073-688f-4edc-8ff3-ccba3b6142a3@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: Looks like we have a few testers who have reported their results already: http://www.gluster.org/community/documentation/index.php/GlusterFest 12 more hours! -JM ----- Original Message ----- > An update: > > Kaleb was kind enough to port his HekaFS testing page for Fedora to > GlusterFS. If you're looking for a series of things to test, see > this URL: > http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests > > > By tonight, I'll have a handy form for reporting your results. We are > at T-6:30 hours and counting until GlusterFest begins in earnest. > For all updates related to GlusterFest, see this page: > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > Please do post any series of tests that you would like to run. In > particular, we're looking to test some of the new features of > GlusterFS 3.3: > > - Object storage > - HDFS compatibility library > - Granular locking > - More proactive self-heal > > > Happy hacking, > JM > > > ----- Original Message ----- > > Greetings, > > > > We are planning to have one more beta release tomorrow. If all goes > > as planned, this will be the release candidate. In conjunction with > > the beta, I thought we should have a 24-hour GlusterFest, starting > > tomorrow at 8pm - > > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > > 'What's a GlusterFest?' you may be asking. Well, it's all of the > > below: > > > > > > - Testing the software. Install the new beta (when it's released > > tomorrow) and put it through its paces. We will put some basic > > testing procedures on the GlusterFest page here - > > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > > - Feel free to create your own testing procedures and link to it > > from the GlusterFest page > > > > > > - Finding bugs. See the current list of bugs targeted for this > > release: http://bit.ly/beta4bugs > > > > > > - Fixing bugs. If you're the kind of person who wants to submit > > patches, see our development workflow doc: > > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > > > > - and then get to know Gerritt: http://review.gluster.com/ > > > > > > The GlusterFest page will be updated with some basic testing > > procedures tomorrow, and GlusterFest will officially begin at 8pm > > PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup > > tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. > > > > > > If you need assistance, see #gluster on Freenode for "real-time" > > questions, gluster-users and community.gluster.org for general > > usage > > questions, and gluster-devel for anything related to building, > > patching, and bug-fixing. > > > > > > To keep up with GlusterFest activity, I'll be sending updates from > > the @glusterorg account on Twitter, and I'm sure there will be > > traffic on the mailing lists, as well. > > > > > > Happy testing and bug-hunting! 
> > > > -JM > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > > From manu at netbsd.org Fri May 18 16:15:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 18 May 2012 16:15:20 +0000 Subject: [Gluster-devel] memory corruption in release-3.3 Message-ID: <20120518161520.GL3985@homeworld.netbsd.org> Hi I still get crashes caused by memory corruption with latest release-3.3. My test case is a rm -Rf on a large tree. It seems I crash in two places: First crash flavor (trav is sometimes unmapped memory, sometimes NULL) #0 0xbbbb60ad in __iobuf_select_arena (iobuf_pool=0xbb70d400, page_size=128) at iobuf.c:453 453 if (trav->passive_cnt) { (gdb) print trav $1 = (struct iobuf_arena *) 0x414d202c (gdb) bt #0 0xbbbb60ad in __iobuf_select_arena (iobuf_pool=0xbb70d400, page_size=128) at iobuf.c:453 #1 0xbbbb655a in iobuf_get2 (iobuf_pool=0xbb70d400, page_size=24) at iobuf.c:604 #2 0xbaa549c7 in client_submit_request () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #3 0xbaa732c5 in client3_1_open () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #4 0xbaa574e6 in client_open () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #5 0xb9abac10 in afr_sh_data_open () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #6 0xb9abacb9 in afr_self_heal_data () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #7 0xb9ac2751 in afr_sh_metadata_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #8 0xb9ac457a in afr_self_heal_metadata () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #9 0xb9abd93f in afr_sh_missing_entries_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #10 0xb9ac169b in afr_self_heal () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #11 0xb9ae2e5b in afr_launch_self_heal () #12 0xb9ae3de9 in afr_lookup_perform_self_heal () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #13 0xb9ae4804 in afr_lookup_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #14 0xb9ae4fab in afr_lookup_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #15 0xbaa6dc10 in client3_1_lookup_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #16 0xbbb69716 in rpc_clnt_handle_reply () from /usr/local/lib/libgfrpc.so.0 #17 0xbbb699b3 in rpc_clnt_notify () from /usr/local/lib/libgfrpc.so.0 #18 0xbbb65989 in rpc_transport_notify () from /usr/local/lib/libgfrpc.so.0 #19 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #20 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #21 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #22 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #23 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #24 0x08050078 in main () Second crash flavor (it looks more like a double free) Program terminated with signal 11, Segmentation fault. #0 0xbb92661e in ?? () from /lib/libc.so.12 (gdb) bt #0 0xbb92661e in ?? 
() from /lib/libc.so.12 #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb376f in __gf_free (free_ptr=0xbb70d160) at mem-pool.c:258 #3 0xbbb7e17d in data_destroy (data=0xba301d4c) at dict.c:135 #4 0xbbb7ee18 in data_unref (this=0xba301d4c) at dict.c:470 #5 0xbbb7eb6b in dict_destroy (this=0xba4022d0) at dict.c:395 #6 0xbbb7ecab in dict_unref (this=0xba4022d0) at dict.c:432 #7 0xbaa164ba in __qr_inode_free () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #8 0xbaa27164 in qr_forget () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #9 0xbbb9b221 in __inode_destroy (inode=0xb8b017e4) at inode.c:320 #10 0xbbb9d0a5 in inode_table_prune (table=0xba3cc160) at inode.c:1235 #11 0xbbb9b64e in inode_unref (inode=0xb8b017e4) at inode.c:445 #12 0xbbb85249 in loc_wipe (loc=0xb9402dd0) at xlator.c:530 #13 0xb9ae126e in afr_local_cleanup () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #14 0xb9a9c66b in afr_unlink_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #15 0xb9ad2d5b in afr_unlock_common_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #16 0xb9ad38a2 in afr_unlock_entrylk_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so ---Type to continue, or q to quit--- #17 0xbaa68370 in client3_1_entrylk_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #18 0xbbb69716 in rpc_clnt_handle_reply () from /usr/local/lib/libgfrpc.so.0 #19 0xbbb699b3 in rpc_clnt_notify () from /usr/local/lib/libgfrpc.so.0 #20 0xbbb65989 in rpc_transport_notify () from /usr/local/lib/libgfrpc.so.0 #21 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #22 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #23 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #24 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #25 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #26 0x08050078 in main () (gdb) frame 2 #2 0xbbbb376f in __gf_free (free_ptr=0xbb70d160) at mem-pool.c:258 258 FREE (free_ptr); (gdb) x/1w free_ptr 0xbb70d160: 538978863 -- Emmanuel Dreyfus manu at netbsd.org From amarts at redhat.com Sat May 19 06:15:09 2012 From: amarts at redhat.com (Amar Tumballi) Date: Sat, 19 May 2012 11:45:09 +0530 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <20120518161520.GL3985@homeworld.netbsd.org> References: <20120518161520.GL3985@homeworld.netbsd.org> Message-ID: <4FB73A6D.9050601@redhat.com> On 05/18/2012 09:45 PM, Emmanuel Dreyfus wrote: > Hi > > I still get crashes caused by memory corruption with latest release-3.3. > My test case is a rm -Rf on a large tree. It seems I crash in two places: > Emmanuel, Can you please file bug report? different bugs corresponding to different crash dumps will help us. That helps in tracking development internally. Regards, Amar From manu at netbsd.org Sat May 19 10:29:55 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 12:29:55 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <20120518161520.GL3985@homeworld.netbsd.org> Message-ID: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Second crash flavor (it looks more like a double free) Here it is again at a different place. This is in loc_wipe, where loc->path is free'ed. 
Looking at the code, I see that there are places where loc->path is allocated by gf_strdup(). I see other places where it is copied from another buffer. Since this is done without reference counts, it seems likely that there is a double free somewhere. Opinions? (gdb) bt #0 0xbb92652a in ?? () from /lib/libc.so.12 #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb376f in __gf_free (free_ptr=0xb8250040) at mem-pool.c:258 #3 0xbbb85269 in loc_wipe (loc=0xba4cd010) at xlator.c:534 #4 0xbaa5e68a in client_local_wipe (local=0xba4cd010) at client-helpers.c:125 #5 0xbaa614d5 in client3_1_open_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77fa20) at client3_1-fops.c:421 #6 0xbbb69716 in rpc_clnt_handle_reply (clnt=0xba3c51c0, pollin=0xbb77d220) at rpc-clnt.c:788 #7 0xbbb699b3 in rpc_clnt_notify (trans=0xbb70ec00, mydata=0xba3c51e0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-clnt.c:907 #8 0xbbb65989 in rpc_transport_notify (this=0xbb70ec00, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-transport.c:489 #9 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #10 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #11 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #12 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #13 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #14 0x08050078 in main () -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vijay at build.gluster.com Sat May 19 12:35:21 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Sat, 19 May 2012 05:35:21 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa42 released Message-ID: <20120519123524.842501803FC@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa42/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa42.tar.gz This release is made off v3.3.0qa42 From manu at netbsd.org Sat May 19 13:50:25 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 15:50:25 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Message-ID: <1kkcml0.c7hab41bl4auaM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Looking at the code, I see that there are places where loc->path is > allocated by gf_strdup(). I see other places where it is copied from > another buffer. Since this is done without reference counts, it seems > likely that there is a double free somewhere. Opinions? I added a second argument to gf_strdup() so that the calling function can pass __func__, and I started logging gf_strdup() allocations to track a possible double free. ANd the result is... the offending free() is done on a loc->path that was not allocated by gf_strdup(). Can it be allocated by another function? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 19 15:07:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 17:07:53 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Message-ID: <1kkcpny.16h3fbd1pfhutzM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Looking at the code, I see that there are places where loc->path is > allocated by gf_strdup(). I see other places where it is copied from > another buffer. 
Since this is done without reference counts, it seems > likely that there is a double free somewhere. Opinions? I found a bug: Thou shalt not free(3) memory dirname(3) returned On Linux basename() and dirname() return a pointer with the string passed as argument. On BSD flavors, basename() and dirname() return static storage, or pthread specific storage. Both behaviour are compliant, but calling free on the result in the second case is a bug. --- xlators/cluster/afr/src/afr-dir-write.c.orig 2012-05-19 16:45:30.000000000 +0200 +++ xlators/cluster/afr/src/afr-dir-write.c 2012-05-19 17:03:17.000000000 +0200 @@ -55,14 +55,22 @@ if (op_errno) *op_errno = ENOMEM; goto out; } - parent->path = dirname (child_path); + parent->path = gf_strdup( dirname (child_path) ); + if (!parent->path) { + if (op_errno) + *op_errno = ENOMEM; + goto out; + } parent->inode = inode_ref (child->parent); uuid_copy (parent->gfid, child->pargfid); ret = 0; out: + if (child_path) + GF_FREE(child_path); + return ret; } /* {{{ create */-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 19 17:34:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 19:34:51 +0200 Subject: [Gluster-devel] mkdir race condition Message-ID: <1kkcuhi.1vqgbxy1lxb8w2M%manu@netbsd.org> On a replicated volume, mkdir quickly followed by the rename of a new directory child fails. # rm -Rf test && mkdir test && touch test/a && mv test/a test/b mv: rename test/a to test/b: No such file or directory # rm -Rf test && mkdir test && sleep 1 && touch test/a && mv test/a test/b (it works) Client log: [2012-05-19 18:49:43.933090] W [client3_1-fops.c:327:client3_1_mkdir_cbk] 0-pfs-client-0: remote operation failed: No such file or directory. Path: /test (00000000-0000-0000-0000-000000000000) [2012-05-19 18:49:43.944883] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.946265] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.961028] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.961528] W [fuse-bridge.c:1515:fuse_rename_cbk] 0-glusterfs-fuse: 27: /test/a -> /test/b => -1 (No such file or directory) Server log: [2012-05-19 18:49:58.455280] E [posix-handle.c:412:posix_handle_mkdir_hashes] 0-pfs-posix: error mkdir hash-1 /export/wd3a/.glusterfs/f6/8b (No such file or directory) [2012-05-19 18:49:58.455384] W [posix-handle.c:521:posix_handle_soft] 0-pfs-posix: mkdir /export/wd3a/.glusterfs/f6/8b/f68b2a33-a649-4705-9dfd-40a15f22589a failed (No such file or directory) [2012-05-19 18:49:58.455425] E [posix.c:968:posix_mkdir] 0-pfs-posix: setting gfid on /export/wd3a/test failed [2012-05-19 18:49:58.455558] E [posix.c:1010:posix_mkdir] 0-pfs-posix: post-operation lstat on parent of /export/wd3a/test failed: No such file or directory [2012-05-19 18:49:58.455664] I [server3_1-fops.c:529:server_mkdir_cbk] 0-pfs-server: 41: MKDIR /test (00000000-0000-0000-0000-000000000000) ==> -1 (No such file or directory) [2012-05-19 18:49:58.467548] I [server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 46: ENTRYLK (null) (--) ==> -1 (No such file or directory) [2012-05-19 18:49:58.468990] I [server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 47: ENTRYLK (null) (--) ==> -1 (No such file or directory) [2012-05-19 18:49:58.483726] I 
[server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 51: ENTRYLK (null) (--) ==> -1 (No such file or directory) It says it fails, but it seems it succeeded: silo# getextattr -x trusted.gfid /export/wd3a/test /export/wd3a/test 000 f6 8b 2a 33 a6 49 47 05 9d fd 40 a1 5f 22 58 9a ..*3.IG... at ._"X. Client is release-3.3 from yesterday. Server is master branch from may 14th. Is it a known problem? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 05:36:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 07:36:02 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / Message-ID: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> On release-3.3, glusterd_is_path_in_use() in xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has extended attribute enabled, and makes it impossible to create a volume with bricks from other filesystems (with EA enabled), if / does not support extended attributes. Is it on purpose? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 05:53:35 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 01:53:35 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> Message-ID: <1d6e3018-e614-4273-883c-1cca9efaf0b8@zmail01.collab.prod.int.phx2.redhat.com> Emmanuel, The assumption of EA being enabled in / filesystem or any prefix of brick path is an accidental side-effect of the way glusterd_is_path_in_use() is used in glusterd_brick_create_path(). The error handling should be accommodative to ENOTSUP. In short it is a bug. Will send out a patch immediately. thanks, krish ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 11:06:02 AM Subject: [Gluster-devel] 3.3 requires extended attribute on / On release-3.3, glusterd_is_path_in_use() in xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has extended attribute enabled, and makes it impossible to create a volume with bricks from other filesystems (with EA enabled), if / does not support extended attributes. Is it on purpose? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From manu at netbsd.org Sun May 20 05:56:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 07:56:53 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> Message-ID: <1kkdul5.4vmrbe1owph67M%manu@netbsd.org> Emmanuel Dreyfus wrote: > On release-3.3, glusterd_is_path_in_use() in > xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has > extended attribute enabled, and makes it impossible to create a volume > with bricks from other filesystems (with EA enabled), if / does not > support extended attributes. And even with EA enabled on root, creating a volume loops forever on reading unexistant trusted.gfid and trusted.glusterfs.volume-id on brick's parent directory. It gets ENODATA and retry forever. If I patch the function to just set in_use = 0 and return 0, I can create a volume. 
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ian.latter at midnightcode.org Sun May 20 06:12:39 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 16:12:39 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Hello, Couple of questions that might help make my module a little more sane; 0) Is there any developer docco? I've just done another quick search and I can't see any. Let me know if there is and I'll try and answer the below myself. 1) What is the difference between STACK_WIND and STACK_WIND_COOKIE? I.e. I've only ever used STACK_WIND, when should I use it versus the other? 2) Is there a way to write linearly within a single function within Gluster (or is there a reason why I wouldn't want to do that)? RE 2: This may stem from my lack of understanding of the broader Gluster internals. I am performing multiple fops per fop, which is creating structural inelegances in the code that make me think I'm heading down the wrong rabbit hole. I want to say; read() { // pull in other content while(want more) { _lookup() _open() _read() _close() } return iovec } But the way I've understood the Gluster internal structure is that I need to operate in a chain of related functions; _read_lookup_cbk_open_cbk_read_cbk() { wind _close() } _read_lookup_cbk_open_cbk() { wind _read() add to local->iovec } _lookup_cbk() { wind _open() } read() { while(want more) { wind _lookup() } return local->iovec } Am I missing something - or is there a nicer way of doing this? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From manu at netbsd.org Sun May 20 06:13:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 08:13:32 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1d6e3018-e614-4273-883c-1cca9efaf0b8@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kkdvl3.1p663u6iyul1oM%manu@netbsd.org> Krishnan Parthasarathi wrote: > Will send out a patch immediately. Great :-) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 06:13:33 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 08:13:33 +0200 Subject: [Gluster-devel] mkdir race condition In-Reply-To: <1kkcuhi.1vqgbxy1lxb8w2M%manu@netbsd.org> Message-ID: <1kkdvma.10s8o2rtrmcvpM%manu@netbsd.org> Emmanuel Dreyfus wrote: > On a replicated volume, mkdir quickly followed by the rename of a new > directory child fails. > > # rm -Rf test && mkdir test && touch test/a && mv test/a test/b > mv: rename test/a to test/b: No such file or directory > # rm -Rf test && mkdir test && sleep 1 && touch test/a && mv test/a test/b > (it works) I just reinstalled server from release-3.3 and now things make more sense. Any directory creation will report failure but will succeed: bacasel# mkdir /gfs/manu mkdir: /gfs/manu: No such file or directory bacasel# cd /gfs bacasel# ls manu Server log reports it fails because: [2012-05-20 07:59:23.775789] E [posix-handle.c:412:posix_handle_mkdir_hashes] 0-pfs-posix: error mkdir hash-1 /export/wd3a/.glusterfs/ec/e2 (No such file or directory) It seems posix_handle_mkdir_hashes() attempts to mkdir two directories at once: ec/ec2. How is it supposed to work? Should parent directory be created somewhere else? 
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 06:36:44 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 02:36:44 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdul5.4vmrbe1owph67M%manu@netbsd.org> Message-ID: <80c2c170-133e-4509-9ac5-062293a199ad@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 11:26:53 AM Subject: Re: [Gluster-devel] 3.3 requires extended attribute on / Emmanuel Dreyfus wrote: > On release-3.3, glusterd_is_path_in_use() in > xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has > extended attribute enabled, and makes it impossible to create a volume > with bricks from other filesystems (with EA enabled), if / does not > support extended attributes. > And even with EA enabled on root, creating a volume loops forever on > reading unexistant trusted.gfid and trusted.glusterfs.volume-id on > brick's parent directory. It gets ENODATA and retry forever. If I patch > the function to just set in_use = 0 and return 0, I can create a volume. It is strange that the you see glusterd_path_in_use() loop forever. If I am not wrong, the inner loop checks for presence of trusted.gfid and trusted.glusterfs.volume-id and should exit after that, and the outer loop performs dirname on the path repeatedly and dirname(3) guarantees such an operation should return "/" eventually, which we check. It would be great if you could provide values of local variables, "used" and "curdir" when you see the looping forever. I dont have a setup to check this immediately. thanks, krish -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Sun May 20 06:47:57 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 16:47:57 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205200647.q4K6lvdN009529@singularity.tronunltd.com> > > And I am sick of the word-wrap on this client .. I think > > you've finally convinced me to fix it ... what's normal > > these days - still 80 chars? > > I used to line-wrap (gnus and cool emacs extensions). It doesn't make > sense to line wrap any more. Let the email client handle it depending > on the screen size of the device (mobile / tablet / desktop). FYI found this; an hour of code parsing in the mail software and it turns out that it had no wrapping .. it came from the stupid textarea tag in the browser (wrap="hard"). Same principle (server side coded, non client savvy) - now set to "soft". So hopefully fixed :) Cheers. -- Ian Latter Late night coder .. http://midnightcode.org/ From kparthas at redhat.com Sun May 20 06:54:54 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 02:54:54 -0400 (EDT) Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Message-ID: Couple of questions that might help make my module a little more sane; 0) Is there any developer docco? I've just done another quick search and I can't see any. Let me know if there is and I'll try and answer the below myself. 1) What is the difference between STACK_WIND and STACK_WIND_COOKIE? I.e. 
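A note for readers unfamiliar with the layout in those log lines: in 3.3 each inode gets a handle under <brick>/.glusterfs/ named after its GFID, placed below two nested hash directories taken from the first four hex characters of the GFID (f6/8b for the GFID in the log above, ec/e2 for the failing one). The handle can only be created once both hash levels exist, which is what the failing mkdir is about. The sketch below only illustrates that layout and a parent-first creation order; the names are invented and it is not the actual posix-handle.c code.

/* Illustration of the .glusterfs handle layout; not the real posix-handle.c. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* <brick>/.glusterfs/<aa>/<bb>/<gfid>, where "aa" and "bb" are the first
 * four hex characters of the GFID split two and two. */
static int build_handle_path (const char *brick, const char *gfid,
                              char *buf, size_t len)
{
        int n = snprintf (buf, len, "%s/.glusterfs/%.2s/%.2s/%s",
                          brick, gfid, gfid + 2, gfid);
        return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Create every level parent-first, tolerating EEXIST, so creating the
 * handle itself never fails with ENOENT because a hash dir is missing. */
static int make_handle_dirs (const char *brick, const char *gfid)
{
        char dir[512];

        snprintf (dir, sizeof (dir), "%s/.glusterfs", brick);
        if (mkdir (dir, 0700) == -1 && errno != EEXIST)
                return -1;
        snprintf (dir, sizeof (dir), "%s/.glusterfs/%.2s", brick, gfid);
        if (mkdir (dir, 0700) == -1 && errno != EEXIST)
                return -1;
        snprintf (dir, sizeof (dir), "%s/.glusterfs/%.2s/%.2s",
                  brick, gfid, gfid + 2);
        if (mkdir (dir, 0700) == -1 && errno != EEXIST)
                return -1;
        return 0;
}

int main (void)
{
        const char *gfid = "f68b2a33-a649-4705-9dfd-40a15f22589a";
        char path[512];

        if (make_handle_dirs ("/export/wd3a", gfid) == 0 &&
            build_handle_path ("/export/wd3a", gfid, path, sizeof (path)) == 0)
                printf ("handle: %s\n", path);
        return 0;
}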
I've only ever used STACK_WIND, when should I use it versus the other? STACK_WIND_COOKIE is used when we need to 'tie' the call wound with its corresponding callback. You can see this variant being used extensively in cluster xlators where it is used to identify the callback with the subvolume no. it is coming from. 2) Is there a way to write linearly within a single function within Gluster (or is there a reason why I wouldn't want to do that)? RE 2: This may stem from my lack of understanding of the broader Gluster internals. I am performing multiple fops per fop, which is creating structural inelegances in the code that make me think I'm heading down the wrong rabbit hole. I want to say; read() { // pull in other content while(want more) { _lookup() _open() _read() _close() } return iovec } But the way I've understood the Gluster internal structure is that I need to operate in a chain of related functions; _read_lookup_cbk_open_cbk_read_cbk() { wind _close() } _read_lookup_cbk_open_cbk() { wind _read() add to local->iovec } _lookup_cbk() { wind _open() } read() { while(want more) { wind _lookup() } return local->iovec } Am I missing something - or is there a nicer way of doing this? The above method you are trying to use is the "continuation passing style" that is extensively used in afr-inode-read.c and afr-transaction.c to perform multiple internal fops on the trigger of a single fop from the application. cluster/afr may give you some ideas on how you could structure it if you like that more. The other method I can think of (not sure if it would suit your needs) is to use the syncop framework (see libglusterfs/src/syncop.c). This allows one to make a 'synchronous' glusterfs fop. inside a xlator. The downside is that you can only make one call at a time. This may not be acceptable for cluster xlators (ie, xlator with more than one child xlator). Hope that helps, krish _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Sun May 20 07:23:12 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 17:23:12 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200723.q4K7NCO3009706@singularity.tronunltd.com> > 1) What is the difference between STACK_WIND > and STACK_WIND_COOKIE? I.e. I've only > ever used STACK_WIND, when should I use > it versus the other? > > STACK_WIND_COOKIE is used when we need to 'tie' the call > wound with its corresponding callback. You can see this > variant being used extensively in cluster xlators where it > is used to identify the callback with the subvolume no. it > is coming from. Ok - thanks. I will take a closer look at the examples for this .. this may help me ... > 2) Is there a way to write linearly within a single > function within Gluster (or is there a reason > why I wouldn't want to do that)? > > > RE 2: > > This may stem from my lack of understanding > of the broader Gluster internals. I am performing > multiple fops per fop, which is creating structural > inelegances in the code that make me think I'm > heading down the wrong rabbit hole. 
I want to > say; > > read() { > // pull in other content > while(want more) { > _lookup() > _open() > _read() > _close() > } > return iovec > } > > > But the way I've understood the Gluster internal > structure is that I need to operate in a chain of > related functions; > > _read_lookup_cbk_open_cbk_read_cbk() { > wind _close() > } > > _read_lookup_cbk_open_cbk() { > wind _read() > add to local->iovec > } > > _lookup_cbk() { > wind _open() > } > > read() { > while(want more) { > wind _lookup() > } > return local->iovec > } > > > > Am I missing something - or is there a nicer way of > doing this? > > The above method you are trying to use is the "continuation passing style" that > is extensively used in afr-inode-read.c and afr-transaction.c to perform multiple > internal fops on the trigger of a single fop from the application. cluster/afr may > give you some ideas on how you could structure it if you like that more. These may have been where I got that code style from originally .. I will go back to these two programs, thanks for the reference. I'm currently working my way through the afr-heal programs .. > The other method I can think of (not sure if it would suit your needs) > is to use the syncop framework (see libglusterfs/src/syncop.c). > This allows one to make a 'synchronous' glusterfs fop. inside a xlator. > The downside is that you can only make one call at a time. This may not > be acceptable for cluster xlators (ie, xlator with more than one child xlator). In the syncop framework, how much gets affected when I use it in my xlator. Does it mean that there's only one call at a time in the whole xlator (so the current write will stop all other reads) or is the scope only the fop (so that within this write, my child->fops are serial, but neighbouring reads on my xlator will continue in other threads)? And does that then restrict what can go above and below my xlator? I mean that my xlator isn't a cluster xlator but I would like it to be able to be used on top of (or underneath) a cluster xlator, will that no longer be possible? > Hope that helps, > krish Thanks Krish, every bit helps! -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Sun May 20 07:40:54 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 17:40:54 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200740.q4K7esfl009777@singularity.tronunltd.com> > > The other method I can think of (not sure if it would suit your needs) > > is to use the syncop framework (see libglusterfs/src/syncop.c). > > This allows one to make a 'synchronous' glusterfs fop. inside a xlator. > > The downside is that you can only make one call at a time. This may not > > be acceptable for cluster xlators (ie, xlator with more than one child xlator). > > In the syncop framework, how much gets affected when I > use it in my xlator. Does it mean that there's only one call > at a time in the whole xlator (so the current write will stop > all other reads) or is the scope only the fop (so that within > this write, my child->fops are serial, but neighbouring reads > on my xlator will continue in other threads)? And does that > then restrict what can go above and below my xlator? I > mean that my xlator isn't a cluster xlator but I would like it > to be able to be used on top of (or underneath) a cluster > xlator, will that no longer be possible? 
> I've just taken a look at xlators/cluster/afr/src/pump.c for some syncop usage examples and I really like what I see there. If syncop only serialises/syncs activity that I code within a given fop of my xlator and doesn't impose serial/ sync limits on the parents or children of my xlator then this looks like the right path. I want to be sure that it won't result in a globally syncronous outcome though (like ignoring a cache xlator under mine to get a true disk read) - I just need the internals of my calls to be linear. -- Ian Latter Late night coder .. http://midnightcode.org/ From manu at netbsd.org Sun May 20 08:11:04 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:11:04 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <80c2c170-133e-4509-9ac5-062293a199ad@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kke141.ffp9fr1meqkgbM%manu@netbsd.org> Krishnan Parthasarathi wrote: > It is strange that the you see glusterd_path_in_use() loop forever. If I > am not wrong, the inner loop checks for presence of trusted.gfid and > trusted.glusterfs.volume-id and should exit after that, and the outer loop > performs dirname on the path repeatedly and dirname(3) guarantees such an > operation should return "/" eventually, which we check. Here is my setup when I tried that: / with EA enabled /export/wd3a ibrick with EA enabled But I may have been testing with an untintended patch in glusterd_path_in_use(). I will retry with the right fix once it will be available. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 08:30:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:30:53 +0200 Subject: [Gluster-devel] mkdir race condition In-Reply-To: <1kkdvma.10s8o2rtrmcvpM%manu@netbsd.org> Message-ID: <1kke28c.rugeav1w049sdM%manu@netbsd.org> Emmanuel Dreyfus wrote: > It seems posix_handle_mkdir_hashes() attempts to mkdir two directories > at once: ec/ec2. How is it supposed to work? Should parent directory be > created somewhere else? This fixes the problem. Any comment? --- xlators/storage/posix/src/posix-handle.c.orig +++ xlators/storage/posix/src/posix-handle.c @@ -405,8 +405,16 @@ parpath = dirname (duppath); parpath = dirname (duppath); ret = mkdir (parpath, 0700); + if (ret == -1 && errno == ENOENT) { + char *tmppath = NULL; + + tmppath = strdupa(parpath); + ret = mkdir (dirname (tmppath), 0700); + if (ret == 0) + ret = mkdir (parpath, 0700); + } if (ret == -1 && errno != EEXIST) { gf_log (this->name, GF_LOG_ERROR, "error mkdir hash-1 %s (%s)", parpath, strerror (errno)); -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 08:47:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:47:02 +0200 Subject: [Gluster-devel] rename(2) race condition Message-ID: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> After I patched to fix the mkdir issue, I now encounter a race in rename(2). 
Most of the time it works, but sometimes:

3548 1 tar CALL open(0xbb9010e0,0xa02,0x180)
3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f"
3548 1 tar RET open 8
3548 1 tar CALL __fstat50(8,0xbfbfe69c)
3548 1 tar RET __fstat50 0
3548 1 tar CALL write(8,0x8067880,0x16)
3548 1 tar GIO fd 8 wrote 22 bytes "Nnetbsd-5-1-2-RELEASE\n"
3548 1 tar RET write 22/0x16
3548 1 tar CALL close(8)
3548 1 tar RET close 0
3548 1 tar CALL lchmod(0xbb9010e0,0x1a4)
3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f"
3548 1 tar RET lchmod 0
3548 1 tar CALL __lutimes50(0xbb9010e0,0xbfbfe6d8)
3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f"
3548 1 tar RET __lutimes50 0
3548 1 tar CALL rename(0xbb9010e0,0x8071584)
3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f"
3548 1 tar RET rename -1 errno 13 Permission denied

I can reproduce it with the commands below. They run fine for a few seconds and then hit a permission denied error. It needs a level of directory hierarchy to exhibit the behavior: a plain "install a b" at the top level will not fail.

mkdir test && echo "xxx" > test/a
while [ 1 ] ; do rm -f test/b && install test/a test/b ; done

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From mihai at patchlog.com Sun May 20 09:19:34 2012
From: mihai at patchlog.com (Mihai Secasiu)
Date: Sun, 20 May 2012 12:19:34 +0300
Subject: [Gluster-devel] glusterfs on MacOSX
Message-ID: <4FB8B726.10500@patchlog.com>

Hello,

I am trying to get glusterfs ( 3.2.6, server ) to work on MacOSX ( Lion, I think; darwin kernel 11.3 ). So far I've been able to make it compile with a few patches and --disable-fuse-client.

I want to create a volume on a MacMini that will be a replica of another volume stored on a linux server in a different location. The volume stored on the MacMini would also have to be mounted on the MacMini. Since the fuse client is broken (it's built to use macfuse, which doesn't work anymore on the latest MacOSX), I want to mount the volume over nfs. I've been able to do that ( with a small patch to the xdr code ), but it's really slow. It's so slow that mounting the volume through a remote node is a lot faster. Also, mounting the same volume on a remote node is fast, so the problem is definitely in the NFS server on MacOSX. I did a strace ( dtruss ) on it and it seems like it's doing a lot of polling. Could this be the cause of the slowness?

If anyone wants to try this you can fetch it from https://github.com/mihaisecasiu/glusterfs/tree/release-3.2

Thanks

From manu at netbsd.org Sun May 20 12:43:52 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Sun, 20 May 2012 14:43:52 +0200
Subject: [Gluster-devel] rename(2) race condition
In-Reply-To: <1kke2xl.17aqgj1oar475M%manu@netbsd.org>
Message-ID: <1kkee8d.8hdhfs177z5zdM%manu@netbsd.org>

Emmanuel Dreyfus wrote:

> After I patched to fix the mkdir issue, I now encounter a race in
> rename(2). Most of the time it works, but sometimes:

And the problem only happens when running as an unprivileged user. It works fine for root.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From kparthas at redhat.com Sun May 20 14:14:10 2012
From: kparthas at redhat.com (Krishnan Parthasarathi)
Date: Sun, 20 May 2012 10:14:10 -0400 (EDT)
Subject: [Gluster-devel] 3.3 requires extended attribute on /
In-Reply-To: <1kke141.ffp9fr1meqkgbM%manu@netbsd.org>
Message-ID:

Emmanuel,

I have submitted the fix for review: http://review.gluster.com/3380
I have not tested the fix with "/" having EA disabled. It would be great if you could confirm the looping forever doesn't happen with this fix.
thanks, krish ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Krishnan Parthasarathi" Cc: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 1:41:04 PM Subject: Re: [Gluster-devel] 3.3 requires extended attribute on / Krishnan Parthasarathi wrote: > It is strange that the you see glusterd_path_in_use() loop forever. If I > am not wrong, the inner loop checks for presence of trusted.gfid and > trusted.glusterfs.volume-id and should exit after that, and the outer loop > performs dirname on the path repeatedly and dirname(3) guarantees such an > operation should return "/" eventually, which we check. Here is my setup when I tried that: / with EA enabled /export/wd3a ibrick with EA enabled But I may have been testing with an untintended patch in glusterd_path_in_use(). I will retry with the right fix once it will be available. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 21 04:51:59 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 06:51:59 +0200 Subject: [Gluster-devel] zero'ed local data in mdc_setattr_cbk Message-ID: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> Hi Here is a backtrace for a SIGSEGV in md-cache code. Note inode = NULL when mdc_inode_iatt_set() is called by mdc_setattr_cbk(). It seems that local got corupted in the later function #0 0xbbb3a7c9 in pthread_spin_lock () from /usr/lib/libpthread.so.1 #1 0xbaa09d8c in mdc_inode_prep (this=0xba3e5000, inode=0x0) at md-cache.c:267 #2 0xbaa0a1bf in mdc_inode_iatt_set (this=0xba3e5000, inode=0x0, iatt=0xb9401d40) at md-cache.c:384 #3 0xbaa0ee16 in mdc_setattr_cbk (frame=0xbb77f400, cookie=0xbb77f470, this=0xba3e5000, op_ret=0, op_errno=0, prebuf=0xb9401cd8, postbuf=0xb9401d40, xdata=0x0) at md-cache.c:1423 #4 0xbaa1d0ec in qr_fsetattr_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #5 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f0f0, cookie=0xbb77f160, this=0xba3e3000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #6 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f160, cookie=0xbb77f1d0, this=0xba3e2000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #7 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f1d0, cookie=0xbb77f240, this=0xba3e1000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #8 0xb9aa9d23 in afr_fsetattr_unwind (frame=0xba801ee8, this=0xba3d1000) at afr-inode-write.c:1160 #9 0xb9aa9f01 in afr_fsetattr_wind_cbk (frame=0xba801ee8, cookie=0x0, this=0xba3d1000, op_ret=0, op_errno=0, preop=0xbfbfe880, postop=0xbfbfe818, xdata=0x0) at afr-inode-write.c:1221 #10 0xbaa6a099 in client3_1_fsetattr_cbk (req=0xb90010d8, iov=0xb90010f8, count=1, myframe=0xbb77f010) at client3_1-fops.c:1897 #11 0xbbb6975e in rpc_clnt_handle_reply (clnt=0xba3c5270, pollin=0xbb77d220) at rpc-clnt.c:788 #12 0xbbb699fb in rpc_clnt_notify (trans=0xbb70f000, mydata=0xba3c5290, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-clnt.c:907 #13 0xbbb659c7 in rpc_transport_notify (this=0xbb70f000, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-transport.c:489 #14 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #15 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #16 0xbbbb281f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=2) at event.c:357 #17 0xbbbb2a8b in 
event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #18 0xbbbb2db7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #19 0x0805015e in main () (gdb) frame 3 #3 0xbaa0ee16 in mdc_setattr_cbk (frame=0xbb77f400, cookie=0xbb77f470, this=0xba3e5000, op_ret=0, op_errno=0, prebuf=0xb9401cd8, postbuf=0xb9401d40, xdata=0x0) at md-cache.c:1423 1423 mdc_inode_iatt_set (this, local->loc.inode, postbuf); (gdb) print *local $2 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, fd = 0xb8f9d054, linkname = 0x0, xattr = 0x0} -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 21 10:14:24 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 10:14:24 +0000 Subject: [Gluster-devel] zero'ed local data in mdc_setattr_cbk In-Reply-To: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> References: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> Message-ID: <20120521101424.GA10504@homeworld.netbsd.org> On Mon, May 21, 2012 at 06:51:59AM +0200, Emmanuel Dreyfus wrote: > Here is a backtrace for a SIGSEGV in md-cache code. Note inode = NULL > when mdc_inode_iatt_set() is called by mdc_setattr_cbk(). I submitted a patch to fix it, please review http://review.gluster.com/3383 -- Emmanuel Dreyfus manu at netbsd.org From jdarcy at redhat.com Mon May 21 12:24:30 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 21 May 2012 08:24:30 -0400 Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> References: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Message-ID: <4FBA33FE.3050602@redhat.com> On 05/20/2012 02:12 AM, Ian Latter wrote: > Hello, > > > Couple of questions that might help make my > module a little more sane; > > 0) Is there any developer docco? I've just done > another quick search and I can't see any. Let > me know if there is and I'll try and answer the > below myself. Your best bet right now (if I may say so) is the stuff I've posted on hekafs.org - the "Translator 101" articles plus the API overview at http://hekafs.org/dist/xlator_api_2.html > 1) What is the difference between STACK_WIND > and STACK_WIND_COOKIE? I.e. I've only > ever used STACK_WIND, when should I use > it versus the other? I see Krishnan has already covered this. > 2) Is there a way to write linearly within a single > function within Gluster (or is there a reason > why I wouldn't want to do that)? Any blocking ops would have to be built on top of async ops plus semaphores etc. because (unlike e.g. an HTTP server) the underlying sockets etc. are shared/multiplexed between users and activities. Thus you'd get much more context switching that way than if you stay within the async/continuation style. Some day in the distant future, I'd like to work some more on a preprocessor that turns linear code into async code so that it's easier to write but retains the performance and resource-efficiency advantages of an essentially async style. I did some work (http://pl.atyp.us/ripper/UserGuide.html) in this area several years ago, but it has probably bit-rotted to hell since then. With more recent versions of gcc and LLVM it should be possible to overcome some of the limitations that version had. 
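As a rough illustration of the "blocking ops built on top of async ops plus semaphores" idea mentioned above, and, in spirit, of what a per-task syncop does, here is a minimal standalone sketch. The names async_stat(), sync_stat() and sync_args_t are invented for this example; a real xlator would go through STACK_WIND and the synctask machinery in libglusterfs/src/syncop.c rather than raw pthreads.

/* blocking_over_async.c - invented example; compile with -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* completion state shared between the caller and the async callback */
typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             done;
        int             op_ret;
        int             op_errno;
} sync_args_t;

typedef void (*stat_cbk_t)(void *cookie, int op_ret, int op_errno);

typedef struct {
        stat_cbk_t cbk;
        void      *cookie;
} async_call_t;

/* stand-in for an asynchronous fop: completes on another thread */
static void *async_worker(void *arg)
{
        async_call_t *call   = arg;
        stat_cbk_t    cbk    = call->cbk;
        void         *cookie = call->cookie;

        free(call);
        usleep(100000);            /* pretend the fop takes a while */
        cbk(cookie, 0, 0);         /* report success via the callback */
        return NULL;
}

static void async_stat(const char *path, stat_cbk_t cbk, void *cookie)
{
        async_call_t *call = malloc(sizeof(*call));
        pthread_t     tid;

        (void)path;
        call->cbk = cbk;
        call->cookie = cookie;
        pthread_create(&tid, NULL, async_worker, call);
        pthread_detach(tid);
}

/* callback: record the result and wake the one waiting task */
static void sync_stat_cbk(void *cookie, int op_ret, int op_errno)
{
        sync_args_t *args = cookie;

        pthread_mutex_lock(&args->lock);
        args->op_ret   = op_ret;
        args->op_errno = op_errno;
        args->done     = 1;
        pthread_cond_signal(&args->cond);
        pthread_mutex_unlock(&args->lock);
}

/* blocking wrapper: only this caller waits; other tasks keep running */
static int sync_stat(const char *path)
{
        sync_args_t args = { PTHREAD_MUTEX_INITIALIZER,
                             PTHREAD_COND_INITIALIZER, 0, 0, 0 };

        async_stat(path, sync_stat_cbk, &args);
        pthread_mutex_lock(&args.lock);
        while (!args.done)
                pthread_cond_wait(&args.cond, &args.lock);
        pthread_mutex_unlock(&args.lock);
        return args.op_ret;
}

int main(void)
{
        printf("stat returned %d\n", sync_stat("subdir/file"));
        return 0;
}

Only the task that called sync_stat() is parked on the condition variable; everything else in the process keeps running, which matches the per-fop scope discussed earlier in the thread. The cost is an extra sleep/wake per wrapped call, which is the context-switch overhead referred to above.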
From manu at netbsd.org Mon May 21 16:27:21 2012
From: manu at netbsd.org (Emmanuel Dreyfus)
Date: Mon, 21 May 2012 18:27:21 +0200
Subject: [Gluster-devel] rename(2) race condition
In-Reply-To: <1kke2xl.17aqgj1oar475M%manu@netbsd.org>
Message-ID: <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org>

Emmanuel Dreyfus wrote:

> 3548 1 tar CALL rename(0xbb9010e0,0x8071584)
> 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f"
> 3548 1 tar RET rename -1 errno 13 Permission denied

I tracked this down to FUSE LOOKUP operations that do not set fuse_entry's attr.uid correctly (it is left set to 0).

Here is the summary of my findings so far:
- as an unprivileged user, I create and delete files like crazy
- most of the time everything is fine
- sometimes a LOOKUP for a file I created (as an unprivileged user) will return a fuse_entry with uid set to 0, which causes the kernel to raise EACCES when I try to delete the file.

Here is an example of a FUSE trace, produced by the test case
while [ 1 ] ; do cp /etc/fstab test/foo1 ; rm test/foo1 ; done

> unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1)
< unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1), error = -2
> unique = 1436, nodeid = 3098542296, opcode = CREATE (35)
< unique = 1436, nodeid = 3098542296, opcode = CREATE (35), error = 0
> unique = 1437, nodeid = 3098542396, opcode = SETATTR (4)
< unique = 1437, nodeid = 3098542396, opcode = SETATTR (4), error = 0
> unique = 1438, nodeid = 3098542396, opcode = WRITE (16)
< unique = 1438, nodeid = 3098542396, opcode = WRITE (16), error = 0
> unique = 1439, nodeid = 3098542396, opcode = FSYNC (20)
< unique = 1439, nodeid = 3098542396, opcode = FSYNC (20), error = 0
> unique = 1440, nodeid = 3098542396, opcode = RELEASE (18)
< unique = 1440, nodeid = 3098542396, opcode = RELEASE (18), error = 0
> unique = 1441, nodeid = 3098542396, opcode = GETATTR (3)
< unique = 1441, nodeid = 3098542396, opcode = GETATTR (3), error = 0
> unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1)
< unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1), error = 0

--> here I sometimes get fuse_entry's attr.uid incorrectly set to 0
--> When this happens, the subsequent delete fails with EACCES.

> unique = 1443, nodeid = 3098542296, opcode = UNLINK (10)
< unique = 1443, nodeid = 3098542296, opcode = UNLINK (10), error = 0
> unique = 1444, nodeid = 3098542396, opcode = FORGET (2)

Is it possible that metadata writes are now so asynchronous that a subsequent lookup cannot retrieve the up-to-date value? If that is the problem, how can I fix it? There is nothing telling the FUSE implementation that a CREATE or SETATTR has just partially completed and has metadata pending.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

From ian.latter at midnightcode.org Mon May 21 23:02:44 2012
From: ian.latter at midnightcode.org (Ian Latter)
Date: Tue, 22 May 2012 09:02:44 +1000
Subject: [Gluster-devel] Gluster internals
Message-ID: <201205212302.q4LN2idg017478@singularity.tronunltd.com>

> > 0) Is there any developer docco? I've just done
> > another quick search and I can't see any. Let
> > me know if there is and I'll try and answer the
> > below myself.
>
> Your best bet right now (if I may say so) is the stuff I've posted on
> hekafs.org - the "Translator 101" articles plus the API overview at
>
> http://hekafs.org/dist/xlator_api_2.html

You must say so - there is so little docco.
Actually before I posted I went and re-read your Translator 101 docs as you referred them to me on 10 May, but I hadn't found your API overview - thanks (for both)! > > 2) Is there a way to write linearly within a single > > function within Gluster (or is there a reason > > why I wouldn't want to do that)? > > Any blocking ops would have to be built on top of async ops plus semaphores > etc. because (unlike e.g. an HTTP server) the underlying sockets etc. are > shared/multiplexed between users and activities. Thus you'd get much more > context switching that way than if you stay within the async/continuation style. Interesting - I haven't ever done semaphore coding, but it may not be needed. The syncop framework that Krish referred too seems to do this via a mutex lock (synctask_yawn) and a context switch (synctask_yield). What's the drawback with increased context switching? After my email thread with Krish I decided against syncop, but the flow without was going to be horrific. The only way I could bring it back to anything even half as sane as the afr code (which can cleverly loop through its own _cbk's recursively - I like that, whoever put that together) was to have the last cbk in a chain (say the "close_cbk") call the original function with an index or stepper increment. But after sitting on the idea for a couple of days I actually came to the same conclusion as Manu did in the last message. I.e. without docco I have been writing to what seems to work, and in my 2009 code (I saw last night) a "mkdir" wind followed by "create" code in the same function - which I believe, now, is probably a race condition (because of the threaded/async structure forced through the wind/call macro model). In that case I *do* want a synchronous write - but only within my xlator (which, if I'm reading this right, *is* what syncop does) - as opposed to an end-to-end synchronous write (being sync'd through the full stack of xlators: ignoring caching, waiting for replication to be validated, etc). Although, the same synchronous outcome comes from the chained async calls ... but then we get back to the readability/ fixability of the code. > Some day in the distant future, I'd like to work some more on a preprocessor > that turns linear code into async code so that it's easier to write but retains > the performance and resource-efficiency advantages of an essentially async > style. I did some work (http://pl.atyp.us/ripper/UserGuide.html) in this area > several years ago, but it has probably bit-rotted to hell since then. With > more recent versions of gcc and LLVM it should be possible to overcome some of > the limitations that version had. Yes, I had a very similar thought - a C pre-processor isn't in my experience or time scale though; I considered writing up a script that would chain it out in C for me. I was going to borrow from a script that I wrote which builds one of the libMidnightCode header files but even that seemed impractical .. would anyone be able to debug it? Would I even understand in 2yrs from now - lol So I think the long and the short of it is that anything I do here won't be pretty .. or perhaps: one will look pretty and the other will run pretty :) -- Ian Latter Late night coder .. 
http://midnightcode.org/ From anand.avati at gmail.com Mon May 21 23:59:07 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 16:59:07 -0700 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> References: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> Message-ID: Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the chown() or chmod() syscall issued by the application strictly block till GlusterFS's fuse_setattr_cbk() is called? Avati On Mon, May 21, 2012 at 9:27 AM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > > > 3548 1 tar CALL rename(0xbb9010e0,0x8071584) > > 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" > > 3548 1 tar RET rename -1 errno 13 Permission denied > > I tracked this down to FUSE LOOKUP operation that do not set > fuse_entry's attr.uid correctly (it is left set to 0). > > Here is the summary of my findings so far: > - as un unprivilegied user, I create and delete files like crazy > - most of the time everything is fine > - sometime a LOOKUP for a file I created (as an unprivilegied user) will > return a fuse_entry with uid set to 0, which cause the kernel to raise > EACCESS when I try to delete the file. > > Here is an example of a FUSE trace, produced by the test case > while [ 1 ] ; do cp /etc/fstab test/foo1 ; rm test/foo1 ; done > > > unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1), error = -2 > > unique = 1436, nodeid = 3098542296, opcode = CREATE (35) > < unique = 1436, nodeid = 3098542296, opcode = CREATE (35), error = 0 > > unique = 1437, nodeid = 3098542396, opcode = SETATTR (4) > < unique = 1437, nodeid = 3098542396, opcode = SETATTR (4), error = 0 > > unique = 1438, nodeid = 3098542396, opcode = WRITE (16) > < unique = 1438, nodeid = 3098542396, opcode = WRITE (16), error = 0 > > unique = 1439, nodeid = 3098542396, opcode = FSYNC (20) > < unique = 1439, nodeid = 3098542396, opcode = FSYNC (20), error = 0 > > unique = 1440, nodeid = 3098542396, opcode = RELEASE (18) > < unique = 1440, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > > unique = 1441, nodeid = 3098542396, opcode = GETATTR (3) > < unique = 1441, nodeid = 3098542396, opcode = GETATTR (3), error = 0 > > unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 > > --> here I sometimes get fuse_entry's attr.uid incorrectly set to 0 > --> When this happens, LOOKUP fails and returns EACCESS. > > > unique = 1443, nodeid = 3098542296, opcode = UNLINK (10) > < unique = 1443, nodeid = 3098542296, opcode = UNLINK (10), error = 0 > > unique = 1444, nodeid = 3098542396, opcode = FORGET (2) > > > Is it possible that metadata writes are now so asynchronous that a > subsequent lookup cannot retreive the up to date value? If that is the > problem, how can I fix it? There is nothing telling the FUSE > implementation that a CREATE or SETATTR has just partially completed and > has metadata pending. > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From anand.avati at gmail.com Tue May 22 00:11:47 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 17:11:47 -0700 Subject: [Gluster-devel] A healing translator In-Reply-To: <4FA8E8AB.2040604@datalab.es> References: <4FA8E8AB.2040604@datalab.es> Message-ID: On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez wrote: > Hello developers, > > I would like to expose some ideas we are working on to create a new kind > of translator that should be able to unify and simplify to some extent the > healing procedures of complex translators. > > Currently, the only translator with complex healing capabilities that we > are aware of is AFR. We are developing another translator that will also > need healing capabilities, so we thought that it would be interesting to > create a new translator able to handle the common part of the healing > process and hence to simplify and avoid duplicated code in other > translators. > > The basic idea of the new translator is to handle healing tasks nearer the > storage translator on the server nodes instead to control everything from a > translator on the client nodes. Of course the heal translator is not able > to handle healing entirely by itself, it needs a client translator which > will coordinate all tasks. The heal translator is intended to be used by > translators that work with multiple subvolumes. > > I will try to explain how it works without entering into too much details. > > There is an important requisite for all client translators that use > healing: they must have exactly the same list of subvolumes and in the same > order. Currently, I think this is not a problem. > > The heal translator treats each file as an independent entity, and each > one can be in 3 modes: > > 1. Normal mode > > This is the normal mode for a copy or fragment of a file when it is > synchronized and consistent with the same file on other nodes (for example > with other replicas. It is the client translator who decides if it is > synchronized or not). > > 2. Healing mode > > This is the mode used when a client detects an inconsistency in the copy > or fragment of the file stored on this node and initiates the healing > procedures. > > 3. Provider mode (I don't like very much this name, though) > > This is the mode used by client translators when an inconsistency is > detected in this file, but the copy or fragment stored in this node is > considered good and it will be used as a source to repair the contents of > this file on other nodes. > > Initially, when a file is created, it is set in normal mode. Client > translators that make changes must guarantee that they send the > modification requests in the same order to all the servers. This should be > done using inodelk/entrylk. > > When a change is sent to a server, the client must include a bitmap mask > of the clients to which the request is being sent. Normally this is a > bitmap containing all the clients, however, when a server fails for some > reason some bits will be cleared. The heal translator uses this bitmap to > early detect failures on other nodes from the point of view of each client. > When this condition is detected, the request is aborted with an error and > the client is notified with the remaining list of valid nodes. If the > client considers the request can be successfully server with the remaining > list of nodes, it can resend the request with the updated bitmap. 
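A minimal standalone sketch of the bitmap bookkeeping described above may help picture it. Everything here is an assumption made for illustration (the child_up[] array, the 64-subvolume limit and both helper names); it is not code from the proposed translator.

/* subvolume bitmap sketch - illustration only, not from the proposal */
#include <stdint.h>
#include <stdio.h>

#define MAX_SUBVOLS 64

/* build the mask of children the client intends to send a change to */
static uint64_t build_mask(const int child_up[], int child_count)
{
        uint64_t mask = 0;
        int      i;

        for (i = 0; i < child_count && i < MAX_SUBVOLS; i++)
                if (child_up[i])
                        mask |= (1ULL << i);
        return mask;
}

/* server side: fail early if the client still counts on a node the
 * server has already seen go down, so the client can retry with an
 * updated bitmap                                                     */
static int check_mask(uint64_t client_mask, uint64_t server_mask)
{
        return (client_mask & ~server_mask) ? -1 : 0;
}

int main(void)
{
        int      child_up[4] = { 1, 1, 0, 1 };  /* subvolume 2 is down     */
        uint64_t client_mask = build_mask(child_up, 4);
        uint64_t server_mask = 0x3;             /* server also lost node 3 */

        printf("client mask %#llx, retry needed: %s\n",
               (unsigned long long)client_mask,
               check_mask(client_mask, server_mask) ? "yes" : "no");
        return 0;
}

The real translator would presumably carry such a mask in each change request and compare it against the server's own view, aborting early when they disagree, as described above.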
> > The heal translator also updates two file attributes for each change > request to mantain the "version" of the data and metadata contents of the > file. A similar task is currently made by AFR using xattrop. This would not > be needed anymore, speeding write requests. > > The version of data and metadata is returned to the client for each read > request, allowing it to detect inconsistent data. > > When a client detects an inconsistency, it initiates healing. First of > all, it must lock the entry and inode (when necessary). Then, from the data > collected from each node, it must decide which nodes have good data and > which ones have bad data and hence need to be healed. There are two > possible cases: > > 1. File is not a regular file > > In this case the reconstruction is very fast and requires few requests, so > it is done while the file is locked. In this case, the heal translator does > nothing relevant. > > 2. File is a regular file > > For regular files, the first step is to synchronize the metadata to the > bad nodes, including the version information. Once this is done, the file > is set in healing mode on bad nodes, and provider mode on good nodes. Then > the entry and inode are unlocked. > > When a file is in provider mode, it works as in normal mode, but refuses > to start another healing. Only one client can be healing a file. > > When a file is in healing mode, each normal write request from any client > are handled as if the file were in normal mode, updating the version > information and detecting possible inconsistencies with the bitmap. > Additionally, the healing translator marks the written region of the file > as "good". > > Each write request from the healing client intended to repair the file > must be marked with a special flag. In this case, the area that wants to be > written is filtered by the list of "good" ranges (if there are any > intersection with a good range, it is removed from the request). The > resulting set of ranges are propagated to the lower translator and added to > the list of "good" ranges but the version information is not updated. > > Read requests are only served if the range requested is entirely contained > into the "good" regions list. > > There are some additional details, but I think this is enough to have a > general idea of its purpose and how it works. > > The main advantages of this translator are: > > 1. Avoid duplicated code in client translators > 2. Simplify and unify healing methods in client translators > 3. xattrop is not needed anymore in client translators to keep track of > changes > 4. Full file contents are repaired without locking the file > 5. Better detection and prevention of some split brain situations as soon > as possible > > I think it would be very useful. It seems to me that it works correctly in > all situations, however I don't have all the experience that other > developers have with the healing functions of AFR, so I will be happy to > answer any question or suggestion to solve problems it may have or to > improve it. > > What do you think about it ? > > The goals you state above are all valid. What would really help (adoption) is if you can implement this as a modification of AFR by utilizing all the work already done, and you get brownie points if it is backward compatible with existing AFR. If you already have any code in a publishable state, please share it with us (github link?). Avati -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ian.latter at midnightcode.org Tue May 22 00:40:03 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 10:40:03 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> Actually, while we're at this level I'd like to bolt on another thought / query - these were my words; > But after sitting on the idea for a couple of days I actually came > to the same conclusion as Manu did in the last message. I.e. > without docco I have been writing to what seems to work, and > in my 2009 code (I saw last night) a "mkdir" wind followed by "create" > code in the same function - which I believe, now, is probably a > race condition (because of the threaded/async structure forced > through the wind/call macro model). But they include an assumption. The query is: are async writes and reads sequential? The two specific cases are; 1) Are all reads that are initiated in time after a write guaranteed to occur after that write has taken affect? 2) Are all writes that are initiated in time after a write (x) guaranteed to occur after that write (x) has taken affect? I could also appreciate that there may be a difference between the top/user layer view and the xlator internals .. if there is then can you please include that view in the explanation? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Tue May 22 01:27:41 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 18:27:41 -0700 Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> References: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> Message-ID: On Mon, May 21, 2012 at 5:40 PM, Ian Latter wrote: > > But they include an assumption. > > The query is: are async writes and reads sequential? The > two specific cases are; > > 1) Are all reads that are initiated in time after a write > guaranteed to occur after that write has taken affect? > Yes > > 2) Are all writes that are initiated in time after a write (x) > guaranteed to occur after that write (x) has taken > affect? > Only overlapping offsets/regions retain causal ordering of completion. It is write-behind which acknowledges writes pre-maturely and therefore the layer which must maintain the 'effects' for further reads and writes by making the dependent IOs (overlapping offset/regions) wait for previous write's actual completion. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 05:33:37 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 07:33:37 +0200 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: Message-ID: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Anand Avati wrote: > Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the > chown() or chmod() syscall issued by the application strictly block till > GlusterFS's fuse_setattr_cbk() is called? I have been able to narrow the test down to the code below, which does not even call chown(). 
#include #include #include #include #include #include int main(void) { int fd; (void)mkdir("subdir", 0755); do { if ((fd = open("subdir/bugc1.txt", O_CREAT|O_RDWR, 0644)) == -1) err(EX_OSERR, "open failed"); if (close(fd) == -1) err(EX_OSERR, "close failed"); if (unlink("subdir/bugc1.txt") == -1) err(EX_OSERR, "unlink failed"); } while (1 /*CONSTCOND */); /* NOTREACHED */ return EX_OK; } It produces a FUSE trace without SETATTR: > unique = 393, nodeid = 3098542496, opcode = LOOKUP (1) < unique = 393, nodeid = 3098542496, opcode = LOOKUP (1), error = -2 > unique = 394, nodeid = 3098542496, opcode = CREATE (35) < unique = 394, nodeid = 3098542496, opcode = CREATE (35), error = 0 -> I suspect (not yet checked) this is the place where I get fuse_entry_out with attr.uid = 0. This will be cached since attr_valid tells us to do so. > unique = 395, nodeid = 3098542396, opcode = RELEASE (18) < unique = 395, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > unique = 396, nodeid = 3098542296, opcode = LOOKUP (1) < unique = 396, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 >From other traces, I can tell that this last lookup is for the parent directory (subdir). The FUSE request for looking up bugc1.txt with the intent of deleting is not even sent: from cached uid we obtained from fuse_entry_out, we know that permissions shall be denied (I had a debug printf to check that). We do not even ask. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Tue May 22 05:44:30 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 22:44:30 -0700 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: On Mon, May 21, 2012 at 10:33 PM, Emmanuel Dreyfus wrote: > Anand Avati wrote: > > > Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the > > chown() or chmod() syscall issued by the application strictly block till > > GlusterFS's fuse_setattr_cbk() is called? > > I have been able to narrow the test down to the code below, which does not > even > call chown(). > > #include > #include > #include > #include > #include > #include > > int > main(void) > { > int fd; > > (void)mkdir("subdir", 0755); > > do { > if ((fd = open("subdir/bugc1.txt", O_CREAT|O_RDWR, 0644)) > == -1) > err(EX_OSERR, "open failed"); > > if (close(fd) == -1) > err(EX_OSERR, "close failed"); > > if (unlink("subdir/bugc1.txt") == -1) > err(EX_OSERR, "unlink failed"); > } while (1 /*CONSTCOND */); > > /* NOTREACHED */ > return EX_OK; > } > > It produces a FUSE trace without SETATTR: > > > unique = 393, nodeid = 3098542496, opcode = LOOKUP (1) > < unique = 393, nodeid = 3098542496, opcode = LOOKUP (1), error = -2 > > unique = 394, nodeid = 3098542496, opcode = CREATE (35) > < unique = 394, nodeid = 3098542496, opcode = CREATE (35), error = 0 > > -> I suspect (not yet checked) this is the place where I get > fuse_entry_out > with attr.uid = 0. This will be cached since attr_valid tells us to do so. > > > unique = 395, nodeid = 3098542396, opcode = RELEASE (18) > < unique = 395, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > > unique = 396, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 396, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 > > From other traces, I can tell that this last lookup is for the parent > directory > (subdir). 
The FUSE request for looking up bugc1.txt with the intent of > deleting > is not even sent: from cached uid we obtained from fuse_entry_out, we know > that > permissions shall be denied (I had a debug printf to check that). We do > not even > ask. > > Even in the case where bugc1.txt got a wrong uid returned (assuming so), it should not influence the permissibility of it getting deleted. The deletability of a file is based on the permissions on the parent directory and not the ownership of the file (unless +t sticky bit was set on the directory). Is there a way you can extend the trace code above to show the UIDs getting returned? Maybe it was the parent directory (subdir) that got a wrong UID returned? Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From aavati at redhat.com Tue May 22 07:11:36 2012 From: aavati at redhat.com (Anand Avati) Date: Tue, 22 May 2012 00:11:36 -0700 Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <7a290bd4-c833-4a35-af04-adb0052f6ff2@zmail01.collab.prod.int.phx2.redhat.com> References: <7a290bd4-c833-4a35-af04-adb0052f6ff2@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FBB3C28.2020106@redhat.com> The PARENT_DOWN_HANDLED approach will take us backwards from the current state where we are resiliant to frame losses and other class of bugs (i.e, if a frame loss happens on either server or client, it only results in prevented graph cleanup but the graph switch still happens). The root "cause" here is that we are giving up on a very important and fundamental principle of immutability on the fd object. The real solution here is to never modify fd->inode. Instead we must bring about a more native fd "migration" than just re-opening an existing fd on the new graph. Think of the inode migration analogy. The handle coming from FUSE (the address of the object) is a "hint". Usually the hint is right, if the object in the address belongs to the latest graph. If not, using the GFID we resolve a new inode on the latest graph and use it. In case of FD we can do something similar, except there are not GFIDs (which should not be a problem). We need to make the handle coming from FUSE (the address of fd_t) just a hint. If the fd->inode->table->xl->graph is the latest, then the hint was a HIT. If the graph was not the latest, we look for a previous migration attempt+result in the "base" (original) fd's context. If that does not exist or is not fresh (on the latest graph) then we do a new fd creation, open on new graph, fd_unref the old cached result in the fd context of the "base fd" and keep ref to this new result. All this must happen from fuse_resolve_fd(). The setting of the latest fd and updation of the latest fd pointer happens under the scope of the base_fd->lock() which gives it a very clear and unambiguous scope which was missing with the old scheme. [The next step will be to nuke the fd->inode swapping in fuse_create_cbk] Avati On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote: > > > ----- Original Message ----- >> From: "Pranith Kumar Karampuri" >> To: "Anand Avati" >> Cc: "Vijay Bellur", "Amar Tumballi", "Krishnan Parthasarathi" >> , "Raghavendra Gowdappa" >> Sent: Tuesday, May 22, 2012 8:42:58 AM >> Subject: Re: RFC on fix to bug #802414 >> >> Dude, >> We have already put logs yesterday in LOCK and UNLOCK and saw >> that the&fd->inode->lock address changed from LOCK to UNLOCK. > > Yes, even I too believe that the hang is because of fd->inode swap in fuse_migrate_fd and not the one in fuse_create_cbk. 
We could clearly see in the log files following race: > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this was a naive fix - hold lock on inode in old graph - to the race-condition caused by swapping fd->inode, which didn't work) > > poll-thr: tries to acquire fd->inode->lock (inode is old_inode present in old-graph) in afr_local_cleanup > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock > poll-thr: gets woken up from lock call on old_inode->lock. > poll-thr: does its work, but while unlocking, uses fd->inode where inode belongs to new graph. > > we had logs printing lock address before and after acquisition of lock and we could clearly see that lock address changed after acquiring lock in afr_local_cleanup. > >> >>>> "The hang in fuse_migrate_fd is _before_ the inode swap performed >>>> there." >> All the fds are opened on the same file. So all fds in the fd >> migration point to same inode. The race is hit by nth fd, (n+1)th fd >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and >> LOCK(fd->inode->lock) was done with one address then by the time >> UNLOCK(fd->inode->lock) is done the address changed. So the next fd >> that has to migrate hung because the prev inode lock is not >> unlocked. >> >> If after nth fd introduces the race a _cbk comes in epoll thread on >> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will >> hang. >> Which is my theory for the hang we observed on Saturday. >> >> Pranith. >> ----- Original Message ----- >> From: "Anand Avati" >> To: "Raghavendra Gowdappa" >> Cc: "Vijay Bellur", "Amar Tumballi" >> , "Krishnan Parthasarathi" >> , "Pranith Kumar Karampuri" >> >> Sent: Tuesday, May 22, 2012 2:09:33 AM >> Subject: Re: RFC on fix to bug #802414 >> >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote: >>> Avati, >>> >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new >>> inode to fd, once it looks up inode in new graph. But this >>> assignment can race with code that accesses fd->inode->lock >>> executing in poll-thread (pthr) as follows >>> >>> pthr: LOCK (fd->inode->lock); (inode in old graph) >>> rdthr: fd->inode = inode (resolved in new graph) >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph) >>> >> >> The way I see it (the backtrace output in the other mail), the swap >> happening in fuse_create_cbk() must be the one causing lock/unlock to >> land on different inode objects. The hang in fuse_migrate_fd is >> _before_ >> the inode swap performed there. Can you put some logs in >> fuse_create_cbk()'s inode swap code and confirm this? >> >> >>> Now, any lock operations on inode in old graph will block. Thanks >>> to pranith for pointing to this race-condition. >>> >>> The problem here is we don't have a single lock that can >>> synchronize assignment "fd->inode = inode" and other locking >>> attempts on fd->inode->lock. So, we are thinking that instead of >>> trying to synchronize, eliminate the parallel accesses altogether. >>> This can be done by splitting fd migration into two tasks. >>> >>> 1. Actions on old graph (like fsync to flush writes to disk) >>> 2. Actions in new graph (lookup, open) >>> >>> We can send PARENT_DOWN when, >>> 1. Task 1 is complete. >>> 2. No fop sent by fuse is pending. >>> >>> on receiving PARENT_DOWN, protocol/client will shutdown transports. >>> As part of transport cleanup, all pending frames are unwound and >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED >>> event. 
Each of the translator will pass this event to its parents >>> once it is convinced that there are no pending fops started by it >>> (like background self-heal, reads as part of read-ahead etc). Once >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there >>> will be no replies that will be racing with migration (note that >>> migration is done using syncops). At this point in time, it is >>> safe to start Task 2 (which associates fd with an inode in new >>> graph). >>> >>> Also note that reader thread will not do other operations till it >>> completes both tasks. >>> >>> As far as the implementation of this patch goes, major work is in >>> translators like read-ahead, afr, dht to provide the guarantee >>> required to send PARENT_DOWN_HANDLED event to their parents. >>> >>> Please let me know your thoughts on this. >>> >> >> All the above steps might not apply if it is caused by the swap in >> fuse_create_cbk(). Let's confirm that first. >> >> Avati >> From ian.latter at midnightcode.org Tue May 22 07:18:08 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 17:18:08 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205220718.q4M7I8sJ019827@singularity.tronunltd.com> > > But they include an assumption. > > > > The query is: are async writes and reads sequential? The > > two specific cases are; > > > > 1) Are all reads that are initiated in time after a write > > guaranteed to occur after that write has taken affect? > > > > Yes > Excellent. > > > > 2) Are all writes that are initiated in time after a write (x) > > guaranteed to occur after that write (x) has taken > > affect? > > > > Only overlapping offsets/regions retain causal ordering of completion. It > is write-behind which acknowledges writes pre-maturely and therefore the > layer which must maintain the 'effects' for further reads and writes by > making the dependent IOs (overlapping offset/regions) wait for previous > write's actual completion. > Ok, that should do the trick. Let me mull over this for a while .. Thanks for that info. > Avati > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Tue May 22 07:44:25 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 22 May 2012 09:44:25 +0200 Subject: [Gluster-devel] A healing translator In-Reply-To: References: <4FA8E8AB.2040604@datalab.es> Message-ID: <4FBB43D9.9070605@datalab.es> On 05/22/2012 02:11 AM, Anand Avati wrote: > > > On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez > > wrote: > > Hello developers, > > I would like to expose some ideas we are working on to create a > new kind of translator that should be able to unify and simplify > to some extent the healing procedures of complex translators. > > Currently, the only translator with complex healing capabilities > that we are aware of is AFR. We are developing another translator > that will also need healing capabilities, so we thought that it > would be interesting to create a new translator able to handle the > common part of the healing process and hence to simplify and avoid > duplicated code in other translators. > > The basic idea of the new translator is to handle healing tasks > nearer the storage translator on the server nodes instead to > control everything from a translator on the client nodes. Of > course the heal translator is not able to handle healing entirely > by itself, it needs a client translator which will coordinate all > tasks. 
The heal translator is intended to be used by translators > that work with multiple subvolumes. > > I will try to explain how it works without entering into too much > details. > > There is an important requisite for all client translators that > use healing: they must have exactly the same list of subvolumes > and in the same order. Currently, I think this is not a problem. > > The heal translator treats each file as an independent entity, and > each one can be in 3 modes: > > 1. Normal mode > > This is the normal mode for a copy or fragment of a file when > it is synchronized and consistent with the same file on other > nodes (for example with other replicas. It is the client > translator who decides if it is synchronized or not). > > 2. Healing mode > > This is the mode used when a client detects an inconsistency > in the copy or fragment of the file stored on this node and > initiates the healing procedures. > > 3. Provider mode (I don't like very much this name, though) > > This is the mode used by client translators when an > inconsistency is detected in this file, but the copy or > fragment stored in this node is considered good and it will be > used as a source to repair the contents of this file on other > nodes. > > Initially, when a file is created, it is set in normal mode. > Client translators that make changes must guarantee that they send > the modification requests in the same order to all the servers. > This should be done using inodelk/entrylk. > > When a change is sent to a server, the client must include a > bitmap mask of the clients to which the request is being sent. > Normally this is a bitmap containing all the clients, however, > when a server fails for some reason some bits will be cleared. The > heal translator uses this bitmap to early detect failures on other > nodes from the point of view of each client. When this condition > is detected, the request is aborted with an error and the client > is notified with the remaining list of valid nodes. If the client > considers the request can be successfully server with the > remaining list of nodes, it can resend the request with the > updated bitmap. > > The heal translator also updates two file attributes for each > change request to mantain the "version" of the data and metadata > contents of the file. A similar task is currently made by AFR > using xattrop. This would not be needed anymore, speeding write > requests. > > The version of data and metadata is returned to the client for > each read request, allowing it to detect inconsistent data. > > When a client detects an inconsistency, it initiates healing. > First of all, it must lock the entry and inode (when necessary). > Then, from the data collected from each node, it must decide which > nodes have good data and which ones have bad data and hence need > to be healed. There are two possible cases: > > 1. File is not a regular file > > In this case the reconstruction is very fast and requires few > requests, so it is done while the file is locked. In this > case, the heal translator does nothing relevant. > > 2. File is a regular file > > For regular files, the first step is to synchronize the > metadata to the bad nodes, including the version information. > Once this is done, the file is set in healing mode on bad > nodes, and provider mode on good nodes. Then the entry and > inode are unlocked. > > When a file is in provider mode, it works as in normal mode, but > refuses to start another healing. Only one client can be healing a > file. 
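The three per-file modes described above can be pictured as a small state machine. The following standalone sketch is an illustration only, with invented names and none of the locking or versioning the proposal relies on; it merely shows the "only one client can be healing a file" rule on the good-copy (provider) side.

/* per-file heal mode sketch - names are illustrative only */
#include <stdio.h>

typedef enum {
        HEAL_NORMAL = 0,   /* copy/fragment consistent with peers     */
        HEAL_HEALING,      /* local copy is bad and being repaired    */
        HEAL_PROVIDER,     /* local copy is good, serving as a source */
} heal_mode_t;

typedef struct {
        heal_mode_t mode;
        int         healer_id;   /* client currently repairing, or -1 */
} heal_state_t;

/* a normal (good) file accepts at most one healer at a time */
static int heal_begin(heal_state_t *st, int client_id)
{
        if (st->mode != HEAL_NORMAL && st->mode != HEAL_PROVIDER)
                return -1;              /* this copy is itself being healed */
        if (st->healer_id != -1)
                return -1;              /* someone is already healing       */
        st->mode      = HEAL_PROVIDER;  /* good copy: act as the source     */
        st->healer_id = client_id;
        return 0;
}

static void heal_end(heal_state_t *st)
{
        st->mode      = HEAL_NORMAL;
        st->healer_id = -1;
}

int main(void)
{
        heal_state_t st = { HEAL_NORMAL, -1 };

        printf("client 1: %s\n", heal_begin(&st, 1) ? "refused" : "healing");
        printf("client 2: %s\n", heal_begin(&st, 2) ? "refused" : "healing");
        heal_end(&st);
        return 0;
}

In the proposal the transition back to normal mode would of course be driven by the healing client finishing or abandoning the repair, together with the version and "good range" tracking described in the rest of the message, not by a bare heal_end() call.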
> > When a file is in healing mode, each normal write request from any > client are handled as if the file were in normal mode, updating > the version information and detecting possible inconsistencies > with the bitmap. Additionally, the healing translator marks the > written region of the file as "good". > > Each write request from the healing client intended to repair the > file must be marked with a special flag. In this case, the area > that wants to be written is filtered by the list of "good" ranges > (if there are any intersection with a good range, it is removed > from the request). The resulting set of ranges are propagated to > the lower translator and added to the list of "good" ranges but > the version information is not updated. > > Read requests are only served if the range requested is entirely > contained into the "good" regions list. > > There are some additional details, but I think this is enough to > have a general idea of its purpose and how it works. > > The main advantages of this translator are: > > 1. Avoid duplicated code in client translators > 2. Simplify and unify healing methods in client translators > 3. xattrop is not needed anymore in client translators to keep > track of changes > 4. Full file contents are repaired without locking the file > 5. Better detection and prevention of some split brain situations > as soon as possible > > I think it would be very useful. It seems to me that it works > correctly in all situations, however I don't have all the > experience that other developers have with the healing functions > of AFR, so I will be happy to answer any question or suggestion to > solve problems it may have or to improve it. > > What do you think about it ? > > > The goals you state above are all valid. What would really help > (adoption) is if you can implement this as a modification of AFR by > utilizing all the work already done, and you get brownie points if it > is backward compatible with existing AFR. If you already have any code > in a publishable state, please share it with us (github link?). > > Avati I've tried to understand how AFR works and, in some way, some of the ideas have been taken from it. However it is very complex and a lot of changes have been carried out in the master branch over the latest months. It's hard for me to follow them while actively working on my translator. Nevertheless, the main reason to take a separate path was that AFR is strongly bound to replication (at least from what I saw when I analyzed it more deeply. Maybe things have changed now, but haven't had time to review them). The requirements for my translator didn't fit very well with AFR, and the needed effort to understand and modify it to adapt it was too high. It also seems that there isn't any detailed developer info about internals of AFR that could have helped to be more confident to modify it (at least I haven't found it). I'm currenty working on it, but it's not ready yet. As soon as it is in a minimally stable state we will publish it, probably on github. I'll write the url to this list. Thank you -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From anand.avati at gmail.com Tue May 22 07:48:43 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 22 May 2012 00:48:43 -0700 Subject: [Gluster-devel] A healing translator In-Reply-To: <4FBB43D9.9070605@datalab.es> References: <4FA8E8AB.2040604@datalab.es> <4FBB43D9.9070605@datalab.es> Message-ID: > > > I've tried to understand how AFR works and, in some way, some of the > ideas have been taken from it. However it is very complex and a lot of > changes have been carried out in the master branch over the latest months. > It's hard for me to follow them while actively working on my translator. > Nevertheless, the main reason to take a separate path was that AFR is > strongly bound to replication (at least from what I saw when I analyzed it > more deeply. Maybe things have changed now, but haven't had time to review > them). > Have you reviewed the proactive self-heal daemon (+ changelog indexing translator) which is a potential functional replacement for what you might be attempting? Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 08:16:06 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 08:16:06 +0000 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: <20120522081606.GA3976@homeworld.netbsd.org> On Mon, May 21, 2012 at 10:44:30PM -0700, Anand Avati wrote: > Even in the case where bugc1.txt got a wrong uid returned (assuming so), it > should not influence the permissibility of it getting deleted. The > deletability of a file is based on the permissions on the parent directory > and not the ownership of the file (unless +t sticky bit was set on the > directory). This is interesting: I get the behavior you describe on Linux (ext2fs), but NetBSD (FFS) hehaves differently (these are native test, without glusterfs). Is it a grey area in standards? $ ls -la test/ total 16 drwxr-xr-x 2 root wheel 512 May 22 10:10 . drwxr-xr-x 19 manu wheel 5632 May 22 10:10 .. -rw-r--r-- 1 manu wheel 0 May 22 10:10 toto $ whoami manu $ rm -f test/toto rm: test/toto: Permission denied $ uname -sr NetBSD 5.1_STABLE -- Emmanuel Dreyfus manu at netbsd.org From rgowdapp at redhat.com Tue May 22 08:44:00 2012 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Tue, 22 May 2012 04:44:00 -0400 (EDT) Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <4FBB3C28.2020106@redhat.com> Message-ID: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > From: "Anand Avati" > To: "Raghavendra Gowdappa" > Cc: "Pranith Kumar Karampuri" , "Vijay Bellur" , "Amar Tumballi" > , "Krishnan Parthasarathi" , gluster-devel at nongnu.org > Sent: Tuesday, May 22, 2012 12:41:36 PM > Subject: Re: RFC on fix to bug #802414 > > > > The PARENT_DOWN_HANDLED approach will take us backwards from the > current > state where we are resiliant to frame losses and other class of bugs > (i.e, if a frame loss happens on either server or client, it only > results in prevented graph cleanup but the graph switch still > happens). > > The root "cause" here is that we are giving up on a very important > and > fundamental principle of immutability on the fd object. The real > solution here is to never modify fd->inode. Instead we must bring > about > a more native fd "migration" than just re-opening an existing fd on > the > new graph. > > Think of the inode migration analogy. 
The handle coming from FUSE > (the > address of the object) is a "hint". Usually the hint is right, if the > object in the address belongs to the latest graph. If not, using the > GFID we resolve a new inode on the latest graph and use it. > > In case of FD we can do something similar, except there are not GFIDs > (which should not be a problem). We need to make the handle coming > from > FUSE (the address of fd_t) just a hint. If the > fd->inode->table->xl->graph is the latest, then the hint was a HIT. > If > the graph was not the latest, we look for a previous migration > attempt+result in the "base" (original) fd's context. If that does > not > exist or is not fresh (on the latest graph) then we do a new fd > creation, open on new graph, fd_unref the old cached result in the fd > context of the "base fd" and keep ref to this new result. All this > must > happen from fuse_resolve_fd(). The setting of the latest fd and > updation > of the latest fd pointer happens under the scope of the > base_fd->lock() > which gives it a very clear and unambiguous scope which was missing > with > the old scheme. I remember discussing this solution during initial design. But, not sure why we dropped it. So, Can I go ahead with the implementation? Is this fix required post 3.3? > > [The next step will be to nuke the fd->inode swapping in > fuse_create_cbk] > > Avati > > On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote: > > > > > > ----- Original Message ----- > >> From: "Pranith Kumar Karampuri" > >> To: "Anand Avati" > >> Cc: "Vijay Bellur", "Amar > >> Tumballi", "Krishnan Parthasarathi" > >> , "Raghavendra Gowdappa" > >> Sent: Tuesday, May 22, 2012 8:42:58 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> Dude, > >> We have already put logs yesterday in LOCK and UNLOCK and saw > >> that the&fd->inode->lock address changed from LOCK to UNLOCK. > > > > Yes, even I too believe that the hang is because of fd->inode swap > > in fuse_migrate_fd and not the one in fuse_create_cbk. We could > > clearly see in the log files following race: > > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this > > was a naive fix - hold lock on inode in old graph - to the > > race-condition caused by swapping fd->inode, which didn't work) > > > > poll-thr: tries to acquire fd->inode->lock (inode is old_inode > > present in old-graph) in afr_local_cleanup > > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock > > poll-thr: gets woken up from lock call on old_inode->lock. > > poll-thr: does its work, but while unlocking, uses fd->inode where > > inode belongs to new graph. > > > > we had logs printing lock address before and after acquisition of > > lock and we could clearly see that lock address changed after > > acquiring lock in afr_local_cleanup. > > > >> > >>>> "The hang in fuse_migrate_fd is _before_ the inode swap > >>>> performed > >>>> there." > >> All the fds are opened on the same file. So all fds in the fd > >> migration point to same inode. The race is hit by nth fd, (n+1)th > >> fd > >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and > >> LOCK(fd->inode->lock) was done with one address then by the time > >> UNLOCK(fd->inode->lock) is done the address changed. So the next > >> fd > >> that has to migrate hung because the prev inode lock is not > >> unlocked. > >> > >> If after nth fd introduces the race a _cbk comes in epoll thread > >> on > >> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will > >> hang. 
> >> Which is my theory for the hang we observed on Saturday. > >> > >> Pranith. > >> ----- Original Message ----- > >> From: "Anand Avati" > >> To: "Raghavendra Gowdappa" > >> Cc: "Vijay Bellur", "Amar Tumballi" > >> , "Krishnan Parthasarathi" > >> , "Pranith Kumar Karampuri" > >> > >> Sent: Tuesday, May 22, 2012 2:09:33 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote: > >>> Avati, > >>> > >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new > >>> inode to fd, once it looks up inode in new graph. But this > >>> assignment can race with code that accesses fd->inode->lock > >>> executing in poll-thread (pthr) as follows > >>> > >>> pthr: LOCK (fd->inode->lock); (inode in old graph) > >>> rdthr: fd->inode = inode (resolved in new graph) > >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph) > >>> > >> > >> The way I see it (the backtrace output in the other mail), the > >> swap > >> happening in fuse_create_cbk() must be the one causing lock/unlock > >> to > >> land on different inode objects. The hang in fuse_migrate_fd is > >> _before_ > >> the inode swap performed there. Can you put some logs in > >> fuse_create_cbk()'s inode swap code and confirm this? > >> > >> > >>> Now, any lock operations on inode in old graph will block. Thanks > >>> to pranith for pointing to this race-condition. > >>> > >>> The problem here is we don't have a single lock that can > >>> synchronize assignment "fd->inode = inode" and other locking > >>> attempts on fd->inode->lock. So, we are thinking that instead of > >>> trying to synchronize, eliminate the parallel accesses > >>> altogether. > >>> This can be done by splitting fd migration into two tasks. > >>> > >>> 1. Actions on old graph (like fsync to flush writes to disk) > >>> 2. Actions in new graph (lookup, open) > >>> > >>> We can send PARENT_DOWN when, > >>> 1. Task 1 is complete. > >>> 2. No fop sent by fuse is pending. > >>> > >>> on receiving PARENT_DOWN, protocol/client will shutdown > >>> transports. > >>> As part of transport cleanup, all pending frames are unwound and > >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED > >>> event. Each of the translator will pass this event to its parents > >>> once it is convinced that there are no pending fops started by it > >>> (like background self-heal, reads as part of read-ahead etc). > >>> Once > >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there > >>> will be no replies that will be racing with migration (note that > >>> migration is done using syncops). At this point in time, it is > >>> safe to start Task 2 (which associates fd with an inode in new > >>> graph). > >>> > >>> Also note that reader thread will not do other operations till it > >>> completes both tasks. > >>> > >>> As far as the implementation of this patch goes, major work is in > >>> translators like read-ahead, afr, dht to provide the guarantee > >>> required to send PARENT_DOWN_HANDLED event to their parents. > >>> > >>> Please let me know your thoughts on this. > >>> > >> > >> All the above steps might not apply if it is caused by the swap in > >> fuse_create_cbk(). Let's confirm that first. 
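To make the race being discussed concrete, here is a minimal standalone
sketch of the same pattern. Every name in it is invented and it is not
glusterfs code; it only mimics the LOCK / pointer-swap / UNLOCK
interleaving described above. Compile with "cc -o swap-race swap-race.c
-lpthread"; the program deliberately hangs at the last lock, which is the
analogue of the next fd migration (or the epoll thread) blocking forever.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct fake_inode { pthread_mutex_t lock; };
struct fake_fd    { struct fake_inode *inode; };

static struct fake_inode old_inode = { PTHREAD_MUTEX_INITIALIZER };
static struct fake_inode new_inode = { PTHREAD_MUTEX_INITIALIZER };
static struct fake_fd    fd        = { &old_inode };

static void *
poll_thread (void *arg)
{
        (void) arg;

        pthread_mutex_lock (&fd.inode->lock);   /* locks old_inode.lock */
        sleep (1);                              /* the swap happens now */
        /* fd.inode now points at new_inode: this unlocks a mutex the
         * thread never locked (undefined behaviour in itself), and
         * old_inode.lock is left held forever. */
        pthread_mutex_unlock (&fd.inode->lock);
        return NULL;
}

int
main (void)
{
        pthread_t t;

        pthread_create (&t, NULL, poll_thread, NULL);
        usleep (100000);                /* let poll_thread take the lock  */
        fd.inode = &new_inode;          /* the fuse_migrate_fd-style swap */
        pthread_join (t, NULL);

        printf ("taking old_inode.lock again...\n");
        pthread_mutex_lock (&old_inode.lock);   /* hangs here */
        printf ("never reached\n");
        return 0;
}

The point of the sketch is only that swapping the pointer between the lock
and the unlock makes the two calls land on different objects; any scheme
that keeps the lock/unlock pair on one immutable object (such as the "base
fd" holding a reference to a per-graph fd, as proposed above) avoids the
problem by construction.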
> >> > >> Avati > >> > > From xhernandez at datalab.es Tue May 22 08:51:22 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 22 May 2012 10:51:22 +0200 Subject: [Gluster-devel] A healing translator In-Reply-To: References: <4FA8E8AB.2040604@datalab.es> <4FBB43D9.9070605@datalab.es> Message-ID: <4FBB538A.70201@datalab.es> On 05/22/2012 09:48 AM, Anand Avati wrote: > >> > I've tried to understand how AFR works and, in some way, some of > the ideas have been taken from it. However it is very complex and > a lot of changes have been carried out in the master branch over > the latest months. It's hard for me to follow them while actively > working on my translator. Nevertheless, the main reason to take a > separate path was that AFR is strongly bound to replication (at > least from what I saw when I analyzed it more deeply. Maybe things > have changed now, but haven't had time to review them). > > > Have you reviewed the proactive self-heal daemon (+ changelog indexing > translator) which is a potential functional replacement for what you > might be attempting? > > Avati I must admit that I've read something about it but I haven't had time to explore it in detail. If I understand it correctly, the self-heal daemon works as a client process but can be executed on server nodes. I suppose that multiple self-heal daemons can be running on different nodes. Then, each daemon detects invalid files (not sure exactly how) and replicates the changes from one good node to the bad nodes. The problem is that in the translator I'm working on, the information is dispersed among multiple nodes, so there isn't a single server node that contains the whole data. To repair a node, data must be read from at least two other nodes (it depends on configuration). From what I've read from AFR and the self-healing daemon, it's not straightforward to adapt them to this mechanism because they would need to know a subset of nodes with consistent data, not only one. Each daemon would have to contact all other nodes, read data from each one, determine which ones are valid, rebuild the data and send it to the bad nodes. This means that the daemon will have to be as complex as the clients. My impression (but I may be wrong) is that AFR and the self-healing daemon are closely bound to the replication schema, so it is very hard to try to use them for other purposes. The healing translator I'm writing tries to offer generic server side helpers for the healing process, but it is the client side who really manages the healing operation (though heavily simplified) and could use it to replicate data, to disperse data, or some other schema. Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 09:08:48 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 09:08:48 +0000 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: <20120522090848.GC3976@homeworld.netbsd.org> On Mon, May 21, 2012 at 10:44:30PM -0700, Anand Avati wrote: > Is there a way you can extend the trace code above to show the UIDs getting > returned? Maybe it was the parent directory (subdir) that got a wrong UID > returned? Further investigation shows you are right. 
I traced the struct fuse_entry_out returned by glusterfs on LOOKUP; "/subdir", uid = 500, gid = 500, mode = 040755, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 bugc1.txt is looked up many times as I loop creating/deleting it subdir is not looked up often since it is cached for 1 second. New subdir lookups will return correct uid/gid/mode. After some time, though, it will return incorrect information: "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir", uid = 0, gid = 0, mode = 040700, attr_valid = 1 -- Emmanuel Dreyfus manu at netbsd.org From aavati at redhat.com Tue May 22 17:47:49 2012 From: aavati at redhat.com (Anand Avati) Date: Tue, 22 May 2012 10:47:49 -0700 Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> References: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FBBD145.3030303@redhat.com> On 05/22/2012 01:44 AM, Raghavendra Gowdappa wrote: > > > ----- Original Message ----- >> From: "Anand Avati" >> To: "Raghavendra Gowdappa" >> Cc: "Pranith Kumar Karampuri", "Vijay Bellur", "Amar Tumballi" >> , "Krishnan Parthasarathi", gluster-devel at nongnu.org >> Sent: Tuesday, May 22, 2012 12:41:36 PM >> Subject: Re: RFC on fix to bug #802414 >> >> >> >> The PARENT_DOWN_HANDLED approach will take us backwards from the >> current >> state where we are resiliant to frame losses and other class of bugs >> (i.e, if a frame loss happens on either server or client, it only >> results in prevented graph cleanup but the graph switch still >> happens). >> >> The root "cause" here is that we are giving up on a very important >> and >> fundamental principle of immutability on the fd object. The real >> solution here is to never modify fd->inode. Instead we must bring >> about >> a more native fd "migration" than just re-opening an existing fd on >> the >> new graph. >> >> Think of the inode migration analogy. The handle coming from FUSE >> (the >> address of the object) is a "hint". Usually the hint is right, if the >> object in the address belongs to the latest graph. If not, using the >> GFID we resolve a new inode on the latest graph and use it. >> >> In case of FD we can do something similar, except there are not GFIDs >> (which should not be a problem). We need to make the handle coming >> from >> FUSE (the address of fd_t) just a hint. If the >> fd->inode->table->xl->graph is the latest, then the hint was a HIT. >> If >> the graph was not the latest, we look for a previous migration >> attempt+result in the "base" (original) fd's context. If that does >> not >> exist or is not fresh (on the latest graph) then we do a new fd >> creation, open on new graph, fd_unref the old cached result in the fd >> context of the "base fd" and keep ref to this new result. All this >> must >> happen from fuse_resolve_fd(). The setting of the latest fd and >> updation >> of the latest fd pointer happens under the scope of the >> base_fd->lock() >> which gives it a very clear and unambiguous scope which was missing >> with >> the old scheme. > > I remember discussing this solution during initial design. But, not sure why we dropped it. So, Can I go ahead with the implementation? Is this fix required post 3.3? 
> The solution you are probably referring to was dropped because there we were talking about chaining FDs to the one on the "next graph" as graphs keep getting changed. The one described above is different because here there will one base fd (the original one on which open() by fuse was performed) and new graphs result in creation of an internal new fd directly referred by the base fd (and naturally unref the previous "new fd") thereby keeping things quite trim. Avati From anand.avati at gmail.com Tue May 22 20:09:52 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 22 May 2012 13:09:52 -0700 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: <20120522090848.GC3976@homeworld.netbsd.org> References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> <20120522090848.GC3976@homeworld.netbsd.org> Message-ID: On Tue, May 22, 2012 at 2:08 AM, Emmanuel Dreyfus wrote: > > Further investigation shows you are right. I traced the > struct fuse_entry_out returned by glusterfs on LOOKUP; > > "/subdir", uid = 500, gid = 500, mode = 040755, attr_valid = 1 > ... > "/subdir", uid = 0, gid = 0, mode = 040700, attr_valid = 1 > Note that even mode has changed, not just the uid/gid. It will probably help if you can put a breakpoint in this case and inspect the stack about where these attribute fields are fetched from (some cache? from posix?) Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Wed May 23 02:04:25 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 23 May 2012 04:04:25 +0200 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: Message-ID: <1kkj4ca.1knxmw01kr7wlgM%manu@netbsd.org> Anand Avati wrote: > Note that even mode has changed, not just the uid/gid. It will probably > help if you can put a breakpoint in this case and inspect the stack about > where these attribute fields are fetched from (some cache? from posix?) My tests shows that the garbage is introduced by mdc_inode_iatt_get() in mdc_lookup(). -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vijay at build.gluster.com Wed May 23 13:57:15 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Wed, 23 May 2012 06:57:15 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa43 released Message-ID: <20120523135718.0E6111008C@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa43/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz This release is made off v3.3.0qa43 From manu at netbsd.org Wed May 23 16:58:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 23 May 2012 16:58:02 +0000 Subject: [Gluster-devel] preparent and postparent? Message-ID: <20120523165802.GC17268@homeworld.netbsd.org> Hi in the protocol/server xlator, there are many occurences where callbacks have a struct iatt for preparent and postparent. What are these for? Is it a normal behavior to have different things in preparent and postparent? -- Emmanuel Dreyfus manu at netbsd.org From jdarcy at redhat.com Wed May 23 17:03:41 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Wed, 23 May 2012 13:03:41 -0400 Subject: [Gluster-devel] preparent and postparent? 
In-Reply-To: <20120523165802.GC17268@homeworld.netbsd.org> References: <20120523165802.GC17268@homeworld.netbsd.org> Message-ID: <20120523130341.1ee693a3@jdarcy-dt.usersys.redhat.com> On Wed, 23 May 2012 16:58:02 +0000 Emmanuel Dreyfus wrote: > in the protocol/server xlator, there are many occurences where > callbacks have a struct iatt for preparent and postparent. What are > these for? NFS needs them to support its style of caching. From manu at netbsd.org Thu May 24 01:31:18 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 24 May 2012 03:31:18 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <20120523130341.1ee693a3@jdarcy-dt.usersys.redhat.com> Message-ID: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> Jeff Darcy wrote: > > in the protocol/server xlator, there are many occurences where > > callbacks have a struct iatt for preparent and postparent. What are > > these for? > > NFS needs them to support its style of caching. Let me rephrase: what information is stored in preparent and postparent? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Thu May 24 04:29:39 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 24 May 2012 06:29:39 +0200 Subject: [Gluster-devel] gerrit Message-ID: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> Hi In gerrit, if I sign it and look at the Download field in a patchset, I see this: git fetch ssh://manu@/glusterfs refs/changes/13/3413/2 && git format-patch -1 --stdout FETCH_HEAD It would be nice if I had ssh://manu at git.gluster.com/glusterfs.git so that the line can be copy/pasted without the need to edit each time. Is it something I need to configure (where?), or is it a global setting beyond my reach (in that case, please someone fix it!) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Thu May 24 06:30:20 2012 From: anand.avati at gmail.com (Anand Avati) Date: Wed, 23 May 2012 23:30:20 -0700 Subject: [Gluster-devel] gerrit In-Reply-To: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> References: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> Message-ID: fixed! On Wed, May 23, 2012 at 9:29 PM, Emmanuel Dreyfus wrote: > Hi > > In gerrit, if I sign it and look at the Download field in a patchset, I > see this: > > git fetch ssh://manu@/glusterfs refs/changes/13/3413/2 && git > format-patch -1 --stdout FETCH_HEAD > > It would be nice if I had ssh://manu at git.gluster.com/glusterfs.git > so that the line can be copy/pasted without the need to edit each time. > Is it something I need to configure (where?), or is it a global setting > beyond my reach (in that case, please someone fix it!) > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at datalab.es Thu May 24 07:10:59 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Thu, 24 May 2012 09:10:59 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> Message-ID: <4FBDDF03.8080203@datalab.es> On 05/24/2012 03:31 AM, Emmanuel Dreyfus wrote: > Jeff Darcy wrote: > >>> in the protocol/server xlator, there are many occurences where >>> callbacks have a struct iatt for preparent and postparent. 
What are >>> these for? >> NFS needs them to support its style of caching. > Let me rephrase: what information is stored in preparent and postparent? preparent and postparent have the attributes (modification time, size, permissions, ...) of the parent directory of the file being modified before and after the modification is done. Xavi From jdarcy at redhat.com Thu May 24 13:05:08 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 24 May 2012 09:05:08 -0400 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <4FBDDF03.8080203@datalab.es> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> <4FBDDF03.8080203@datalab.es> Message-ID: <4FBE3204.7050005@redhat.com> On 05/24/2012 03:10 AM, Xavier Hernandez wrote: > preparent and postparent have the attributes (modification time, size, > permissions, ...) of the parent directory of the file being modified > before and after the modification is done. Thank you, Xavi. :) If you really want to have some fun, you can take a look at the rename callback, which has pre- and post-attributes for both the old and new parent. From johnmark at redhat.com Thu May 24 19:21:22 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 24 May 2012 15:21:22 -0400 (EDT) Subject: [Gluster-devel] glusterfs-3.3.0qa43 released In-Reply-To: <20120523135718.0E6111008C@build.gluster.com> Message-ID: <7c8ea685-d794-451e-820a-25f784e7873d@zmail01.collab.prod.int.phx2.redhat.com> A reminder: As we come down to the final days, it is vitally important that we test these last few qa releases. This one, in particular, contains fixes added to the 3.3 branch after beta 4 was release last week: http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz Please consider using the testing page when evaluating: http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests Also, if someone would like to test the object storage as well as the HDFS piece, please report here, or create another test page on the wiki. Finally, you can track all commits to the master and 3.3 branches on Twitter (@glusterdev) ...and via Atom/Rss - https://github.com/gluster/glusterfs/commits/release-3.3.atom https://github.com/gluster/glusterfs/commits/master.atom -JM ----- Original Message ----- > > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa43/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz > > This release is made off v3.3.0qa43 > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From xhernandez at datalab.es Fri May 25 07:28:43 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Fri, 25 May 2012 09:28:43 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <4FBE3204.7050005@redhat.com> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> <4FBDDF03.8080203@datalab.es> <4FBE3204.7050005@redhat.com> Message-ID: <4FBF34AB.6070606@datalab.es> On 05/24/2012 03:05 PM, Jeff Darcy wrote: > On 05/24/2012 03:10 AM, Xavier Hernandez wrote: >> preparent and postparent have the attributes (modification time, size, >> permissions, ...) of the parent directory of the file being modified >> before and after the modification is done. > Thank you, Xavi. :) If you really want to have some fun, you can take a look > at the rename callback, which has pre- and post-attributes for both the old and > new parent. Yes, I've had some "fun" with them. 
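To expand a little on the NFS angle mentioned above: NFSv3 replies carry
"weak cache consistency" data, i.e. the parent directory's attributes from
just before and just after the operation, so a client can tell whether the
directory changed behind its back without an extra GETATTR round trip.
That is the kind of consumer the preparent/postparent iatts have. A small
standalone sketch of the client-side decision (struct and field names are
invented; this is not the gluster NFS code):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct wcc_attr { time_t mtime; long size; };

/* The directory did not change behind our back iff the server's pre-op
 * view matches what we had cached. */
static bool
cache_still_valid (const struct wcc_attr *cached, const struct wcc_attr *pre)
{
        return cached->mtime == pre->mtime && cached->size == pre->size;
}

int
main (void)
{
        struct wcc_attr cached = { 1000, 512 };   /* what the client cached */
        struct wcc_attr pre    = { 1000, 512 };   /* preparent from a reply  */
        struct wcc_attr post   = { 1001, 1024 };  /* postparent from a reply */

        if (cache_still_valid (&cached, &pre))
                cached = post;    /* cheap update, no invalidation needed */
        else
                printf ("someone else touched the directory: invalidate\n");

        printf ("cached parent mtime is now %ld\n", (long) cached.mtime);
        return 0;
}

If the pre-op attributes match, the client can adopt the post-op attributes
directly; if not, it has to drop what it cached for that directory. That is
why every entry-modifying callback hauls these extra iatts around.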
Without them almost all callbacks would seem too short to me now... hehehe From fernando.frediani at qubenet.net Fri May 25 09:44:10 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 09:44:10 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> Hi, I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 and the new type of volume striped + replicated. My go is to use it to run Virtual Machines (.vmdk files). Volume is created fine and the ESXi server mountw the Datastore using Gluster built-in NFS, however when trying to use the Datastore or even read, it hangs. Looking at the Gluster NFS logs I see: "[socket.c:195:__socket_rwv] 0-socket.nfs-server: readv failed (Connection reset by peer)" In order to get the rpm files installed I had first to install these two because of the some libraries: "compat-readline5-5.2-17.1.el6.x86_64".rpm and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has anything to do with that. Has anyone ever used Gluster as a backend storage for ESXi ? Does it actually work ? Regards, Fernando Frediani Lead Systems Engineer Qube Managed Services Limited 260-266 Goswell Road, London, EC1V 7EB, United Kingdom sales: +44 (0) 20 7150 3800 ddi: +44 (0) 20 7150 3803 fax: +44 (0) 20 7336 8420 web: http://www.qubenet.net/ P Please consider the environment before printing this email -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.frediani at qubenet.net Fri May 25 11:36:55 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 11:36:55 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Actually, even on another Linux machine mounting NFS has the same behaviour. I am able to mount it with "mount -t nfs ..." but when I try "ls" it hangs as well. One particular thing of the Gluster servers is that they have two networks, one for management with default gateway and another only for storage. I am only able to mount on the storage network. The hosts file has all nodes' names with the ips on the storage network. I tried to use this but didn't work either. 
gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* Watching the nfs logs when I try a "ls" from the remote client it shows: pending frames: patchset: git://git.gluster.com/glusterfs.git signal received: 11 time of crash: 2012-05-25 11:38:09 configuration details: argp 1 backtrace 1 dlfcn 1 fdatasync 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 3.3.0beta4 /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup+0xa5)[0x7f1c8e7a6ac5] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdirp_cbk+0x536)[0x7f1c8e543346] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdirp_cbk+0x1ca)[0x7f1c8e76269a] /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_readdirp_cbk+0x170)[0x7f1c8e9dbbe0] /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x7f1c8f818c8f] /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_handler+0x188)[0x7f1c8f818e38] /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] /usr/sbin/glusterfs(main+0x502)[0x406612] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] /usr/sbin/glusterfs[0x404399] Thanks Fernando From: Fernando Frediani (Qube) Sent: 25 May 2012 10:44 To: 'gluster-devel at nongnu.org' Subject: Can't use NFS with VMware ESXi Hi, I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 and the new type of volume striped + replicated. My go is to use it to run Virtual Machines (.vmdk files). Volume is created fine and the ESXi server mountw the Datastore using Gluster built-in NFS, however when trying to use the Datastore or even read, it hangs. Looking at the Gluster NFS logs I see: "[socket.c:195:__socket_rwv] 0-socket.nfs-server: readv failed (Connection reset by peer)" In order to get the rpm files installed I had first to install these two because of the some libraries: "compat-readline5-5.2-17.1.el6.x86_64".rpm and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has anything to do with that. Has anyone ever used Gluster as a backend storage for ESXi ? Does it actually work ? Regards, Fernando Frediani Lead Systems Engineer Qube Managed Services Limited 260-266 Goswell Road, London, EC1V 7EB, United Kingdom sales: +44 (0) 20 7150 3800 ddi: +44 (0) 20 7150 3803 fax: +44 (0) 20 7336 8420 web: http://www.qubenet.net/ P Please consider the environment before printing this email -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Fri May 25 13:35:19 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 25 May 2012 13:35:19 +0000 Subject: [Gluster-devel] mismatching ino/dev between file Message-ID: <20120525133519.GC19383@homeworld.netbsd.org> Hi Here is a bug with release-3.3. It happens on a 2 way replicated. 
Here is what I have in one brick: [2012-05-25 15:03:09.463446] W [posix-handle.c:487:posix_handle_hard] 0-pfs-posix: mismatching ino/dev between file /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (57943061/16) and handle /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce (57943060/16) [2012-05-25 15:03:09.463552] E [posix.c:1277:posix_symlink] 0-pfs-posix: setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 failed On the other one: [2012-05-25 15:03:09.447682] W [posix-handle.c:487:posix_handle_hard] 0-pfs-posix: mismatching ino/dev between file /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (50557989/24) and handle /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce (50557988/24) [2012-05-25 15:03:09.447774] E [posix.c:1277:posix_symlink] 0-pfs-posix: setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 failed Someone can give me a hint of what happens, and how to track it down? -- Emmanuel Dreyfus manu at netbsd.org From abperiasamy at gmail.com Fri May 25 17:09:09 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Fri, 25 May 2012 10:09:09 -0700 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with ?mount ?t nfs ?? but when I try ?ls? it hangs as > well. > > One particular thing of the Gluster servers is that they have two networks, > one for management with default gateway and another only for storage. I am > only able to mount on the storage network. > > The hosts file has all nodes? names with the ips on the storage network. > > > > I tried to use this but didn?t work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a ?ls? 
from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup+0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdirp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdirp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_readdirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_handler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I?ve setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 > and the new type of volume striped + replicated. My go is to use it to run > Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or even > read, it hangs. > > > > Looking at the Gluster NFS logs I see: ????[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)? > > > > In order to get the rpm files installed I had first to install these two > because of the some libraries: ?compat-readline5-5.2-17.1.el6.x86_64?.rpm > and ?openssl098e-0.9.8e-17.el6.centos.x86_64.rpm?.Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From pmatthaei at debian.org Fri May 25 18:56:37 2012 From: pmatthaei at debian.org (=?ISO-8859-1?Q?Patrick_Matth=E4i?=) Date: Fri, 25 May 2012 20:56:37 +0200 Subject: [Gluster-devel] glusterfs-3.2.7qa1 released In-Reply-To: <20120412172933.6A2A8102E6@build.gluster.com> References: <20120412172933.6A2A8102E6@build.gluster.com> Message-ID: <4FBFD5E5.1060901@debian.org> Am 12.04.2012 19:29, schrieb Vijay Bellur: > > http://bits.gluster.com/pub/gluster/glusterfs/3.2.7qa1/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.2.7qa1.tar.gz > > This release is made off v3.2.7qa1 Hey, I have tested this qa release and could not find any regression/problem. It would be realy nice to have a 3.2.7 release in the next days (max 2 weeks from now on) so that we could ship glusterfs 3.2.7 instead of 3.2.6 with our next release Debian Wheezy! -- /* Mit freundlichem Gru? / With kind regards, Patrick Matth?i GNU/Linux Debian Developer E-Mail: pmatthaei at debian.org patrick at linux-dev.org */ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From fernando.frediani at qubenet.net Fri May 25 19:33:37 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 19:33:37 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. 
> > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From fernando.frediani at qubenet.net Fri May 25 20:32:25 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 20:32:25 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. 
I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From manu at netbsd.org Sat May 26 05:37:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 07:37:51 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate Message-ID: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> here is a bug in release-3.3: ./xinstall -c -p -r -m 555 xinstall /pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/i386--netbsdelf-instal xinstall: /pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/inst.00033a: chmod: Permission denied Kernel trace, client side: 33 1 xinstall CALL open(0xbfbfd8e0,0xa02,0x180) 33 1 xinstall NAMI "/pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i38 6/bin/inst.00033a" 33 1 xinstall RET open 3 33 1 xinstall CALL open(0x (...) 33 1 xinstall CALL fchmod(3,0x16d) 33 1 xinstall RET fchmod -1 errno 13 Permission denied I tracked this down to posix_acl_truncate() on the server, where loc->inode and loc->pah are NULL. This code goes red and raise EACCESS: if (acl_permits (frame, loc->inode, POSIX_ACL_WRITE)) goto green; else goto red; Here is the relevant baccktrace: #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at posix.c:204 #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at defaults.c:47 #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, loc=0xba60091c, xdata=0x0) at posix.c:231 In frame 12, loc->inode is not NULL, and loc->path makes sense: "/netbsd/usr/src/tooldir.NetBSD-6.9 9.4-i386/bin/inst.01911a" In frame 10, loc->path and loc->inode are NULL. In note that xlators/features/locks/src/posix.c:pl_ftruncate() sets truncate_stat_cbk() as the callback, and not ftruncate_stat_cbk(). That later function does not even exist. f-style functions not calling f-style callbacks have been the root of various bugs so far, is it one more of them? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vbellur at redhat.com Sat May 26 07:44:52 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Sat, 26 May 2012 13:14:52 +0530 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> References: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> Message-ID: <4FC089F4.3070004@redhat.com> On 05/26/2012 11:07 AM, Emmanuel Dreyfus wrote: > here is a bug in release-3.3: > > > I tracked this down to posix_acl_truncate() on the server, where loc->inode > and loc->pah are NULL. 
This code goes red and raise EACCESS: > > if (acl_permits (frame, loc->inode, POSIX_ACL_WRITE)) > goto green; > else > goto red; > > Here is the relevant baccktrace: > > #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, > loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 > #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, > this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) > at posix.c:204 > #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, > this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) > at defaults.c:47 > #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, > loc=0xba60091c, xdata=0x0) at posix.c:231 > > In frame 12, loc->inode is not NULL, and loc->path makes sense: > "/netbsd/usr/src/tooldir.NetBSD-6.9 > 9.4-i386/bin/inst.01911a" > > In frame 10, loc->path and loc->inode are NULL. > > In note that xlators/features/locks/src/posix.c:pl_ftruncate() sets > truncate_stat_cbk() as the callback, and not ftruncate_stat_cbk(). That later > function does not even exist. f-style functions not calling f-style callbacks > have been the root of various bugs so far, is it one more of them? I don't think it is a f-style problem. I do not get a EPERM with the testcase that you posted for qa39. Can you please provide a bigger bt? Thanks, Vijay > > From manu at netbsd.org Sat May 26 09:00:22 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 11:00:22 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkp7w9.1a5c4mz1tiqw8rM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. Can you please provide a bigger bt? 
#3 0xb99414c4 in server_truncate_cbk (frame=0xba901714, cookie=0xbb77f010, this=0xb9d27000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at server3_1-fops.c:1218 #4 0xb9968bd6 in io_stats_truncate_cbk (frame=0xbb77f010, cookie=0xbb77f080, this=0xb9d26000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at io-stats.c:1600 #5 0xb998036e in marker_truncate_cbk (frame=0xbb77f080, cookie=0xbb77f0f0, this=0xb9d25000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at marker.c:1535 #6 0xbbb87a85 in default_truncate_cbk (frame=0xbb77f0f0, cookie=0xbb77f160, this=0xb9d24000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at defaults.c:58 #7 0xb99a8fa2 in iot_truncate_cbk (frame=0xbb77f160, cookie=0xbb77f400, this=0xb9d23000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at io-threads.c:1270 #8 0xb99b9fe0 in pl_truncate_cbk (frame=0xbb77f400, cookie=0xbb77f780, this=0xb9d22000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at posix.c:119 #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at posix.c:204 #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at defaults.c:47 #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, loc=0xba60091c, xdata=0x0) at posix.c:231 #13 0xbbb94d76 in default_stat (frame=0xbb77f6a0, this=0xb9d20000, loc=0xba60091c, xdata=0x0) at defaults.c:1231 #14 0xb99babb0 in pl_truncate (frame=0xbb77f400, this=0xb9d22000, loc=0xba60091c, offset=48933, xdata=0x0) at posix.c:249 #15 0xb99a91ac in iot_truncate_wrapper (frame=0xbb77f160, this=0xb9d23000, loc=0xba60091c, offset=48933, xdata=0x0) at io-threads.c:1280 #16 0xbbba76d8 in call_resume_wind (stub=0xba6008fc) at call-stub.c:2474 #17 0xbbbae729 in call_resume (stub=0xba6008fc) at call-stub.c:4151 #18 0xb99a22a3 in iot_worker (data=0xb9d12110) at io-threads.c:131 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 11:51:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 13:51:46 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkpd53.bn09pz1v8qmwtM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. I wonder if the bug can occur because some mess in the .glusterfs directory cause by an earlier problem. Is it possible? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 12:55:08 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 14:55:08 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkpd53.bn09pz1v8qmwtM%manu@netbsd.org> Message-ID: <1kkpirc.geu5yvq0165fM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I wonder if the bug can occur because some mess in the .glusterfs > directory cause by an earlier problem. Is it possible? That is not the problem: I nuked .glusterfs on all bricks and the problem remain. 
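As an aside, the reason the symptom is "Permission denied" rather than a
crash is simply that the permission check fails closed when it is handed no
inode at all. A toy standalone sketch of that shape (it only mimics the
if/else quoted earlier in this thread; it is not the posix-acl source, and
the mode handling is deliberately simplified):

#include <stdio.h>

struct fake_inode { unsigned mode; unsigned uid; };

static int
permits (const struct fake_inode *inode, unsigned want, unsigned uid)
{
        if (!inode)
                return 0;       /* nothing to evaluate: fail closed */
        if (inode->uid == uid)
                return (inode->mode >> 6) & want;   /* owner bits   */
        return inode->mode & want;                  /* "other" bits */
}

int
main (void)
{
        struct fake_inode ino = { 0600, 500 };

        /* normal case: the owner asks for write (2) and gets it */
        printf ("with inode: %s\n", permits (&ino, 2, 500) ? "ok" : "EACCES");

        /* the bug's shape: the loc carried no inode down the stack */
        printf ("NULL inode: %s\n", permits (NULL, 2, 500) ? "ok" : "EACCES");
        return 0;
}

So whenever a loc reaches the ACL check with inode == NULL, the caller sees
EACCES no matter what the on-disk permissions are, which matches the
fchmod failure being chased here.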
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 14:20:10 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 16:20:10 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkpmmr.rrgubdjz6w9fM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. Can you please provide a bigger bt? Here is a minimal test case that reproduces the problem at mine. Run it as un unprivilegied user in a directory you on which you have write access: $ pwd /pfs/manu/xinstall $ ls -ld . drwxr-xr-x 4 manu manu 512 May 26 16:17 . $ id uid=500(manu) gid=500(manu) groups=500(manu),0(wheel) $ ./test test: fchmod failed: Permission denied #include #include #include #include #include #include #include #define TESTFILE "testfile" int main(void) { int fd; char buf[16384]; if ((unlink(TESTFILE) == -1) && (errno != ENOENT)) err(EX_OSERR, "unlink failed"); if ((fd = open(TESTFILE, O_CREAT|O_EXCL|O_RDWR, 0600)) == -1) err(EX_OSERR, "open failed"); if (write(fd, buf, sizeof(buf)) != sizeof(buf)) err(EX_OSERR, "write failed"); if (fchmod(fd, 0555) == -1) err(EX_OSERR, "fchmod failed"); if (close(fd) == -1) err(EX_OSERR, "close failed"); return EX_OK; } -- Emmanuel Dreyfus http://hcpnet.free.fr/pubzx@ manu at netbsd.org From manu at netbsd.org Sun May 27 05:17:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 27 May 2012 07:17:51 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> Message-ID: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org> Emmanuel Dreyfus wrote: > In frame 10, loc->path and loc->inode are NULL. Here is the investigation so far: xlators/features/locks/src/posix.c:truncate_stat_cbk() has a NULL loc->inode, and this leads to the acl check that fails. As I understand this is a FUSE implentation problem. fchmod() produces a FUSE SETATTR. If the file is being written, NetBSD FUSE will set mode, size, atime, mtime, and fh in this operation. I suspect Linux FUSE only sets mode and fh and this is why the bug does not appear on Linux: the truncate code path is probably not involved. Can someone confirm? If this is the case, it suggests the code path may have never been tested. I suspect there are bugs there, for instance, in pl_truncate_cbk, local is erased after being retreived, which does not look right: local = frame->local; local = mem_get0 (this->local_pool); if (local->op == TRUNCATE) loc_wipe (&local->loc); I tried fixing that one without much improvments. There may be other problems. About fchmod() setting size: is it a reasonable behavior? FUSE does not specify what must happens, so if glusterfs rely on the Linux kernel not doing it may be begging for future bugs if that behavior change. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vbellur at redhat.com Sun May 27 06:54:43 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Sun, 27 May 2012 12:24:43 +0530 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org> References: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org> Message-ID: <4FC1CFB3.7050808@redhat.com> On 05/27/2012 10:47 AM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > >> In frame 10, loc->path and loc->inode are NULL. > > > As I understand this is a FUSE implentation problem. fchmod() produces a > FUSE SETATTR. 
If the file is being written, NetBSD FUSE will set mode, > size, atime, mtime, and fh in this operation. I suspect Linux FUSE only > sets mode and fh and this is why the bug does not appear on Linux: the > truncate code path is probably not involved. For the testcase that you sent out, I see fsi->valid being set to 1 which indicates only mode on Linux. The truncate path does not get involved. I modified the testcase to send ftruncate/truncate and it completed successfully. > > > Can someone confirm? If this is the case, it suggests the code path may > have never been tested. I suspect there are bugs there, for instance, in > pl_truncate_cbk, local is erased after being retreived, which does not > look right: > > local = frame->local; > > local = mem_get0 (this->local_pool); I don't see this in pl_truncate_cbk(). mem_get0 is done only in pl_truncate(). A code inspection in pl_(f)truncate did not raise any suspicions to me. > > > About fchmod() setting size: is it a reasonable behavior? FUSE does not > specify what must happens, so if glusterfs rely on the Linux kernel not > doing it may be begging for future bugs if that behavior change. > I am not sure why fchmod() should set size. Csaba, any thoughts on this? Vijay From manu at netbsd.org Sun May 27 07:34:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 27 May 2012 09:34:02 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC1CFB3.7050808@redhat.com> Message-ID: <1kkqxw0.1smapd1jsih9iM%manu@netbsd.org> Vijay Bellur wrote: > For the testcase that you sent out, I see fsi->valid being set to 1 > which indicates only mode on Linux. The truncate path does not get > involved. I modified the testcase to send ftruncate/truncate and it > completed successfully. I modified by FUSE implementation to send FATTR_SIZE|FATTR_FH in one request, and FATTR_MODE|FATTR_FH|FATTR_MTIME|FATTR_ATIME in a separate one, and the test passes fine. On your test not raising the bug: Is it possible that Linux already sent a FATTR_SIZE|FATTR_FH when fchmod() is invoked, and that glusterfs discards a FATTR_SIZE that does not really resize? Did you try with supplying a bigger size? > > local = mem_get0 (this->local_pool); > I don't see this in pl_truncate_cbk(). mem_get0 is done only in > pl_truncate(). A code inspection in pl_(f)truncate did not raise any > suspicions to me. Right, this was an unfortunate copy/paste. However reverting to correct code does not fix the bug when FUSE sends FATTR_SIZE is set with FATTR_MODE at the same time. > I am not sure why fchmod() should set size. Csaba, any thoughts on this? This is an optimization. You have an open file, you just grew it and you change mode. The NetBSD kernel and its FUSE implementation do the two operations in a single FUSE request, because they are smart :-) I will commit the fix in NetBSD FUSE. But one day the Linux kernel could decide to use the same shortcut too. It may be wise to fix glusterfs so that it does not assume FATTR_SIZE is not sent with other metadata changes. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Sun May 27 21:40:35 2012 From: anand.avati at gmail.com (Anand Avati) Date: Sun, 27 May 2012 14:40:35 -0700 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: <20120525133519.GC19383@homeworld.netbsd.org> References: <20120525133519.GC19383@homeworld.netbsd.org> Message-ID: Can you give some more steps how you reproduced this? This has never happened in any of our testing. 
Might this be related to the dirname() differences in BSD? Have you noticed this after the GNU dirname usage? Avati On Fri, May 25, 2012 at 6:35 AM, Emmanuel Dreyfus wrote: > Hi > > Here is a bug with release-3.3. It happens on a 2 way replicated. Here is > what I have in one brick: > > [2012-05-25 15:03:09.463446] W [posix-handle.c:487:posix_handle_hard] > 0-pfs-posix: mismatching ino/dev between file > /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (57943061/16) > and handle > /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce > (57943060/16) > [2012-05-25 15:03:09.463552] E [posix.c:1277:posix_symlink] 0-pfs-posix: > setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 > failed > > On the other one: > > [2012-05-25 15:03:09.447682] W [posix-handle.c:487:posix_handle_hard] > 0-pfs-posix: mismatching ino/dev between file > /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (50557989/24) > and handle > /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce > (50557988/24) > [2012-05-25 15:03:09.447774] E [posix.c:1277:posix_symlink] 0-pfs-posix: > setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 > failed > > Someone can give me a hint of what happens, and how to track it down? > -- > Emmanuel Dreyfus > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Mon May 28 01:52:41 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 03:52:41 +0200 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: Message-ID: <1kkscxr.1k0ou1xcxcd7rM%manu@netbsd.org> Anand Avati wrote: > Can you give some more steps how you reproduced this? This has never > happened in any of our testing. This might probably related to the > dirname() differences in BSD? Have you noticed this after the GNU dirname > usage? I will investigate further. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 02:08:19 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 04:08:19 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkqxw0.1smapd1jsih9iM%manu@netbsd.org> Message-ID: <1kkscze.1y0ip7wj3y9uoM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I modified by FUSE implementation to send FATTR_SIZE|FATTR_FH in one > request, and FATTR_MODE|FATTR_FH|FATTR_MTIME|FATTR_ATIME in a separate > one, and the test passes fine. Um, I spoke too fast. Please disregard the previous post. The problem was not setting size and mode in the same request. That works fine. The bug appears when setting size, atime and mtime. It also appears when setting mode, atime and mtime. So here is the summary so far:

ATTR_SIZE|FATTR_FH -> ok
ATTR_SIZE|FATTR_FH|FATTR_ATIME|FATTR_MTIME -> breaks (*)
ATTR_MODE|FATTR_FH -> ok
ATTR_MODE|FATTR_FH|FATTR_ATIME|FATTR_MTIME -> breaks
ATTR_MODE|FATTR_SIZE|FATTR_FH -> ok (I was wrong here)

(*) I noticed that one a long time ago, and NetBSD FUSE already strips atime and mtime if ATTR_SIZE is set without ATTR_MODE|ATTR_UID|ATTR_GID.
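The combinations above are bits of the setattr valid mask that the kernel hands to the filesystem on a SETATTR request. As a rough illustration of the workaround described here (never letting a size update travel together with atime/mtime), a splitter could look like the sketch below. The FATTR_* values follow the Linux FUSE ABI header, and split_setattr_valid() is a hypothetical helper for illustration only, not code from NetBSD FUSE or glusterfs.

#include <stdint.h>
#include <stdio.h>

/* Bit values as in the Linux FUSE ABI header (<linux/fuse.h>). */
#define FATTR_MODE   (1 << 0)
#define FATTR_UID    (1 << 1)
#define FATTR_GID    (1 << 2)
#define FATTR_SIZE   (1 << 3)
#define FATTR_ATIME  (1 << 4)
#define FATTR_MTIME  (1 << 5)
#define FATTR_FH     (1 << 6)

/*
 * Hypothetical helper: split a combined SETATTR valid mask so that the
 * size change goes out on its own (plus the file handle) and everything
 * else follows in a second request, mimicking the two separate requests
 * described earlier in this thread.
 */
static void
split_setattr_valid(uint32_t valid, uint32_t *first, uint32_t *second)
{
        if ((valid & FATTR_SIZE) && (valid & (FATTR_ATIME | FATTR_MTIME))) {
                *first = FATTR_SIZE | (valid & FATTR_FH);
                *second = valid & ~FATTR_SIZE;
        } else {
                *first = valid;         /* nothing to split */
                *second = 0;
        }
}

int
main(void)
{
        /* The combination reported to break above: size + fh + times. */
        uint32_t valid = FATTR_SIZE | FATTR_FH | FATTR_ATIME | FATTR_MTIME;
        uint32_t first, second;

        split_setattr_valid(valid, &first, &second);
        printf("request 1 mask: 0x%02x\n", (unsigned)first);   /* size + fh */
        printf("request 2 mask: 0x%02x\n", (unsigned)second);  /* fh + atime + mtime */
        return 0;
}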
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 05:07:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 07:07:46 +0200 Subject: [Gluster-devel] Testing server down in replicated volume Message-ID: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> Hi everybody After the last fix in NetBSD FUSE (cf NULL loc in posix_acl_truncate), glusterfs release-3.3 now behaves quite nicely on NetBSD. I have been able to build stuff in a replicated glusterfs volume for a few hours, and it seems much faster than 3.2.6. However things turn badly when I tried to kill glusterfsd on a server. Since the volume is replicated, I would have expected the build to carry on unaffected. but this is now what happens: a ENOTCONN is raised up to the processes using the glusterfs volume: In file included from /pfs/manu/netbsd/usr/src/sys/sys/signal.h:114, from /pfs/manu/netbsd/usr/src/sys/sys/param.h:150, from /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/net/__cmsg_align bytes.c:40: /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string /machine/signal.h: Socket is not connected Is it the intended behavior? Here is the client log: [2012-05-28 05:48:27.440017] W [socket.c:195:__socket_rwv] 0-pfs-client-1: writev failed (Broken pipe) [2012-05-28 05:48:27.440989] W [socket.c:195:__socket_rwv] 0-pfs-client-1: readv failed (Connection reset by peer) [2012-05-28 05:48:27.441496] W [socket.c:1512:__socket_proto_state_machine] 0-pfs-client-1: reading from socket failed. Error (Connection reset by peer), peer (193.54.82.98:24011) [2012-05-28 05:48:27.441825] E [rpc-clnt.c:373:saved_frames_unwind] 0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(INODELK(29)) called at 2012-05-28 05:48:27.439249 (xid=0x1715867x) [2012-05-28 05:48:27.442222] W [client3_1-fops.c:1495:client3_1_inodelk_cbk] 0-pfs-client-1: remote operation failed: Socket is not connected [2012-05-28 05:48:27.442528] E [rpc-clnt.c:373:saved_frames_unwind] 0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(SETATTR(38)) called at 2012-05-28 05:48:27.440397 (xid=0x1715868x) [2012-05-28 05:48:27.442971] W [client3_1-fops.c:1954:client3_1_setattr_cbk] 0-pfs-client-1: remote operation failed: Socket is not connected (and so on with other saved_frames_unwind) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 05:08:36 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 07:08:36 +0200 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: <1kkscxr.1k0ou1xcxcd7rM%manu@netbsd.org> Message-ID: <1kksmhc.zfnn6i6bllp8M%manu@netbsd.org> Emmanuel Dreyfus wrote: > > Can you give some more steps how you reproduced this? This has never > > happened in any of our testing. This might probably related to the > > dirname() differences in BSD? Have you noticed this after the GNU dirname > > usage? > I will investigate further. It does not happen anymore. I think it was a consequence of the other bug I fixed. 
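As an aside on the "mismatching ino/dev" warning in this thread: the message from posix_handle_hard compares the device and inode numbers of the file on the brick against those of its .glusterfs gfid handle, which the warning expects to refer to the same inode. A suspect pair of paths can be checked by hand with a throwaway program like the sketch below; it is only a diagnostic aid written for this discussion, not glusterfs code, and the paths are whatever the log printed.

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
        struct stat a, b;

        if (argc != 3) {
                fprintf(stderr, "usage: %s file-on-brick glusterfs-handle\n", argv[0]);
                return EXIT_FAILURE;
        }

        /* lstat() so a symlink (as in the log above) is not followed. */
        if (lstat(argv[1], &a) == -1 || lstat(argv[2], &b) == -1) {
                perror("lstat");
                return EXIT_FAILURE;
        }

        printf("%s: dev=%lu ino=%lu nlink=%lu\n", argv[1],
            (unsigned long)a.st_dev, (unsigned long)a.st_ino,
            (unsigned long)a.st_nlink);
        printf("%s: dev=%lu ino=%lu nlink=%lu\n", argv[2],
            (unsigned long)b.st_dev, (unsigned long)b.st_ino,
            (unsigned long)b.st_nlink);

        if (a.st_dev == b.st_dev && a.st_ino == b.st_ino) {
                printf("same inode: file and handle look consistent\n");
                return EXIT_SUCCESS;
        }
        printf("different inodes: the condition posix_handle_hard warns about\n");
        return EXIT_FAILURE;
}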
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 29 07:55:09 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 29 May 2012 07:55:09 +0000 Subject: [Gluster-devel] Testing server down in replicated volume In-Reply-To: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> References: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> Message-ID: <20120529075509.GE19383@homeworld.netbsd.org> On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org From pkarampu at redhat.com Tue May 29 09:09:04 2012 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Tue, 29 May 2012 05:09:04 -0400 (EDT) Subject: [Gluster-devel] Testing server down in replicated volume In-Reply-To: <20120529075509.GE19383@homeworld.netbsd.org> Message-ID: <97e7abfe-e431-47b8-bb26-cf70adbef253@zmail01.collab.prod.int.phx2.redhat.com> I am looking into this. Will reply soon. Pranith ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Emmanuel Dreyfus" Cc: gluster-devel at nongnu.org Sent: Tuesday, May 29, 2012 1:25:09 PM Subject: Re: [Gluster-devel] Testing server down in replicated volume On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From vijay at build.gluster.com Tue May 29 13:44:11 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Tue, 29 May 2012 06:44:11 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa44 released Message-ID: <20120529134412.E8A3C100CB@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa44/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa44.tar.gz This release is made off v3.3.0qa44 From pkarampu at redhat.com Tue May 29 17:28:32 2012 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Tue, 29 May 2012 13:28:32 -0400 (EDT) Subject: [Gluster-devel] Testing server down in replicated volume In-Reply-To: <20120529075509.GE19383@homeworld.netbsd.org> Message-ID: <4fb4ce32-9683-44cd-a7bd-aa935c79db29@zmail01.collab.prod.int.phx2.redhat.com> hi Emmanuel, I tried this for half an hour, everytime it failed because of readdir. It did not fail in any other fop. I saw that FINODELKs which relate to transactions in afr failed, but the fop succeeded on the other brick. I am not sure why a setattr (metadata transaction) is failing in your setup when a node is down. I will instrument the code to simulate the inodelk failure in setattr. Will update you tomorrow. Fop failing in readdir is also an issue that needs to be addressed. Pranith. 
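For anyone who wants to narrow this down further, a throwaway client-side loop like the sketch below, run inside the FUSE mount of a replica-2 volume while one brick is killed, shows which call is the first to fail and with what errno; fchmod() is included because the log earlier in this thread shows a SETATTR being unwound with ENOTCONN. This is a hypothetical test harness written for this discussion, not part of glusterfs.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
        char name[64];
        char buf[4096];
        int fd, i;

        memset(buf, 'x', sizeof(buf));

        /* Loop forever; report the first fop that fails and its errno. */
        for (i = 0; ; i++) {
                snprintf(name, sizeof(name), "failover-test-%d", i % 16);

                if ((fd = open(name, O_CREAT|O_RDWR|O_TRUNC, 0644)) == -1) {
                        printf("iteration %d: open: %s\n", i, strerror(errno));
                        return 1;
                }
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                        printf("iteration %d: write: %s\n", i, strerror(errno));
                        return 1;
                }
                if (fchmod(fd, 0600) == -1) {   /* setattr, as in the log */
                        printf("iteration %d: fchmod: %s\n", i, strerror(errno));
                        return 1;
                }
                if (close(fd) == -1) {
                        printf("iteration %d: close: %s\n", i, strerror(errno));
                        return 1;
                }
        }
}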
----- Original Message ----- From: "Emmanuel Dreyfus" To: "Emmanuel Dreyfus" Cc: gluster-devel at nongnu.org Sent: Tuesday, May 29, 2012 1:25:09 PM Subject: Re: [Gluster-devel] Testing server down in replicated volume On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From bfoster at redhat.com Wed May 30 15:16:16 2012 From: bfoster at redhat.com (Brian Foster) Date: Wed, 30 May 2012 11:16:16 -0400 Subject: [Gluster-devel] glusterfs client and page cache Message-ID: <4FC639C0.6020503@redhat.com> Hi all, I've been playing with a little hack recently to add a gluster mount option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts on whether there's value to find an intelligent way to support this functionality. To provide some context: Our current behavior with regard to fuse is that page cache is utilized by fuse, from what I can tell, just about in the same manner as a typical local fs. The primary difference is that by default, the address space mapping for an inode is completely invalidated on open. So for example, if process A opens and reads a file in a loop, subsequent reads are served from cache (bypassing fuse and gluster). If process B steps in and opens the same file, the cache is flushed and the next reads from either process are passed down through fuse. The FOPEN_KEEP_CACHE option simply disables this cache flash on open behavior. The following are some notes on my experimentation thus far: - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size changes. This is a problem in that I can rewrite some or all of a file from another client and the cached client wouldn't notice. I've sent a patch to fuse-devel to also invalidate on mtime changes (similar to nfsv3 or cifs), so we'll see how well that is received. fuse also supports a range based invalidation notification that we could take advantage of if necessary. - I reproduce a measurable performance benefit in the local/cached read situation. For example, running a kernel compile against a source tree in a gluster volume (no other xlators and build output to local storage) improves to 6 minutes from just under 8 minutes with the default graph (9.5 minutes with only the client xlator and 1:09 locally). - Some of the specific differences from current io-cache caching: - io-cache supports time based invalidation and tunables such as cache size and priority. The page cache has no such controls. - io-cache invalidates more frequently on various fops. It also looks like we invalidate on writes and don't take advantage of the write data most recently sent, whereas page cache writes are cached (errors notwithstanding). - Page cache obviously has tighter integration with the system (i.e., drop_caches controls, more specific reporting, ability to drop cache when memory is needed). All in all, I'm curious what people think about enabling the cache behavior in gluster. 
We could support anything from the basic mount option I'm currently using (i.e., similar to attribute/dentry caching) to something integrated with io-cache (doing invalidations when necessary), or maybe even something eventually along the lines of the nfs weak cache consistency model where it validates the cache after every fop based on file attributes. In general, are there other big issues/questions that would need to be explored before this is useful (i.e., the size invalidation issue)? Are there other performance tests that should be explored? Thoughts appreciated. Thanks. Brian From fernando.frediani at qubenet.net Wed May 30 16:19:36 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Wed, 30 May 2012 16:19:36 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> Does anyone have an idea of this problem of not being able to power up the virtual machines on that NFS mount ? Also what do those logs mean that Anand say that there is a problem with the Repstr model. Is it something isn't finished yet ? Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 21:32 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. 
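On the spreading argument: the stripe translator deals out a file's data in fixed-size blocks, round-robin across the bricks of a stripe set, so a given offset of a large VMDK maps to one brick as in the sketch below. The 128 KiB block size and the 4-brick stripe count are assumed example values (both are volume options, not fixed defaults asserted here), and subvol_for_offset() is purely illustrative, not glusterfs code.

#include <stdint.h>
#include <stdio.h>

#define STRIPE_BLOCK_SIZE (128 * 1024)  /* assumed example block size */
#define STRIPE_COUNT      4             /* bricks in one stripe set */

/* Which subvolume of the stripe set serves a given file offset. */
static unsigned
subvol_for_offset(uint64_t offset)
{
        return (unsigned)((offset / STRIPE_BLOCK_SIZE) % STRIPE_COUNT);
}

int
main(void)
{
        /* A few offsets inside a 2 TB virtual disk image. */
        uint64_t offsets[] = {
                0,
                STRIPE_BLOCK_SIZE,
                10ULL * 1024 * 1024 * 1024,             /* 10 GiB in */
                2ULL * 1024 * 1024 * 1024 * 1024 - 1,   /* near the end */
        };
        size_t i;

        for (i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++)
                printf("offset %20llu -> stripe subvolume %u\n",
                    (unsigned long long)offsets[i],
                    subvol_for_offset(offsets[i]));
        return 0;
}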
Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. 
> > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Wed May 30 19:32:50 2012 From: anand.avati at gmail.com (Anand Avati) Date: Wed, 30 May 2012 12:32:50 -0700 Subject: [Gluster-devel] glusterfs client and page cache In-Reply-To: <4FC639C0.6020503@redhat.com> References: <4FC639C0.6020503@redhat.com> Message-ID: Brian, You are right, today we hardly leverage the page cache in the kernel. When Gluster started and performance translators were implemented, the fuse invalidation support did not exist, and since that support was brought in upstream fuse we haven't leveraged that effectively. We can actually do a lot more smart things using the invalidation changes. For the consistency concerns where an open fd continues to refer to local page cache - if that is a problem, today you need to mount with --enable-direct-io-mode to bypass the page cache altogether (this is very different from O_DIRECT open() support). On the other hand, to utilize the fuse invalidation APIs and promote using the page cache and still be consistent, we need to gear up glusterfs framework by first implementing server originated messaging support, then build some kind of opportunistic locking or leases to notify glusterfs clients about modifications from a second client, and third implement hooks in the client side listener to do things like sending fuse invalidations or purge pages in io-cache or flush pending writes in write-behind etc. This needs to happen, but we're short on resources to prioritize this sooner :-) Avati On Wed, May 30, 2012 at 8:16 AM, Brian Foster wrote: > Hi all, > > I've been playing with a little hack recently to add a gluster mount > option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts > on whether there's value to find an intelligent way to support this > functionality. To provide some context: > > Our current behavior with regard to fuse is that page cache is utilized > by fuse, from what I can tell, just about in the same manner as a > typical local fs. The primary difference is that by default, the address > space mapping for an inode is completely invalidated on open. So for > example, if process A opens and reads a file in a loop, subsequent reads > are served from cache (bypassing fuse and gluster). If process B steps > in and opens the same file, the cache is flushed and the next reads from > either process are passed down through fuse. The FOPEN_KEEP_CACHE option > simply disables this cache flash on open behavior. 
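For readers who have not poked at the FUSE side of this: the open-time behavior Brian describes and the invalidation path mentioned here look roughly like the sketch below at the libfuse level. A filesystem opts into keeping the page cache across opens by setting keep_cache in its open handler (the kernel then sees FOPEN_KEEP_CACHE in the open reply); when the filesystem later learns the file changed elsewhere, it can push fuse_lowlevel_notify_inval_inode() instead of relying on open-time invalidation. This toy read-only filesystem is generic libfuse 2.x usage for illustration only, not glusterfs's fuse bridge.

#define FUSE_USE_VERSION 26

#include <errno.h>
#include <fcntl.h>
#include <fuse.h>
#include <string.h>
#include <sys/stat.h>

static const char *file_path = "/hello";
static const char *file_body = "cached by the kernel between opens\n";

static int kc_getattr(const char *path, struct stat *st)
{
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
                st->st_mode = S_IFDIR | 0755;
                st->st_nlink = 2;
                return 0;
        }
        if (strcmp(path, file_path) == 0) {
                st->st_mode = S_IFREG | 0444;
                st->st_nlink = 1;
                st->st_size = strlen(file_body);
                return 0;
        }
        return -ENOENT;
}

static int kc_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                      off_t off, struct fuse_file_info *fi)
{
        if (strcmp(path, "/") != 0)
                return -ENOENT;
        fill(buf, ".", NULL, 0);
        fill(buf, "..", NULL, 0);
        fill(buf, file_path + 1, NULL, 0);
        return 0;
}

static int kc_open(const char *path, struct fuse_file_info *fi)
{
        if (strcmp(path, file_path) != 0)
                return -ENOENT;
        if ((fi->flags & O_ACCMODE) != O_RDONLY)
                return -EACCES;
        fi->keep_cache = 1;     /* reply to the kernel with FOPEN_KEEP_CACHE */
        return 0;
}

static int kc_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *fi)
{
        size_t len = strlen(file_body);

        if (strcmp(path, file_path) != 0)
                return -ENOENT;
        if ((size_t)off >= len)
                return 0;
        if (off + size > len)
                size = len - off;
        memcpy(buf, file_body + off, size);
        return (int)size;
}

static struct fuse_operations kc_ops = {
        .getattr = kc_getattr,
        .readdir = kc_readdir,
        .open    = kc_open,
        .read    = kc_read,
};

int main(int argc, char *argv[])
{
        /* When a cached file changes behind the kernel's back, a low-level
         * filesystem can call fuse_lowlevel_notify_inval_inode() to purge
         * the stale pages rather than relying on open-time invalidation. */
        return fuse_main(argc, argv, &kc_ops, NULL);
}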
> > The following are some notes on my experimentation thus far: > > - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size > changes. This is a problem in that I can rewrite some or all of a file > from another client and the cached client wouldn't notice. I've sent a > patch to fuse-devel to also invalidate on mtime changes (similar to > nfsv3 or cifs), so we'll see how well that is received. fuse also > supports a range based invalidation notification that we could take > advantage of if necessary. > > - I reproduce a measurable performance benefit in the local/cached read > situation. For example, running a kernel compile against a source tree > in a gluster volume (no other xlators and build output to local storage) > improves to 6 minutes from just under 8 minutes with the default graph > (9.5 minutes with only the client xlator and 1:09 locally). > > - Some of the specific differences from current io-cache caching: > - io-cache supports time based invalidation and tunables such as > cache > size and priority. The page cache has no such controls. > - io-cache invalidates more frequently on various fops. It also > looks > like we invalidate on writes and don't take advantage of the write data > most recently sent, whereas page cache writes are cached (errors > notwithstanding). > - Page cache obviously has tighter integration with the system > (i.e., > drop_caches controls, more specific reporting, ability to drop cache > when memory is needed). > > All in all, I'm curious what people think about enabling the cache > behavior in gluster. We could support anything from the basic mount > option I'm currently using (i.e., similar to attribute/dentry caching) > to something integrated with io-cache (doing invalidations when > necessary), or maybe even something eventually along the lines of the > nfs weak cache consistency model where it validates the cache after > every fop based on file attributes. > > In general, are there other big issues/questions that would need to be > explored before this is useful (i.e., the size invalidation issue)? Are > there other performance tests that should be explored? Thoughts > appreciated. Thanks. > > Brian > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bfoster at redhat.com Wed May 30 23:10:58 2012 From: bfoster at redhat.com (Brian Foster) Date: Wed, 30 May 2012 19:10:58 -0400 Subject: [Gluster-devel] glusterfs client and page cache In-Reply-To: References: <4FC639C0.6020503@redhat.com> Message-ID: <4FC6A902.9010406@redhat.com> On 05/30/2012 03:32 PM, Anand Avati wrote: > Brian, > You are right, today we hardly leverage the page cache in the kernel. > When Gluster started and performance translators were implemented, the > fuse invalidation support did not exist, and since that support was > brought in upstream fuse we haven't leveraged that effectively. We can > actually do a lot more smart things using the invalidation changes. > > For the consistency concerns where an open fd continues to refer to > local page cache - if that is a problem, today you need to mount with > --enable-direct-io-mode to bypass the page cache altogether (this is > very different from O_DIRECT open() support). 
On the other hand, to > utilize the fuse invalidation APIs and promote using the page cache and > still be consistent, we need to gear up glusterfs framework by first > implementing server originated messaging support, then build some kind > of opportunistic locking or leases to notify glusterfs clients about > modifications from a second client, and third implement hooks in the > client side listener to do things like sending fuse invalidations or > purge pages in io-cache or flush pending writes in write-behind etc. > This needs to happen, but we're short on resources to prioritize this > sooner :-) > Thanks for the context Avati. The fuse patch I sent lead to a similar thought process with regard to finer grained invalidation. So far it seems well received, and as I understand it, we can also utilize that mechanism to do full invalidations from gluster on older fuse modules that wouldn't have that fix. I'll look into incorporating that into what I have so far and making it available for review. Brian > Avati > > On Wed, May 30, 2012 at 8:16 AM, Brian Foster > wrote: > > Hi all, > > I've been playing with a little hack recently to add a gluster mount > option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts > on whether there's value to find an intelligent way to support this > functionality. To provide some context: > > Our current behavior with regard to fuse is that page cache is utilized > by fuse, from what I can tell, just about in the same manner as a > typical local fs. The primary difference is that by default, the address > space mapping for an inode is completely invalidated on open. So for > example, if process A opens and reads a file in a loop, subsequent reads > are served from cache (bypassing fuse and gluster). If process B steps > in and opens the same file, the cache is flushed and the next reads from > either process are passed down through fuse. The FOPEN_KEEP_CACHE option > simply disables this cache flash on open behavior. > > The following are some notes on my experimentation thus far: > > - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size > changes. This is a problem in that I can rewrite some or all of a file > from another client and the cached client wouldn't notice. I've sent a > patch to fuse-devel to also invalidate on mtime changes (similar to > nfsv3 or cifs), so we'll see how well that is received. fuse also > supports a range based invalidation notification that we could take > advantage of if necessary. > > - I reproduce a measurable performance benefit in the local/cached read > situation. For example, running a kernel compile against a source tree > in a gluster volume (no other xlators and build output to local storage) > improves to 6 minutes from just under 8 minutes with the default graph > (9.5 minutes with only the client xlator and 1:09 locally). > > - Some of the specific differences from current io-cache caching: > - io-cache supports time based invalidation and tunables such > as cache > size and priority. The page cache has no such controls. > - io-cache invalidates more frequently on various fops. It > also looks > like we invalidate on writes and don't take advantage of the write data > most recently sent, whereas page cache writes are cached (errors > notwithstanding). > - Page cache obviously has tighter integration with the > system (i.e., > drop_caches controls, more specific reporting, ability to drop cache > when memory is needed). 
> > All in all, I'm curious what people think about enabling the cache > behavior in gluster. We could support anything from the basic mount > option I'm currently using (i.e., similar to attribute/dentry caching) > to something integrated with io-cache (doing invalidations when > necessary), or maybe even something eventually along the lines of the > nfs weak cache consistency model where it validates the cache after > every fop based on file attributes. > > In general, are there other big issues/questions that would need to be > explored before this is useful (i.e., the size invalidation issue)? Are > there other performance tests that should be explored? Thoughts > appreciated. Thanks. > > Brian > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > From johnmark at redhat.com Thu May 31 16:33:20 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 31 May 2012 12:33:20 -0400 (EDT) Subject: [Gluster-devel] A very special announcement from Gluster.org In-Reply-To: <344ab6e5-d6de-48d9-bfe8-e2727af7b45e@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <660ccad1-e191-405c-8645-1cb2fb02f80c@zmail01.collab.prod.int.phx2.redhat.com> Today, we?re announcing the next generation of GlusterFS , version 3.3. The release has been a year in the making and marks several firsts: the first post-acquisition release under Red Hat, our first major act as an openly-governed project and our first foray beyond NAS. We?ve also taken our first steps towards merging big data and unstructured data storage, giving users and developers new ways of managing their data scalability challenges. GlusterFS is an open source, fully distributed storage solution for the world?s ever-increasing volume of unstructured data. It is a software-only, highly available, scale-out, centrally managed storage pool that can be backed by POSIX filesystems that support extended attributes, such as Ext3/4, XFS, BTRFS and many more. This release provides many of the most commonly requested features including proactive self-healing, quorum enforcement, and granular locking for self-healing, as well as many additional bug fixes and enhancements. Some of the more noteworthy features include: ? Unified File and Object storage ? Blending OpenStack?s Object Storage API with GlusterFS provides simultaneous read and write access to data as files or as objects. ? HDFS compatibility ? Gives Hadoop administrators the ability to run MapReduce jobs on unstructured data on GlusterFS and access the data with well-known tools and shell scripts. ? Proactive self-healing ? GlusterFS volumes will now automatically restore file integrity after a replica recovers from failure. ? Granular locking ? Allows large files to be accessed even during self-healing, a feature that is particularly important for VM images. ? Replication improvements ? With quorum enforcement you can be confident that your data has been written in at least the configured number of places before the file operation returns, allowing a user-configurable adjustment to fault tolerance vs performance. Visit http://www.gluster.org to download. Packages are available for most distributions, including Fedora, Debian, RHEL, Ubuntu and CentOS. Get involved! Join us on #gluster on freenode, join our mailing list , ?like? our Facebook page , follow us on Twitter , or check out our LinkedIn group . 
GlusterFS is an open source project sponsored by Red Hat ?, who uses it in its line of Red Hat Storage products. (this post published at http://www.gluster.org/2012/05/introducing-glusterfs-3-3/ ) -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.frediani at qubenet.net Thu May 31 16:36:36 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Thu, 31 May 2012 16:36:36 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F8DC08@QUBE-TR2-EXC01.qube.qubenet.net> What is happening with this ? Non one actually care to take ownership about this ? If this is a bug why nobody is interested to get it fixed ? If not someone speak up please. Two things are not working as they supposed, I am reporting back and nobody seems to give a dam about it. -----Original Message----- From: Fernando Frediani (Qube) Sent: 30 May 2012 17:20 To: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Does anyone have an idea of this problem of not being able to power up the virtual machines on that NFS mount ? Also what do those logs mean that Anand say that there is a problem with the Repstr model. Is it something isn't finished yet ? Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 21:32 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. 
However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > 
I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From johnmark at redhat.com Thu May 31 16:48:45 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 31 May 2012 12:48:45 -0400 (EDT) Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F8DC08@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <59507de0-4264-4e27-ac94-c9b34890a5f4@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > What is happening with this ? > Non one actually care to take ownership about this ? > If this is a bug why nobody is interested to get it fixed ? If not > someone speak up please. > Two things are not working as they supposed, I am reporting back and > nobody seems to give a dam about it. Hi Fernando, If nobody is replying, it's because they don't have experience with your particular setup, or they've never seen this problem before. If you feel it's a bug, then please file a bug at http://bugzilla.redhat.com/ You can also ask questions on the IRC channel: #gluster Or on http://community.gluster.org/ I know it can be frustrating, but please understand that you will get a response only if someone out there has experience with your problem. Thanks, John Mark Community guy From manu at netbsd.org Tue May 1 02:18:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 1 May 2012 04:18:53 +0200 Subject: [Gluster-devel] Fwd: Re: Rejected NetBSD patches In-Reply-To: <4F9EED0C.2080203@redhat.com> Message-ID: <1kjeekq.1nkt3n11wtalkgM%manu@netbsd.org> Kaleb S. KEITHLEY wrote: > I haven't seen anything so far that needs to discriminate between NetBSD > and FreeBSD, but if we come across one, we can use __NetBSD__ and > __FreeBSD__ inside GF_BSD_HOST_OS. If you look at the code, NetBSD makes is way using GF_BSD_HOST_OS or GF_LINUX_HOST_OS, depending of the situation. NetBSD and FreeBSD forked 19 years ago, they had time to diverge. 
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 1 03:21:28 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 1 May 2012 05:21:28 +0200 Subject: [Gluster-devel] qa39 crash In-Reply-To: <1kjdvf9.1o294sj12c16nlM%manu@netbsd.org> Message-ID: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I got a crash client-side. It happens in pthread_spin_lock() and I > recall fixing that kind of issue for a uninitialized lock. I added printf, and inode is NULL in mdc_inode_pre() therefore this is not an uninitializd lock problem. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 1 05:31:57 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 1 May 2012 07:31:57 +0200 Subject: [Gluster-devel] qa39 crash In-Reply-To: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org> Message-ID: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> Emmanuel Dreyfus wrote: > I added printf, and inode is NULL in mdc_inode_pre() therefore this is > not an uninitializd lock problem. Indeed, this this the mdc_local_t structure that seems uninitialized: (gdb) frame 3 #3 0xbaa0ecb5 in mdc_setattr_cbk (frame=0xbb7e32a0, cookie=0xbb7a4380, this=0xba3e3000, op_ret=0, op_errno=0, prebuf=0xb940a57c, postbuf=0xb940a5e4, xdata=0x0) at md-cache.c:1365 1365 mdc_inode_iatt_set (this, local->loc.inode, postbuf); (gdb) print *(mdc_local_t *)frame->local $6 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, fd = 0xb8f9d230, linkname = 0x0, xattr = 0x0} And indeed local->loc it is not initialized in mdc_fsetattr(). I suspect there is away of obteining it from fd, but this is getting beyond by knowledge of glusterfs internals. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vbellur at redhat.com Wed May 2 04:21:08 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Wed, 02 May 2012 09:51:08 +0530 Subject: [Gluster-devel] qa39 crash In-Reply-To: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> References: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> Message-ID: <4FA0B634.5090605@redhat.com> On 05/01/2012 11:01 AM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > >> I added printf, and inode is NULL in mdc_inode_pre() therefore this is >> not an uninitializd lock problem. > > Indeed, this this the mdc_local_t structure that seems uninitialized: > > (gdb) frame 3 > #3 0xbaa0ecb5 in mdc_setattr_cbk (frame=0xbb7e32a0, cookie=0xbb7a4380, > this=0xba3e3000, op_ret=0, op_errno=0, prebuf=0xb940a57c, > postbuf=0xb940a5e4, xdata=0x0) at md-cache.c:1365 > 1365 mdc_inode_iatt_set (this, local->loc.inode, postbuf); > > (gdb) print *(mdc_local_t *)frame->local > $6 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, > gfid = '\000', pargfid = '\000' > }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, > parent = 0x0, gfid = '\000', > pargfid = '\000'}, fd = 0xb8f9d230, > linkname = 0x0, xattr = 0x0} > > And indeed local->loc it is not initialized in mdc_fsetattr(). I suspect > there is away of obteining it from fd, but this is getting beyond by > knowledge of glusterfs internals. > > Do you have a test case that causes this crash? 
Vijay From anand.avati at gmail.com Wed May 2 05:29:22 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 1 May 2012 22:29:22 -0700 Subject: [Gluster-devel] qa39 crash In-Reply-To: <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> References: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org> <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> Message-ID: Can you confirm if this fixes (obvious bug) - diff --git a/xlators/performance/md-cache/src/md-cache.c b/xlators/performance/md-cache/src/md-cache.c index 9ef599a..66c0bf3 100644 --- a/xlators/performance/md-cache/src/md-cache.c +++ b/xlators/performance/md-cache/src/md-cache.c @@ -1423,7 +1423,7 @@ mdc_fsetattr (call_frame_t *frame, xlator_t *this, fd_t *fd, local->fd = fd_ref (fd); - STACK_WIND (frame, mdc_setattr_cbk, + STACK_WIND (frame, mdc_fsetattr_cbk, FIRST_CHILD(this), FIRST_CHILD(this)->fops->fsetattr, fd, stbuf, valid, xdata); On Mon, Apr 30, 2012 at 10:31 PM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > > > I added printf, and inode is NULL in mdc_inode_pre() therefore this is > > not an uninitializd lock problem. > > Indeed, this this the mdc_local_t structure that seems uninitialized: > > (gdb) frame 3 > #3 0xbaa0ecb5 in mdc_setattr_cbk (frame=0xbb7e32a0, cookie=0xbb7a4380, > this=0xba3e3000, op_ret=0, op_errno=0, prebuf=0xb940a57c, > postbuf=0xb940a5e4, xdata=0x0) at md-cache.c:1365 > 1365 mdc_inode_iatt_set (this, local->loc.inode, postbuf); > > (gdb) print *(mdc_local_t *)frame->local > $6 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, > gfid = '\000' , pargfid = '\000' > }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, > parent = 0x0, gfid = '\000' , > pargfid = '\000' }, fd = 0xb8f9d230, > linkname = 0x0, xattr = 0x0} > > And indeed local->loc it is not initialized in mdc_fsetattr(). I suspect > there is away of obteining it from fd, but this is getting beyond by > knowledge of glusterfs internals. > > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kshlmster at gmail.com Wed May 2 05:35:02 2012 From: kshlmster at gmail.com (Kaushal M) Date: Wed, 2 May 2012 11:05:02 +0530 Subject: [Gluster-devel] 3.3 and address family In-Reply-To: References: <1kj84l9.19kzk6dfdsrtsM%manu@netbsd.org> Message-ID: Didn't send the last message to list. Resending. On Wed, May 2, 2012 at 10:58 AM, Kaushal M wrote: > Hi Emmanuel, > > Took a look at your patch for fixing this problem. It solves the it for > the brick glusterfsd processes. But glusterd also spawns and communicates > with nfs server & self-heal daemon processes. The proper xlator-option is > not set for these. This might be the cause. These processes are started in > glusterd_nodesvc_start() in glusterd-utils, which is where you could look > into. > > Thanks, > Kaushal > > On Fri, Apr 27, 2012 at 10:31 PM, Emmanuel Dreyfus wrote: > >> Hi >> >> I am still trying on 3.3.0qa39, and now I have an address family issue: >> gluserfs defaults to inet6 transport while the machine is not configured >> for IPv6. 
>> >> I added option transport.address-family inet in glusterfs/glusterd.vol >> and now glusterd starts with an IPv4 address, but unfortunately, >> communication with spawned glusterfsd do not stick to the same address >> family: I can see packets going from ::1.1023 to ::1.24007 and they are >> rejected since I used transport.address-family inet. >> >> I need to tell glusterfs to use the same address family. I already did a >> patch for exactly the same problem some time ago, this is not very >> difficult, but it would save me some time if someone could tell me where >> should I look at in the code. >> >> -- >> Emmanuel Dreyfus >> http://hcpnet.free.fr/pubz >> manu at netbsd.org >> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at nongnu.org >> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Wed May 2 09:30:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 2 May 2012 09:30:32 +0000 Subject: [Gluster-devel] qa39 crash In-Reply-To: References: <1kjehi5.3lc9mxggvfrlM%manu@netbsd.org> <1kjen07.9rmdeg1akdng7M%manu@netbsd.org> Message-ID: <20120502093032.GI3677@homeworld.netbsd.org> On Tue, May 01, 2012 at 10:29:22PM -0700, Anand Avati wrote: > Can you confirm if this fixes (obvious bug) - I do not crash anymore, but I spotted another bug, I do not know if it is related: removing owner write access to a non empty file open with write access fails with EPERMo Here is my test case. It works fine with glusterfs-3.2.6 but fchmod() fails with EPERM on 3.3.0qa39 #include #include #include #include #include #include int main(void) { int fd; char buf[16]; if ((fd = open("test.tmp", O_RDWR|O_CREAT, 0644)) == -1) err(EX_OSERR, "fopen failed"); if (write(fd, buf, sizeof(buf)) != sizeof(buf)) err(EX_OSERR, "write failed"); if (fchmod(fd, 0444) == -1) err(EX_OSERR, "fchmod failed"); if (close(fd) == -1) err(EX_OSERR, "close failed"); return EX_OK; } -- Emmanuel Dreyfus manu at netbsd.org From xhernandez at datalab.es Wed May 2 10:55:37 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Wed, 02 May 2012 12:55:37 +0200 Subject: [Gluster-devel] Some questions about requisites of translators Message-ID: <4FA112A9.1080101@datalab.es> Hello, I'm wondering if there are any requisites that translators must satisfy to work correctly inside glusterfs. In particular I need to know two things: 1. Are translators required to respect the order in which they receive the requests ? This is specially important in translators such as performance/io-threads or caching ones. It seems that these translators can reorder requests. If this is the case, is there any way to force some order between requests ? can inodelk/entrylk be used to force the order ? 2. Are translators required to propagate callback arguments even if the result of the operation is an error ? and if an internal translator error occurs ? When a translator has multiple subvolumes, I've seen that some arguments, such as xdata, are replaced with NULL. This can be understood, but are regular translators (those that only have one subvolume) allowed to do that or must they preserve the value of xdata, even in the case of an internal error ? If this is not a requisite, xdata loses it's function of delivering back extra information. 
Thank you very much, Xavi From anand.avati at gmail.com Sat May 5 06:02:30 2012 From: anand.avati at gmail.com (Anand Avati) Date: Fri, 4 May 2012 23:02:30 -0700 Subject: [Gluster-devel] Some questions about requisites of translators In-Reply-To: <4FA112A9.1080101@datalab.es> References: <4FA112A9.1080101@datalab.es> Message-ID: On Wed, May 2, 2012 at 3:55 AM, Xavier Hernandez wrote: > Hello, > > I'm wondering if there are any requisites that translators must satisfy to > work correctly inside glusterfs. > > In particular I need to know two things: > > 1. Are translators required to respect the order in which they receive the > requests ? > > This is specially important in translators such as performance/io-threads > or caching ones. It seems that these translators can reorder requests. If > this is the case, is there any way to force some order between requests ? > can inodelk/entrylk be used to force the order ? > > Translators are not expected to maintain ordering of requests. The only translator which takes care of ordering calls is write-behind. After acknowledging back write requests it has to make sure future requests see the true "effect" as though the previous write actually completed. To that end, it queues future "dependent" requests till the write acknowledgement is received from the server. inodelk/entrylk calls help achieve synchronization among clients (by getting into a critical section) - just like a mutex. It is an arbitrator. It does not help for ordering of two calls. If one call must strictly complete after another call from your translator's point of view (i.e, if it has such a requirement), then the latter call's STACK_WIND must happen in the callback of the former's STACK_UNWIND path. There are no guarantees maintained by the system to ensure that a second STACK_WIND issued right after a first STACK_WIND will complete and callback in the same order. Write-behind does all its ordering gimmicks only because it STACK_UNWINDs a write call prematurely and therefore must maintain the causal effects by means of queueing new requests behind the downcall towards the server. > 2. Are translators required to propagate callback arguments even if the > result of the operation is an error ? and if an internal translator error > occurs ? > > Usually no. If op_ret is -1, only op_errno is expected to be a usable value. Rest of the callback parameters are junk. > When a translator has multiple subvolumes, I've seen that some arguments, > such as xdata, are replaced with NULL. This can be understood, but are > regular translators (those that only have one subvolume) allowed to do that > or must they preserve the value of xdata, even in the case of an internal > error ? > > It is best to preserve the arguments unless you know specifically what you are doing. In case of error, all the non-op_{ret,errno} arguments are typically junk, including xdata. > If this is not a requisite, xdata loses it's function of delivering back > extra information. > > Can you explain? Are you seeing a use case for having a valid xdata in the callback even with op_ret == -1? Thanks, Avati -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tsato at valinux.co.jp Mon May 7 04:17:45 2012 From: tsato at valinux.co.jp (Tomoaki Sato) Date: Mon, 07 May 2012 13:17:45 +0900 Subject: [Gluster-devel] showmount reports many entries (Re: glusterfs-3.3.0qa39 released) In-Reply-To: <4F9A98E8.80400@gluster.com> References: <20120427053612.E08671804F5@build.gluster.com> <4F9A6422.3010000@valinux.co.jp> <4F9A98E8.80400@gluster.com> Message-ID: <4FA74CE9.8010805@valinux.co.jp> (2012/04/27 22:02), Vijay Bellur wrote: > On 04/27/2012 02:47 PM, Tomoaki Sato wrote: >> Vijay, >> >> I have been testing gluster-3.3.0qa39 NFS with 4 CentOS 6.2 NFS clients. >> The test set is like following: >> 1) All 4 clients mount 64 directories. (total 192 directories) >> 2) 192 procs runs on the 4 clients. each proc create a new unique file and write 1GB data to the file. (total 192GB) >> 3) All 4 clients umount 64 directories. >> >> The test finished successfully but showmount command reported many entries in spite of there were no NFS clients remain. >> Then I have restarted gluster related daemons. >> After restarting, showmount command reports no entries. >> Any insight into this is much appreciated. > > > http://review.gluster.com/2973 should fix this. Can you please confirm? > > > Thanks, > Vijay Vijay, I have confirmed that following instructions with c3a16c32. # showmount one Hosts on one: # mkdir /tmp/mnt # mount one:/one /tmp/mnt # showmount one Hosts on one: 172.17.200.108 # umount /tmp/mnt # showmount one Hosts on one: # And the test set has started running. It will take a couple of days to finish. by the way, I did following instructions to build RPM packages on a CentOS 5.6 x86_64 host. # yum install python-ctypes ncureses-devel readline-devel libibverbs-devel # git clone -b c3a16c32 ssh://@git.gluster.com/glusterfs.git glusterfs-3git # tar zcf /usr/src/redhat/SOURCES/glusterfs-3bit.tar.gz glusterfs-3git # rpmbuild -bb /usr/src/redhat/SOURCES/glusterfs-3git.tar.gz Thanks, Tomo Sato From manu at netbsd.org Mon May 7 04:39:22 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 7 May 2012 04:39:22 +0000 Subject: [Gluster-devel] Fixing Address family mess Message-ID: <20120507043922.GA10874@homeworld.netbsd.org> Hi Quick summary of the problem: when using transport-type socket with transport.address-family unspecified, glusterfs binds sockets with AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the kernel prefers. At mine it uses AF_INET6, while the machine is not configured to use IPv6. As a result, glusterfs client cannot connect to glusterfs server. A workaround is to use option transport.address-family inet in glusterfsd/glusterd.vol but that option must also be specified in all volume files for all bricks and FUSE client, which is unfortunate because they are automatically generated. I proposed a patch so that glusterd transport.address-family setting is propagated to various places: http://review.gluster.com/3261 That did not meet consensus. Jeff Darcy notes that we should be able to listen both on AF_INET and AF_INET6 sockets at the same time. I had a look at the code, and indeed it could easily be done. The only trouble is how to specify the listeners. For now option transport defaults to socket,rdma. I suggest we add socket families in that specification. 
We would then have this default: option transport socket/inet,socket/inet6,rdma With the following semantics: socket -> AF_UNSPEC socket (backward comaptibility) socket/inet -> AF_INET socket socket/inet6 -> AF_INET6 socket socket/sdp -> AF_SDP socket rdma -> sameas before Any opinion on that plan? Please comment before I writa code, it will save me some time is the proposal is wrong. -- Emmanuel Dreyfus manu at netbsd.org From xhernandez at datalab.es Mon May 7 08:07:52 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 07 May 2012 10:07:52 +0200 Subject: [Gluster-devel] Some questions about requisites of translators In-Reply-To: References: <4FA112A9.1080101@datalab.es> Message-ID: <4FA782D8.2000100@datalab.es> On 05/05/2012 08:02 AM, Anand Avati wrote: > > > On Wed, May 2, 2012 at 3:55 AM, Xavier Hernandez > > wrote: > > Hello, > > I'm wondering if there are any requisites that translators must > satisfy to work correctly inside glusterfs. > > In particular I need to know two things: > > 1. Are translators required to respect the order in which they > receive the requests ? > > This is specially important in translators such as > performance/io-threads or caching ones. It seems that these > translators can reorder requests. If this is the case, is there > any way to force some order between requests ? can inodelk/entrylk > be used to force the order ? > > > Translators are not expected to maintain ordering of requests. The > only translator which takes care of ordering calls is write-behind. > After acknowledging back write requests it has to make sure future > requests see the true "effect" as though the previous write actually > completed. To that end, it queues future "dependent" requests till the > write acknowledgement is received from the server. > > inodelk/entrylk calls help achieve synchronization among clients (by > getting into a critical section) - just like a mutex. It is an > arbitrator. It does not help for ordering of two calls. If one call > must strictly complete after another call from your translator's point > of view (i.e, if it has such a requirement), then the latter call's > STACK_WIND must happen in the callback of the former's STACK_UNWIND > path. There are no guarantees maintained by the system to ensure that > a second STACK_WIND issued right after a first STACK_WIND will > complete and callback in the same order. Write-behind does all its > ordering gimmicks only because it STACK_UNWINDs a write call > prematurely and therefore must maintain the causal effects by means of > queueing new requests behind the downcall towards the server. Good to know > 2. Are translators required to propagate callback arguments even > if the result of the operation is an error ? and if an internal > translator error occurs ? > > > Usually no. If op_ret is -1, only op_errno is expected to be a usable > value. Rest of the callback parameters are junk. > > When a translator has multiple subvolumes, I've seen that some > arguments, such as xdata, are replaced with NULL. This can be > understood, but are regular translators (those that only have one > subvolume) allowed to do that or must they preserve the value of > xdata, even in the case of an internal error ? > > > It is best to preserve the arguments unless you know specifically what > you are doing. In case of error, all the non-op_{ret,errno} arguments > are typically junk, including xdata. > > If this is not a requisite, xdata loses it's function of > delivering back extra information. > > > Can you explain? 
Are you seeing a use case for having a valid xdata in > the callback even with op_ret == -1? > As a part of a translator that I'm developing that works with multiple subvolumes, I need to implement some healing support to mantain data coherency (similar to AFR). After some thought, I decided that it could be advantageous to use a dedicated healing translator located near the bottom of the translators stack on the servers. This translator won't work by itself, it only adds support to be used by a higher level translator, which have to manage the logic of the healing and decide when a node needs to be healed. To do this, sometimes I need to return an error because an operation cannot be completed due to some condition related with healing itself (not with the underlying storage). However I need to send some specific healing information to let the upper translator know how it has to handle the detected condition. I cannot send a success answer because intermediate translators could take the fake data as valid and they could begin to operate incorrectly or even create inconsistencies. The other alternative is to use op_errno to encode the extra data, but this will also be difficult, even impossible in some cases, due to the amount of data and the complexity to combine it with an error code without mislead intermediate translators with strange or invalid error codes. I talked with John Mark about this translator and he suggested me to discuss it over the list. Therefore I'll initiate another thread to expose in more detail how it works and I would appreciate very much your opinion, and that of the other developers, about it. Especially if it can really be faster/safer that other solutions or not, or if you find any problem or have any suggestion to improve it. I think it could also be used by AFR and any future translator that may need some healing capabilities. Thank you very much, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From vijay at build.gluster.com Mon May 7 08:15:50 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Mon, 7 May 2012 01:15:50 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa40 released Message-ID: <20120507081553.5AA00100C5@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa40/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz This release is made off v3.3.0qa40 From vijay at gluster.com Mon May 7 10:31:09 2012 From: vijay at gluster.com (Vijay Bellur) Date: Mon, 07 May 2012 16:01:09 +0530 Subject: [Gluster-devel] glusterfs-3.3.0qa40 released In-Reply-To: <20120507081553.5AA00100C5@build.gluster.com> References: <20120507081553.5AA00100C5@build.gluster.com> Message-ID: <4FA7A46D.2050506@gluster.com> This release is done by reverting commit 7d0397c2144810c8a396e00187a6617873c94002 as replace-brick and quota were not functioning with that commit. Hence the tag for this qa release would not be available in github. If you are interested in creating an equivalent of this qa release from git, it would be c4dadc74fd1d1188f123eae7f2b6d6f5232e2a0f - commit 7d0397c2144810c8a396e00187a6617873c94002. 
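Expressed as git commands, that "c4dadc74... minus commit 7d0397..." recipe would presumably amount to something like the following sketch (untested, and only one way of spelling it):

git checkout c4dadc74fd1d1188f123eae7f2b6d6f5232e2a0f
git revert 7d0397c2144810c8a396e00187a6617873c94002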
Thanks, Vijay On 05/07/2012 01:45 PM, Vijay Bellur wrote: > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa40/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz > > This release is made off v3.3.0qa40 > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From jdarcy at redhat.com Mon May 7 13:16:38 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 09:16:38 -0400 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: <20120507043922.GA10874@homeworld.netbsd.org> References: <20120507043922.GA10874@homeworld.netbsd.org> Message-ID: <4FA7CB36.6040701@redhat.com> On 05/07/2012 12:39 AM, Emmanuel Dreyfus wrote: > Quick summary of the problem: when using transport-type socket with > transport.address-family unspecified, glusterfs binds sockets with > AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the > kernel prefers. At mine it uses AF_INET6, while the machine is not > configured to use IPv6. As a result, glusterfs client cannot connect > to glusterfs server. > > A workaround is to use option transport.address-family inet in > glusterfsd/glusterd.vol but that option must also be specified in > all volume files for all bricks and FUSE client, which is > unfortunate because they are automatically generated. I proposed a > patch so that glusterd transport.address-family setting is propagated > to various places: http://review.gluster.com/3261 > > That did not meet consensus. Jeff Darcy notes that we should be able > to listen both on AF_INET and AF_INET6 sockets at the same time. I > had a look at the code, and indeed it could easily be done. The only > trouble is how to specify the listeners. For now option transport > defaults to socket,rdma. I suggest we add socket families in that > specification. We would then have this default: > option transport socket/inet,socket/inet6,rdma > > With the following semantics: > socket -> AF_UNSPEC socket (backward comaptibility) > socket/inet -> AF_INET socket > socket/inet6 -> AF_INET6 socket > socket/sdp -> AF_SDP socket > rdma -> sameas before > > Any opinion on that plan? Please comment before I writa code, it will > save me some time is the proposal is wrong. I think it looks like the right solution. I understand that keeping the address-family multiplexing entirely in the socket code would be more complex, since it changes the relationship between transport instances and file descriptors (and threads in the SSL/multi-thread case). That's unfortunate, but far from the most unfortunate thing about our transport code. I do wonder whether we should use '/' as the separator, since it kind of implies the same kind of relationships between names and paths that we use for translator names - e.g. cluster/dht is actually used as part of the actual path for dht.so - and in this case that relationship doesn't actually exist. Another idea, which I don't actually like any better but which I'll suggest for completeness, would be to express the list of address families via an option: option transport.socket.address-family inet6 Now that I think about it, another benefit is that it supports multiple instances of the same address family with different options, e.g. to support segregated networks. Obviously we lack higher-level support for that right now, but if that should ever change then it would be nice to have the right low-level infrastructure in place for it. 
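For comparison, a minimal glusterd.vol fragment showing today's workaround next to the two proposed spellings discussed in this thread. Both proposals are illustrative only; neither option exists in the current code, and the commented lines are simply the syntax under discussion:

volume management
    type mgmt/glusterd
    option transport-type socket,rdma
    # workaround available today: pin glusterd to IPv4 instead of AF_UNSPEC
    option transport.address-family inet
    # proposal (a): encode the address family in the transport list
    # option transport socket/inet,socket/inet6,rdma
    # proposal (b): a per-transport address-family option
    # option transport.socket.address-family inet6
end-volume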
From jdarcy at redhat.com Mon May 7 14:43:47 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 10:43:47 -0400 Subject: [Gluster-devel] ZkFarmer Message-ID: <4FA7DFA3.1030300@redhat.com> I've long felt that our ways of dealing with cluster membership and staging of config changes is not quite as robust and scalable as we might want. Accordingly, I spent a bit of time a couple of weeks ago looking into the possibility of using ZooKeeper to do some of this stuff. Yeah, it brings in a heavy Java dependency, but when I looked at some lighter-weight alternatives they all seemed to be lacking in more important ways. Basically the idea was to do this: * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or point everyone at an existing ZooKeeper cluster. * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" merely updates ZK, and "peer status" merely reads from it). * Store config information in ZK *once* instead of regenerating volfiles etc. on every node (and dealing with the ugly cases where a node was down when the config change happened). * Set watches on ZK nodes to be notified when config changes happen, and respond appropriately. I eventually ran out of time and moved on to other things, but this or something like it (e.g. using Riak Core) still seems like a better approach than what we have. In that context, it looks like ZkFarmer[1] might be a big help. AFAICT someone else was trying to solve almost exactly the same kind of server/config problem that we have, and wrapped their solution into a library. Is this a direction other devs might be interested in pursuing some day, if/when time allows? [1] https://github.com/rs/zkfarmer From johnmark at redhat.com Mon May 7 19:35:54 2012 From: johnmark at redhat.com (John Mark Walker) Date: Mon, 07 May 2012 15:35:54 -0400 (EDT) Subject: [Gluster-devel] Server outage - review.gluster.com - please stand by In-Reply-To: <5299ff98-4714-4702-8f26-0d6f62441fe3@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: Greetings, Our iWeb server, which hosts review.gluster.com, is currently down. I have filed an urgent request to reboot the server in question. If you notice anything else working poorly, aside from review.gluster.com, please let me know ASAP. I'll send a note when services are back to normal. -JM From ian.latter at midnightcode.org Mon May 7 22:17:41 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 08:17:41 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205072217.q47MHfmr003867@singularity.tronunltd.com> Is there anything written up on why you/all want every node to be completely conscious of every other node? I could see a couple of architectures that might work better (be more scalable) if the config minutiae were either not necessary to be shared or shared in only cases where the config minutiae were a dependency. RE ZK, I have an issue with it not being a binary at the linux distribution level. This is the reason I don't currently have Gluster's geo replication module in place .. ----- Original Message ----- >From: "Jeff Darcy" >To: >Subject: [Gluster-devel] ZkFarmer >Date: Mon, 07 May 2012 10:43:47 -0400 > > I've long felt that our ways of dealing with cluster membership and staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. 
Yeah, it brings in a > heavy Java dependency, but when I looked at some lighter-weight alternatives > they all seemed to be lacking in more important ways. Basically the idea was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles etc. > on every node (and dealing with the ugly cases where a node was down when the > config change happened). > > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. In that context, it looks like ZkFarmer[1] might be a big > help. AFAICT someone else was trying to solve almost exactly the same kind of > server/config problem that we have, and wrapped their solution into a library. > Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From ab at gluster.com Mon May 7 22:55:22 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 15:55:22 -0700 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: <4FA7CB36.6040701@redhat.com> References: <20120507043922.GA10874@homeworld.netbsd.org> <4FA7CB36.6040701@redhat.com> Message-ID: On Mon, May 7, 2012 at 6:16 AM, Jeff Darcy wrote: > On 05/07/2012 12:39 AM, Emmanuel Dreyfus wrote: >> Quick summary of the problem: when using transport-type socket with >> transport.address-family unspecified, glusterfs binds sockets with >> AF_UNSPEC, which will use either AF_INET or AF_INET6 socket, whatever the >> kernel prefers. At mine it uses AF_INET6, while the machine is not >> configured to use IPv6. As a result, glusterfs client cannot connect >> to glusterfs server. >> >> A workaround is to use option transport.address-family inet in >> glusterfsd/glusterd.vol but that option must also be specified in >> all volume files for all bricks and FUSE client, which is >> unfortunate because they are automatically generated. I proposed a >> patch so that glusterd transport.address-family setting is propagated >> to various places: http://review.gluster.com/3261 >> >> That did not meet consensus. Jeff Darcy notes that we should be able >> to listen both on AF_INET and AF_INET6 sockets at the same time. I >> had a look at the code, and indeed it could easily be done. The only >> trouble is how to specify the listeners. For now option transport >> defaults to socket,rdma. I suggest we add socket families in that >> specification. We would then have this default: >> ? ?option transport socket/inet,socket/inet6,rdma >> >> With the following semantics: >> ? ?socket -> AF_UNSPEC socket (backward comaptibility) >> ? ?socket/inet -> AF_INET socket >> ? ?socket/inet6 -> AF_INET6 socket >> ? ?socket/sdp -> AF_SDP socket >> ? ?rdma -> sameas before >> >> Any opinion on that plan? Please comment before I writa code, it will >> save me some time is the proposal is wrong. 
> > I think it looks like the right solution. I understand that keeping the > address-family multiplexing entirely in the socket code would be more complex, > since it changes the relationship between transport instances and file > descriptors (and threads in the SSL/multi-thread case). ?That's unfortunate, > but far from the most unfortunate thing about our transport code. > > I do wonder whether we should use '/' as the separator, since it kind of > implies the same kind of relationships between names and paths that we use for > translator names - e.g. cluster/dht is actually used as part of the actual path > for dht.so - and in this case that relationship doesn't actually exist. Another > idea, which I don't actually like any better but which I'll suggest for > completeness, would be to express the list of address families via an option: > > ? ? ? ?option transport.socket.address-family inet6 > > Now that I think about it, another benefit is that it supports multiple > instances of the same address family with different options, e.g. to support > segregated networks. ?Obviously we lack higher-level support for that right > now, but if that should ever change then it would be nice to have the right > low-level infrastructure in place for it. > Yes this should be controlled through volume options. "transport.address-family" is the right place to set it. Possible values are "inet, inet6, unix, inet-sdp". I would have named those user facing options as "ipv4, ipv6, sdp, all". If transport.address-family is not set. then if remote-host is set default to AF_INET (ipv4) if if transport.socket.connect-path is set default to AF_UNIX (unix) AF_UNSPEC is should be be taken as IPv4/IPv6. It is named appropriately. Default should be ipv4. I have not tested the patch. It is simply to explain how the changes should look like. I ignored legacy translators. When we implement concurrent support for multiple address-family (likely via mult-process model) we can worry about combinations. I agree. Combinations should look like "inet | inet6 | .." and not "inet / inet6 /.." -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: glusterfs-af-default-ipv4.diff Type: application/octet-stream Size: 9194 bytes Desc: not available URL: From jdarcy at redhat.com Tue May 8 00:43:31 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 07 May 2012 20:43:31 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205072217.q47MHfmr003867@singularity.tronunltd.com> References: <201205072217.q47MHfmr003867@singularity.tronunltd.com> Message-ID: <4FA86C33.6020901@redhat.com> On 05/07/2012 06:17 PM, Ian Latter wrote: > Is there anything written up on why you/all want every > node to be completely conscious of every other node? > > I could see a couple of architectures that might work > better (be more scalable) if the config minutiae were > either not necessary to be shared or shared in only > cases where the config minutiae were a dependency. Well, these aren't exactly minutiae. Everything at file or directory level is fully distributed and will remain so. We're talking only about stuff at the volume or server level, which is very little data but very broad in scope. Trying to segregate that only adds complexity and subtracts convenience, compared to having it equally accessible to (or through) any server. 
> RE ZK, I have an issue with it not being a binary at > the linux distribution level. This is the reason I don't > currently have Gluster's geo replication module in > place .. What exactly is your objection to interpreted or JIT compiled languages? Performance? Security? It's an unusual position, to say the least. From glusterdevel at louiszuckerman.com Tue May 8 03:52:02 2012 From: glusterdevel at louiszuckerman.com (Louis Zuckerman) Date: Mon, 7 May 2012 23:52:02 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <4FA7DFA3.1030300@redhat.com> References: <4FA7DFA3.1030300@redhat.com> Message-ID: Here's another ZooKeeper management framework that may be useful. It's called Curator, developed by Netflix, and recently released as open source. It probably has a bit more inertia than ZkFarmer too. http://techblog.netflix.com/2011/11/introducing-curator-netflix-zookeeper.html https://github.com/Netflix/curator HTH -louis On Mon, May 7, 2012 at 10:43 AM, Jeff Darcy wrote: > I've long felt that our ways of dealing with cluster membership and > staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. Yeah, it brings > in a > heavy Java dependency, but when I looked at some lighter-weight > alternatives > they all seemed to be lacking in more important ways. Basically the idea > was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, > or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles > etc. > on every node (and dealing with the ugly cases where a node was down when > the > config change happened). > > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. In that context, it looks like ZkFarmer[1] might be a > big > help. AFAICT someone else was trying to solve almost exactly the same > kind of > server/config problem that we have, and wrapped their solution into a > library. > Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Tue May 8 04:27:24 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 14:27:24 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205080427.q484RO09004857@singularity.tronunltd.com> > > Is there anything written up on why you/all want every > > node to be completely conscious of every other node? > > > > I could see a couple of architectures that might work > > better (be more scalable) if the config minutiae were > > either not necessary to be shared or shared in only > > cases where the config minutiae were a dependency. > > Well, these aren't exactly minutiae. Everything at file or directory level is > fully distributed and will remain so. 
We're talking only about stuff at the > volume or server level, which is very little data but very broad in scope. > Trying to segregate that only adds complexity and subtracts convenience, > compared to having it equally accessible to (or through) any server. Sorry, I didn't have time this morning to add more detail. Note that my concern isn't bandwidth, its flexibility; the less knowledge needed the more I can do crazy things in user land, like running boxes in different data centres and randomly power things up and down, randomly re- address, randomly replace in-box hardware, load balance, NAT, etc. It makes a dynamic environment difficult to construct, for example, when Gluster rejects the same volume-id being presented to an existing cluster from a new GFID. But there's no need to go even that complicated, let me pull out an example of where shared knowledge may be unnecessary; The work that I was doing in Gluster (pre glusterd) drove out one primary "server" which fronted a Replicate volume of both its own Distribute volume and that of another server or two - themselves serving a single Distribute volume. So the client connected to one server for one volume and the rest was black box / magic (from the client's perspective - big fast storage in many locations); in that case it could be said that servers needed some shared knowledge, while the clients didn't. The equivalent configuration in a glusterd world (from my experiments) pushed all of the distribute knowledge out to the client and I haven't had a response as to how to add a replicate on distributed volumes in this model, so I've lost replicate. But in this world, the client must know about everything and the server is simply a set of served/presented disks (as volumes). In this glusterd world, then, why does any server need to know of any other server, if the clients are doing all of the heavy lifting? The additional consideration is where the server both consumes and presents, but this would be captured in the client side view. i.e. given where glusterd seems to be driving, this knowledge seems to be needed on the client side (within glusterfs, not glusterfsd). To my mind this breaks the gluster architecture that I read about 2009, but I need to stress that I didn't get a reply to the glusterd architecture question that I posted about a month ago; so I don't know if glusterd is currently limiting deployment options because; - there is an intention to drive the heavy lifting to the client (for example for performance reasons in big deployments), or; - there are known limitations in the existing bricks/ modules (for example moving files thru distribute), or; - there is ultimately (long term) more flexibility seen in this model (and we're at a midway point between pre glusterd and post so it doesn't feel that way yet), or; - there is an intent to drive out a particular market outcome or match an existing storage model (the gluster presentation was driving towards cloud, and maybe those vendors don't use server side implementations), etc. As I don't have a clear/big picture in my mind; if I'm not considering all of the impacts, then my apologies. > > RE ZK, I have an issue with it not being a binary at > > the linux distribution level. This is the reason I don't > > currently have Gluster's geo replication module in > > place .. > > What exactly is your objection to interpreted or JIT compiled languages? > Performance? Security? It's an unusual position, to say the least. > Specifically, primarily, space. 
Saturn builds GlusterFS capacity from a 48 Megabyte Linux distribution and adding many Megabytes of Perl and/or Python and/or PHP and/or Java for a single script is impractical. My secondary concern is licensing (specifically in the Java run-time environment case). Hadoop forced my hand; GNU's JRE/compiler wasn't up to the task of running Hadoop when I last looked at it (about 2 or 3 years ago now) - well, it could run a 2007 or so version but not current ones at that time - so now I work with Gluster .. Going back to ZkFarmer; Considering other architectures; it depends on how you slice and dice the problem as to how much external support you need; > I've long felt that our ways of dealing with cluster > membership and staging of config changes is not > quite as robust and scalable as we might want. By way of example; The openMosix kernel extensions maintained their own information exchange between cluster nodes; if a node (ip) was added via the /proc interface, it was "in" the cluster. Therefore cluster membership was the hand-off/interface. It could be as simple as a text list on each node, or it could be left to a user space daemon which could then gate cluster membership - this suited everyone with a small cluster. The native daemon (omdiscd) used multicast packets to find nodes and then stuff those IP's into the /proc interface - this suited everyone with a private/dedicated cluster. A colleague and I wrote a TCP variation to allow multi-site discovery with SSH public key exchanges and IPSEC tunnel establishment as part of the gating process - this suited those with a distributed/ part-time cluster. To ZooKeeper's point (http://zookeeper.apache.org/), the discovery protocol that we created was weak and I've since found a model/algorithm that allows for far more robust discovery. The point being that, depending on the final cluster architecture for gluster (i.e. all are nodes are peers and thus all are cluster members, nodes are client or server and both are cluster members, nodes are client or server and only clients [or servers] are cluster members, etc) there may be simpler cluster management options .. Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From ab at gluster.com Tue May 8 04:33:50 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 21:33:50 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <4FA7DFA3.1030300@redhat.com> References: <4FA7DFA3.1030300@redhat.com> Message-ID: On Mon, May 7, 2012 at 7:43 AM, Jeff Darcy wrote: > I've long felt that our ways of dealing with cluster membership and staging of > config changes is not quite as robust and scalable as we might want. > Accordingly, I spent a bit of time a couple of weeks ago looking into the > possibility of using ZooKeeper to do some of this stuff. ?Yeah, it brings in a > heavy Java dependency, but when I looked at some lighter-weight alternatives > they all seemed to be lacking in more important ways. ?Basically the idea was > to do this: > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or > point everyone at an existing ZooKeeper cluster. > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe" > merely updates ZK, and "peer status" merely reads from it). > > * Store config information in ZK *once* instead of regenerating volfiles etc. > on every node (and dealing with the ugly cases where a node was down when the > config change happened). 
> > * Set watches on ZK nodes to be notified when config changes happen, and > respond appropriately. > > I eventually ran out of time and moved on to other things, but this or > something like it (e.g. using Riak Core) still seems like a better approach > than what we have. ?In that context, it looks like ZkFarmer[1] might be a big > help. ?AFAICT someone else was trying to solve almost exactly the same kind of > server/config problem that we have, and wrapped their solution into a library. > ?Is this a direction other devs might be interested in pursuing some day, > if/when time allows? > > > [1] https://github.com/rs/zkfarmer Real issue is here is: GlusterFS is a fully distributed system. It is OK for config files to be in one place (centralized). It is easier to manage and backup. Avati still claims that making distributed copies are not a problem (volume operations are fast, versioned and checksumed). Also the code base for replicating 3 way or all-node is same. We all need to come to agreement on the demerits of replicating the volume spec on every node. If we are convinced to keep the config info in one place, ZK is certainly one a good idea. I personally hate Java dependency. I still struggle with Java dependencies for browser and clojure. I can digest that if we are going to adopt Java over Python for future external modules. Alternatively we can also look at creating a replicated meta system volume. What ever we adopt, we should keep dependencies and installation steps to the bare minimum and simple. -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein From ab at gluster.com Tue May 8 04:56:10 2012 From: ab at gluster.com (Anand Babu Periasamy) Date: Mon, 7 May 2012 21:56:10 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205080427.q484RO09004857@singularity.tronunltd.com> References: <201205080427.q484RO09004857@singularity.tronunltd.com> Message-ID: On Mon, May 7, 2012 at 9:27 PM, Ian Latter wrote: > >> > Is there anything written up on why you/all want every >> > node to be completely conscious of every other node? >> > >> > I could see a couple of architectures that might work >> > better (be more scalable) if the config minutiae were >> > either not necessary to be shared or shared in only >> > cases where the config minutiae were a dependency. >> >> Well, these aren't exactly minutiae. ?Everything at file > or directory level is >> fully distributed and will remain so. ?We're talking only > about stuff at the >> volume or server level, which is very little data but very > broad in scope. >> Trying to segregate that only adds complexity and > subtracts convenience, >> compared to having it equally accessible to (or through) > any server. > > Sorry, I didn't have time this morning to add more detail. > > Note that my concern isn't bandwidth, its flexibility; the > less knowledge needed the more I can do crazy things > in user land, like running boxes in different data centres > and randomly power things up and down, randomly re- > address, randomly replace in-box hardware, load > balance, NAT, etc. ?It makes a dynamic environment > difficult to construct, for example, when Gluster rejects > the same volume-id being presented to an existing > cluster from a new GFID. 
> > But there's no need to go even that complicated, let > me pull out an example of where shared knowledge > may be unnecessary; > > The work that I was doing in Gluster (pre glusterd) drove > out one primary "server" which fronted a Replicate > volume of both its own Distribute volume and that of > another server or two - themselves serving a single > Distribute volume. ?So the client connected to one > server for one volume and the rest was black box / > magic (from the client's perspective - big fast storage > in many locations); in that case it could be said that > servers needed some shared knowledge, while the > clients didn't. > > The equivalent configuration in a glusterd world (from > my experiments) pushed all of the distribute knowledge > out to the client and I haven't had a response as to how > to add a replicate on distributed volumes in this model, > so I've lost replicate. ?But in this world, the client must > know about everything and the server is simply a set > of served/presented disks (as volumes). ?In this > glusterd world, then, why does any server need to > know of any other server, if the clients are doing all of > the heavy lifting? > > The additional consideration is where the server both > consumes and presents, but this would be captured in > the client side view. ?i.e. given where glusterd seems > to be driving, this knowledge seems to be needed on > the client side (within glusterfs, not glusterfsd). > > To my mind this breaks the gluster architecture that I > read about 2009, but I need to stress that I didn't get > a reply to the glusterd architecture question that I > posted about a month ago; ?so I don't know if glusterd > is currently limiting deployment options because; > ?- there is an intention to drive the heavy lifting to the > ? ?client (for example for performance reasons in big > ? ?deployments), or; > ?- there are known limitations in the existing bricks/ > ? ?modules (for example moving files thru distribute), > ? ?or; > ?- there is ultimately (long term) more flexibility seen > ? ?in this model (and we're at a midway point between > ? ?pre glusterd and post so it doesn't feel that way > ? ?yet), or; > ?- there is an intent to drive out a particular market > ? ?outcome or match an existing storage model (the > ? ?gluster presentation was driving towards cloud, > ? ?and maybe those vendors don't use server side > ? ?implementations), etc. > > As I don't have a clear/big picture in my mind; if I'm > not considering all of the impacts, then my apologies. > > >> > RE ZK, I have an issue with it not being a binary at >> > the linux distribution level. ?This is the reason I don't >> > currently have Gluster's geo replication module in >> > place .. >> >> What exactly is your objection to interpreted or JIT > compiled languages? >> Performance? ?Security? ?It's an unusual position, to say > the least. >> > > Specifically, primarily, space. ?Saturn builds GlusterFS > capacity from a 48 Megabyte Linux distribution and > adding many Megabytes of Perl and/or Python and/or > PHP and/or Java for a single script is impractical. > > My secondary concern is licensing (specifically in the > Java run-time environment case). ?Hadoop forced my > hand; GNU's JRE/compiler wasn't up to the task of > running Hadoop when I last looked at it (about 2 or 3 > years ago now) - well, it could run a 2007 or so > version but not current ones at that time - so now I > work with Gluster .. 
> > > > Going back to ZkFarmer; > > Considering other architectures; it depends on how > you slice and dice the problem as to how much > external support you need; > ?> I've long felt that our ways of dealing with cluster > ?> membership and staging of config changes is not > ?> quite as robust and scalable as we might want. > > By way of example; > ?The openMosix kernel extensions maintained their > own information exchange between cluster nodes; if > a node (ip) was added via the /proc interface, it was > "in" the cluster. ?Therefore cluster membership was > the hand-off/interface. > ?It could be as simple as a text list on each node, or > it could be left to a user space daemon which could > then gate cluster membership - this suited everyone > with a small cluster. > ?The native daemon (omdiscd) used multicast > packets to find nodes and then stuff those IP's into > the /proc interface - this suited everyone with a > private/dedicated cluster. > ?A colleague and I wrote a TCP variation to allow > multi-site discovery with SSH public key exchanges > and IPSEC tunnel establishment as part of the > gating process - this suited those with a distributed/ > part-time cluster. ?To ZooKeeper's point > (http://zookeeper.apache.org/), the discovery > protocol that we created was weak and I've since > found a model/algorithm that allows for far more > robust discovery. > > ?The point being that, depending on the final cluster > architecture for gluster (i.e. all are nodes are peers > and thus all are cluster members, nodes are client > or server and both are cluster members, nodes are > client or server and only clients [or servers] are > cluster members, etc) there may be simpler cluster > management options .. > > > Cheers, > Reason to keep the volume spec files on all servers is simply to be fully distributed. No one node or set of nodes should hold the cluster hostage. Code to keep them in sync over 2 nodes or 20 nodes is essentially the same. We are revisiting this situation now because we want to scale to 1000s of nodes potentially. Gluster CLI operations should not time out or slow down. If ZK requires proprietary JRE for stability, Java will be NO NO!. We may not need ZK at all. If we simply decide to centralize the config, GlusterFS has enough code to handle them. Again Avati will argue that it is exactly the same code as now. My point is to keep things simple as we scale. Even if the code base is same, we should still restrict it to N selected nodes. It is matter of adding config option. -- Anand Babu Periasamy Blog [ http://www.unlocksmith.org ] Twitter [ http://twitter.com/abperiasamy ] Imagination is more important than knowledge --Albert Einstein From ian.latter at midnightcode.org Tue May 8 05:21:37 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 08 May 2012 15:21:37 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205080521.q485Lb9d005117@singularity.tronunltd.com> > No one node or set of nodes should hold the > cluster hostage. Agreed - this is fundamental. > We are revisiting this situation now because we > want to scale to 1000s of nodes potentially. Good, I hate upper bounds on architectures :) Though I haven't tested my own implementation, I understand that one implementation of the discovery protocol that I've used, scaled to 20,000 hosts across three sites in two countries; this is the the type of robust outcome that can be manipulated at the macro scale - i.e. without manipulating per-node details. 
> Gluster CLI operations should not time out or > slow down. This is critical - not just the CLI but also the storage interface (in a redundant environment); infrastructure wears and fails, thus failing infrastructure should be regarded as the norm/ default. > If ZK requires proprietary JRE for stability, > Java will be NO NO!. *Fantastic* > My point is to keep things simple as we scale. I couldn't agree more. In that principle I ask that each dependency on cluster knowledge be considered carefully with a minimalist approach. -- Ian Latter Late night coder .. http://midnightcode.org/ From vbellur at redhat.com Tue May 8 09:15:13 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 08 May 2012 14:45:13 +0530 Subject: [Gluster-devel] Server outage - review.gluster.com - please stand by In-Reply-To: References: Message-ID: <4FA8E421.3090108@redhat.com> On 05/08/2012 01:05 AM, John Mark Walker wrote: > Greetings, > > Our iWeb server, which hosts review.gluster.com, is currently down. I have filed an urgent request to reboot the server in question. > > If you notice anything else working poorly, aside from review.gluster.com, please let me know ASAP. > > I'll send a note when services are back to normal. All services are back to normal. Please let us know if you notice any issue. Thanks, Vijay From xhernandez at datalab.es Tue May 8 09:34:35 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 08 May 2012 11:34:35 +0200 Subject: [Gluster-devel] A healing translator Message-ID: <4FA8E8AB.2040604@datalab.es> Hello developers, I would like to expose some ideas we are working on to create a new kind of translator that should be able to unify and simplify to some extent the healing procedures of complex translators. Currently, the only translator with complex healing capabilities that we are aware of is AFR. We are developing another translator that will also need healing capabilities, so we thought that it would be interesting to create a new translator able to handle the common part of the healing process and hence to simplify and avoid duplicated code in other translators. The basic idea of the new translator is to handle healing tasks nearer the storage translator on the server nodes instead to control everything from a translator on the client nodes. Of course the heal translator is not able to handle healing entirely by itself, it needs a client translator which will coordinate all tasks. The heal translator is intended to be used by translators that work with multiple subvolumes. I will try to explain how it works without entering into too much details. There is an important requisite for all client translators that use healing: they must have exactly the same list of subvolumes and in the same order. Currently, I think this is not a problem. The heal translator treats each file as an independent entity, and each one can be in 3 modes: 1. Normal mode This is the normal mode for a copy or fragment of a file when it is synchronized and consistent with the same file on other nodes (for example with other replicas. It is the client translator who decides if it is synchronized or not). 2. Healing mode This is the mode used when a client detects an inconsistency in the copy or fragment of the file stored on this node and initiates the healing procedures. 3. 
Provider mode (I don't like very much this name, though) This is the mode used by client translators when an inconsistency is detected in this file, but the copy or fragment stored in this node is considered good and it will be used as a source to repair the contents of this file on other nodes. Initially, when a file is created, it is set in normal mode. Client translators that make changes must guarantee that they send the modification requests in the same order to all the servers. This should be done using inodelk/entrylk. When a change is sent to a server, the client must include a bitmap mask of the clients to which the request is being sent. Normally this is a bitmap containing all the clients, however, when a server fails for some reason some bits will be cleared. The heal translator uses this bitmap to early detect failures on other nodes from the point of view of each client. When this condition is detected, the request is aborted with an error and the client is notified with the remaining list of valid nodes. If the client considers the request can be successfully server with the remaining list of nodes, it can resend the request with the updated bitmap. The heal translator also updates two file attributes for each change request to mantain the "version" of the data and metadata contents of the file. A similar task is currently made by AFR using xattrop. This would not be needed anymore, speeding write requests. The version of data and metadata is returned to the client for each read request, allowing it to detect inconsistent data. When a client detects an inconsistency, it initiates healing. First of all, it must lock the entry and inode (when necessary). Then, from the data collected from each node, it must decide which nodes have good data and which ones have bad data and hence need to be healed. There are two possible cases: 1. File is not a regular file In this case the reconstruction is very fast and requires few requests, so it is done while the file is locked. In this case, the heal translator does nothing relevant. 2. File is a regular file For regular files, the first step is to synchronize the metadata to the bad nodes, including the version information. Once this is done, the file is set in healing mode on bad nodes, and provider mode on good nodes. Then the entry and inode are unlocked. When a file is in provider mode, it works as in normal mode, but refuses to start another healing. Only one client can be healing a file. When a file is in healing mode, each normal write request from any client are handled as if the file were in normal mode, updating the version information and detecting possible inconsistencies with the bitmap. Additionally, the healing translator marks the written region of the file as "good". Each write request from the healing client intended to repair the file must be marked with a special flag. In this case, the area that wants to be written is filtered by the list of "good" ranges (if there are any intersection with a good range, it is removed from the request). The resulting set of ranges are propagated to the lower translator and added to the list of "good" ranges but the version information is not updated. Read requests are only served if the range requested is entirely contained into the "good" regions list. There are some additional details, but I think this is enough to have a general idea of its purpose and how it works. The main advantages of this translator are: 1. Avoid duplicated code in client translators 2. 
Simplify and unify healing methods in client translators 3. xattrop is not needed anymore in client translators to keep track of changes 4. Full file contents are repaired without locking the file 5. Better detection and prevention of some split brain situations as soon as possible I think it would be very useful. It seems to me that it works correctly in all situations, however I don't have all the experience that other developers have with the healing functions of AFR, so I will be happy to answer any question or suggestion to solve problems it may have or to improve it. What do you think about it ? Thank you, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From jdarcy at redhat.com Tue May 8 12:57:31 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Tue, 08 May 2012 08:57:31 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: <4FA9183B.5080708@redhat.com> On 05/08/2012 12:33 AM, Anand Babu Periasamy wrote: > Real issue is here is: GlusterFS is a fully distributed system. It is > OK for config files to be in one place (centralized). It is easier to > manage and backup. Avati still claims that making distributed copies > are not a problem (volume operations are fast, versioned and > checksumed). It's also grossly inefficient at 100-node scale. I'll also need some convincing before I believe that nodes which are down during a config change will catch up automatically and reliably in all cases. I think this is even more of an issue with membership than with config data. All-to-all pings are just not acceptable at 100-node or greater scale. We need something better, and more importantly designing cluster membership protocols is just not a business we should even be in. We shouldn't be devoting our own time to that when we can just use something designed by people who have that as their focus. > Also the code base for replicating 3 way or all-node is > same. We all need to come to agreement on the demerits of replicating > the volume spec on every node. It's somewhat similar to how we replicate data - we need enough copies to survive a certain number of anticipated failures. > If we are convinced to keep the config info in one place, ZK is > certainly one a good idea. I personally hate Java dependency. I still > struggle with Java dependencies for browser and clojure. I can digest > that if we are going to adopt Java over Python for future external > modules. Alternatively we can also look at creating a replicated meta > system volume. What ever we adopt, we should keep dependencies and > installation steps to the bare minimum and simple. I personally hate the Java dependency too. I'd much rather have something in C/Go/Python/Erlang but couldn't find anything that had the same (useful) feature set. I also considered the idea of storing config in a hand-crafted GlusterFS volume, using our own mechanisms for distributing/finding and replicating data. That's at least an area where we can claim some expertise. Such layering does create a few interesting issues, but nothing intractable. The big drawback is that it only solves the config-data problem; a solution which combines that with cluster membership is IMO preferable. The development drag of having to maintain that functionality ourselves, and hook every new feature into the not-very-convenient APIs that have predictably resulted, is considerable. 
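To make the membership/config split above a little more concrete, here is a rough C sketch of the two ZooKeeper interactions being discussed (an ephemeral znode per peer, plus a watch on shared config), using the libzookeeper C client rather than Java. The znode paths, server list, header location and error handling are all assumptions made for illustration; nothing like this exists in glusterd today, and the exact header and library names should be checked against the installed ZooKeeper C bindings (typically linked with -lzookeeper_mt).

/* sketch only: register this peer as an ephemeral znode and watch a
 * shared config znode, instead of all-to-all pings and per-node volfiles.
 * Parent znodes (/gluster/peers, /gluster/config) are assumed to exist. */
#include <stdio.h>
#include <string.h>
#include <zookeeper/zookeeper.h>

static void
config_watcher (zhandle_t *zh, int type, int state, const char *path,
                void *ctx)
{
        /* fires when /gluster/config changes or the session state moves;
         * a real daemon would re-read the config and re-arm the watch. */
        printf ("zk event type=%d state=%d path=%s\n", type, state,
                path ? path : "");
}

int
main (void)
{
        zhandle_t *zh;
        char       created[256];
        char       buf[1024];
        int        len = sizeof (buf);

        zh = zookeeper_init ("zk1:2181,zk2:2181,zk3:2181", config_watcher,
                             30000, NULL, NULL, 0);
        if (!zh)
                return 1;

        /* "peer probe" becomes: create an ephemeral node that disappears
         * automatically when this glusterd's session dies, so "peer
         * status" is just a read of /gluster/peers. */
        zoo_create (zh, "/gluster/peers/uuid-1234", "hostname", 8,
                    &ZOO_OPEN_ACL_UNSAFE, ZOO_EPHEMERAL,
                    created, sizeof (created));

        /* read the volume config once and leave a watch behind, rather
         * than regenerating and pushing volfiles to every node. */
        zoo_wget (zh, "/gluster/config", config_watcher, NULL,
                  buf, &len, NULL);

        /* ... event loop / rest of the daemon ... */
        zookeeper_close (zh);
        return 0;
}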
From jdarcy at redhat.com Tue May 8 12:42:19 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Tue, 08 May 2012 08:42:19 -0400 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205080427.q484RO09004857@singularity.tronunltd.com> References: <201205080427.q484RO09004857@singularity.tronunltd.com> Message-ID: <4FA914AB.8030209@redhat.com> On 05/08/2012 12:27 AM, Ian Latter wrote: > The equivalent configuration in a glusterd world (from > my experiments) pushed all of the distribute knowledge > out to the client and I haven't had a response as to how > to add a replicate on distributed volumes in this model, > so I've lost replicate. This doesn't seem to be a problem with replicate-first vs. distribute-first, but with client-side vs. server-side deployment of those translators. You *can* construct your own volfiles that do these things on the servers. It will work, but you won't get a lot of support for it. The issue here is that we have only a finite number of developers, and a near-infinite number of configurations. We can't properly qualify everything. One way we've tried to limit that space is by preferring distribute over replicate, because replicate does a better job of shielding distribute from brick failures than vice versa. Another is to deploy both on the clients, following the scalability rule of pushing effort to the most numerous components. The code can support other arrangements, but the people might not. BTW, a similar concern exists with respect to replication (i.e. AFR) across data centers. Performance is going to be bad, and there's not going to be much we can do about it. > But in this world, the client must > know about everything and the server is simply a set > of served/presented disks (as volumes). In this > glusterd world, then, why does any server need to > know of any other server, if the clients are doing all of > the heavy lifting? First, because config changes have to apply across servers. Second, because server machines often spin up client processes for things like repair or rebalance. From ian.latter at midnightcode.org Tue May 8 23:08:32 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 09:08:32 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205082308.q48N8WQg008425@singularity.tronunltd.com> > On 05/08/2012 12:27 AM, Ian Latter wrote: > > The equivalent configuration in a glusterd world (from > > my experiments) pushed all of the distribute knowledge > > out to the client and I haven't had a response as to how > > to add a replicate on distributed volumes in this model, > > so I've lost replicate. > > This doesn't seem to be a problem with replicate-first vs. distribute-first, > but with client-side vs. server-side deployment of those translators. You > *can* construct your own volfiles that do these things on the servers. It will > work, but you won't get a lot of support for it. The issue here is that we > have only a finite number of developers, and a near-infinite number of > configurations. We can't properly qualify everything. One way we've tried to > limit that space is by preferring distribute over replicate, because replicate > does a better job of shielding distribute from brick failures than vice versa. > Another is to deploy both on the clients, following the scalability rule of > pushing effort to the most numerous components. The code can support other > arrangements, but the people might not. 
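As a rough illustration of the hand-built, server-side arrangement being talked about here (the replicate-of-distributes layout Ian described, with replicate running on a server rather than on the clients), a volfile along these lines would express it in the old-style syntax. All volume names, hosts and export paths are made up, server2 is assumed to export its own distribute as "dist" in the same way, and as noted above this is exactly the kind of configuration that works but is not a qualified/supported layout:

volume local-a
    type storage/posix
    option directory /export/a
end-volume

volume local-b
    type storage/posix
    option directory /export/b
end-volume

volume local-dist
    type cluster/distribute
    subvolumes local-a local-b
end-volume

volume remote-dist
    type protocol/client
    option transport-type socket
    option remote-host server2
    option remote-subvolume dist
end-volume

volume mirror
    type cluster/replicate
    subvolumes local-dist remote-dist
end-volume

volume server
    type protocol/server
    option transport-type socket
    option auth.addr.mirror.allow *
    subvolumes mirror
end-volume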
Sure, I have my own vol files that do (did) what I wanted and I was supporting myself (and users); the question (and the point) is what is the GlusterFS *intent*? I'll write an rsyncd wrapper myself, to run on top of Gluster, if the intent is not allow the configuration I'm after (arbitrary number of disks in one multi-host environment replicated to an arbitrary number of disks in another multi-host environment, where ideally each environment need not sum to the same data capacity, presented in a single contiguous consumable storage layer to an arbitrary number of unintelligent clients, that is as fault tolerant as I choose it to be including the ability to add and offline/online and remove storage as I so choose) .. or switch out the whole solution if Gluster is heading away from my needs. I just need to know what the direction is .. I may even be able to help get you there if you tell me :) > BTW, a similar concern exists with respect to replication (i.e. AFR) across > data centers. Performance is going to be bad, and there's not going to be much > we can do about it. Hmm .. that depends .. these sorts of statements need context/qualification (in bandwidth and latency terms). For example the last multi-site environment that I did architecture for was two DCs set 32kms apart with a redundant 20Gbps layer-2 (ethernet) stretch between them - latency was 1ms average, 2ms max (the fiber actually took a 70km path). Didn't run Gluster on it, but we did stretch a number things that "couldn't" be stretched. > > But in this world, the client must > > know about everything and the server is simply a set > > of served/presented disks (as volumes). In this > > glusterd world, then, why does any server need to > > know of any other server, if the clients are doing all of > > the heavy lifting? > > First, because config changes have to apply across servers. Second, because > server machines often spin up client processes for things like repair or > rebalance. Yep, but my reading is that the config's that the servers need are local - to make a disk a share (volume), and that as you've described the rest are "client processes" (even when on something built as a "server"), so if you catered for all clients then you'd be set? I.e. AFR now runs in the client? And I am sick of the word-wrap on this client .. I think you've finally convinced me to fix it ... what's normal these days - still 80 chars? -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Wed May 9 00:57:49 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 17:57:49 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205082308.q48N8WQg008425@singularity.tronunltd.com> References: <201205082308.q48N8WQg008425@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 4:08 PM, Ian Latter wrote: > > On 05/08/2012 12:27 AM, Ian Latter wrote: > > > The equivalent configuration in a glusterd world (from > > > my experiments) pushed all of the distribute knowledge > > > out to the client and I haven't had a response as to how > > > to add a replicate on distributed volumes in this model, > > > so I've lost replicate. > > > > This doesn't seem to be a problem with replicate-first vs. > distribute-first, > > but with client-side vs. server-side deployment of those > translators. You > > *can* construct your own volfiles that do these things on > the servers. It will > > work, but you won't get a lot of support for it. 
The > issue here is that we > > have only a finite number of developers, and a > near-infinite number of > > configurations. We can't properly qualify everything. > One way we've tried to > > limit that space is by preferring distribute over > replicate, because replicate > > does a better job of shielding distribute from brick > failures than vice versa. > > Another is to deploy both on the clients, following the > scalability rule of > > pushing effort to the most numerous components. The code > can support other > > arrangements, but the people might not. > > Sure, I have my own vol files that do (did) what I wanted > and I was supporting myself (and users); the question > (and the point) is what is the GlusterFS *intent*? The "intent" (more or less - I hate to use the word as it can imply a commitment to what I am about to say, but there isn't one) is to keep the bricks (server process) dumb and have the intelligence on the client side. This is a "rough goal". There are cases where replication on the server side is inevitable (in the case of NFS access) but we keep the software architecture undisturbed by running a client process on the server machine to achieve it. We do plan to support "replication on the server" in the future while still retaining the existing software architecture as much as possible. This is particularly useful in Hadoop environment where the jobs expect write performance of a single copy and expect copy to happen in the background. We have the proactive self-heal daemon running on the server machines now (which again is a client process which happens to be physically placed on the server) which gives us many interesting possibilities - i.e, with simple changes where we fool the client side replicate translator at the time of transaction initiation that only the closest server is up at that point of time and write to it alone, and have the proactive self-heal daemon perform the extra copies in the background. This would be consistent with other readers as they get directed to the "right" version of the file by inspecting the changelogs while the background replication is in progress. The intention of the above example is to give a general sense of how we want to evolve the architecture (i.e, the "intention" you were referring to) - keep the clients intelligent and servers dumb. If some intelligence needs to be built on the physical server, tackle it by loading a client process there (there are also "pathinfo xattr" kind of internal techniques to figure out locality of the clients in a generic way without bringing "server sidedness" into them in a harsh way) I'll > write an rsyncd wrapper myself, to run on top of Gluster, > if the intent is not allow the configuration I'm after > (arbitrary number of disks in one multi-host environment > replicated to an arbitrary number of disks in another > multi-host environment, where ideally each environment > need not sum to the same data capacity, presented in a > single contiguous consumable storage layer to an > arbitrary number of unintelligent clients, that is as fault > tolerant as I choose it to be including the ability to add > and offline/online and remove storage as I so choose) .. > or switch out the whole solution if Gluster is heading > away from my needs. I just need to know what the > direction is .. I may even be able to help get you there if > you tell me :) > > There are good and bad in both styles (distribute on top v/s replicate on top). Replicate on top gives you much better flexibility of configuration. 
Distribute on top is easier for us developers. As a user I would like replicate on top as well. But the problem today is that replicate (and self-heal) does not understand "partial failure" of its subvolumes. If one of the subvolume of replicate is a distribute, then today's replicate only understands complete failure of the distribute set or it assumes everything is completely fine. An example is self-healing of directory entries. If a file is "missing" in one subvolume because a distribute node is temporarily down, replicate has no clue why it is missing (or that it should keep away from attempting to self-heal). Along the same lines, it does not know that once a server is taken off from its distribute subvolume for good that it needs to start recreating missing files. The effort to fix this seems to be big enough to disturb the inertia of status quo. If this is fixed, we can definitely adopt a replicate-on-top mode in glusterd. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From abperiasamy at gmail.com Wed May 9 01:05:37 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Tue, 8 May 2012 18:05:37 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: <201205082308.q48N8WQg008425@singularity.tronunltd.com> References: <201205082308.q48N8WQg008425@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 4:08 PM, Ian Latter wrote: >> On 05/08/2012 12:27 AM, Ian Latter wrote: >> > The equivalent configuration in a glusterd world (from >> > my experiments) pushed all of the distribute knowledge >> > out to the client and I haven't had a response as to how >> > to add a replicate on distributed volumes in this model, >> > so I've lost replicate. >> >> This doesn't seem to be a problem with replicate-first vs. > distribute-first, >> but with client-side vs. server-side deployment of those > translators. ?You >> *can* construct your own volfiles that do these things on > the servers. ?It will >> work, but you won't get a lot of support for it. ?The > issue here is that we >> have only a finite number of developers, and a > near-infinite number of >> configurations. ?We can't properly qualify everything. > One way we've tried to >> limit that space is by preferring distribute over > replicate, because replicate >> does a better job of shielding distribute from brick > failures than vice versa. >> Another is to deploy both on the clients, following the > scalability rule of >> pushing effort to the most numerous components. ?The code > can support other >> arrangements, but the people might not. > > Sure, I have my own vol files that do (did) what I wanted > and I was supporting myself (and users); the question > (and the point) is what is the GlusterFS *intent*? ?I'll > write an rsyncd wrapper myself, to run on top of Gluster, > if the intent is not allow the configuration I'm after > (arbitrary number of disks in one multi-host environment > replicated to an arbitrary number of disks in another > multi-host environment, where ideally each environment > need not sum to the same data capacity, presented in a > single contiguous consumable storage layer to an > arbitrary number of unintelligent clients, that is as fault > tolerant as I choose it to be including the ability to add > and offline/online and remove storage as I so choose) .. > or switch out the whole solution if Gluster is heading > away from my ?needs. ?I just need to know what the > direction is .. 
I may even be able to help get you there if > you tell me :) Rsync'ing the vol spec files is the simplest and elegant approach. It is how glusterfs originally handled config files. How ever elastic volume management (online volume management operations) requires synchronized online changes to volume spec files. This requires GlusterFS to manage volume specification files internally. That is why we brought glusterd in 3.1. Real question is: do we want to keep the volume spec files on all nodes (fully distributed) or few selected nodes. > >> BTW, a similar concern exists with respect to replication > (i.e. AFR) across >> data centers. ?Performance is going to be bad, and there's > not going to be much >> we can do about it. > > Hmm .. that depends .. these sorts of statements need > context/qualification (in bandwidth and latency terms). ?For > example the last multi-site environment that I did > architecture for was two DCs set 32kms apart with a > redundant 20Gbps layer-2 (ethernet) stretch between > them - latency was 1ms average, 2ms max (the fiber > actually took a 70km path). ?Didn't run Gluster on it, but > we did stretch a number things that "couldn't" be stretched. > > >> > But in this world, the client must >> > know about everything and the server is simply a set >> > of served/presented disks (as volumes). ?In this >> > glusterd world, then, why does any server need to >> > know of any other server, if the clients are doing all of >> > the heavy lifting? >> >> First, because config changes have to apply across > servers. ?Second, because >> server machines often spin up client processes for things > like repair or >> rebalance. > > Yep, but my reading is that the config's that the servers > need are local - to make a disk a share (volume), and > that as you've described the rest are "client processes" > (even when on something built as a "server"), so if you > catered for all clients then you'd be set? ?I.e. AFR now > runs in the client? > > > And I am sick of the word-wrap on this client .. I think > you've finally convinced me to fix it ... what's normal > these days - still 80 chars? I used to line-wrap (gnus and cool emacs extensions). It doesn't make sense to line wrap any more. Let the email client handle it depending on the screen size of the device (mobile / tablet / desktop). -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Wed May 9 01:33:50 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 18:33:50 -0700 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: On Mon, May 7, 2012 at 9:33 PM, Anand Babu Periasamy wrote: > On Mon, May 7, 2012 at 7:43 AM, Jeff Darcy wrote: > > I've long felt that our ways of dealing with cluster membership and > staging of > > config changes is not quite as robust and scalable as we might want. > > Accordingly, I spent a bit of time a couple of weeks ago looking into the > > possibility of using ZooKeeper to do some of this stuff. Yeah, it > brings in a > > heavy Java dependency, but when I looked at some lighter-weight > alternatives > > they all seemed to be lacking in more important ways. Basically the > idea was > > to do this: > > > > * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper > servers, or > > point everyone at an existing ZooKeeper cluster. 
> > > > * Use ZK ephemeral nodes as a way to track cluster membership ("peer > probe" > > merely updates ZK, and "peer status" merely reads from it). > > > > * Store config information in ZK *once* instead of regenerating volfiles > etc. > > on every node (and dealing with the ugly cases where a node was down > when the > > config change happened). > > > > * Set watches on ZK nodes to be notified when config changes happen, and > > respond appropriately. > > > > I eventually ran out of time and moved on to other things, but this or > > something like it (e.g. using Riak Core) still seems like a better > approach > > than what we have. In that context, it looks like ZkFarmer[1] might be > a big > > help. AFAICT someone else was trying to solve almost exactly the same > kind of > > server/config problem that we have, and wrapped their solution into a > library. > > Is this a direction other devs might be interested in pursuing some day, > > if/when time allows? > > > > > > [1] https://github.com/rs/zkfarmer > > Real issue is here is: GlusterFS is a fully distributed system. It is > OK for config files to be in one place (centralized). It is easier to > manage and backup. Avati still claims that making distributed copies > are not a problem (volume operations are fast, versioned and > checksumed). Also the code base for replicating 3 way or all-node is > same. We all need to come to agreement on the demerits of replicating > the volume spec on every node. > My claim is somewhat similar to what you said literally, but slightly different in meaning. What I mean is, while it is true keeping multiple copies of the volfile is more expensive/resource consuming in theory, what is the breaking point in terms of number of servers where it begins to matter? There are trivial (low lying) enhancements which are possible (for e.g, store volfiles of a volume only on participating servers instead of all servers) which could address a class of concerns. There are clear advantages in having volfiles in all the participating nodes at least - it takes away dependency on order of booting of servers in your data centre. If volfiles are available locally you dont have to wait/retry for the "central servers" to come up first. Whether this is volfiles managed by glusterd, or "storage servers" of ZK, it is a big advantage to have the startup of a given server decoupled from the others (of course the coupling comes in at an operational level at the time of volume modifications, but that is much more acceptable). If the storage of volfiles on all servers really seems unnecessary, we should first come up with real hard numbers - number of servers v/s latency of volume operations and then figure out at what point it starts becoming unacceptably slow. Maybe a good solution is to just propagate the volfiles in the background while still retaining version info than introducing a more intrusive change? But we really need the numbers first. > > If we are convinced to keep the config info in one place, ZK is > certainly one a good idea. I personally hate Java dependency. I still > struggle with Java dependencies for browser and clojure. I can digest > that if we are going to adopt Java over Python for future external > modules. Alternatively we can also look at creating a replicated meta > system volume. What ever we adopt, we should keep dependencies and > installation steps to the bare minimum and simple. 
> > It is true other projects have figured out the problem of membership and configuration management and specialize at doing that. That is very good for the entire computing community as a whole. If there are components we can incorporate and build upon their work, that is very desirable. At the same time we also need to check what other baggage we inherit along with the specialized expertise we take on. One of the biggest strengths of Gluster has been its "lightweight"edness and lack of dependencies - which in turn has driven our adoption significantly which in turn results in higher feedback and bug reports etc. (i.e, it is not an isolated strength in itself). Enforcing a Java dependency down the throat of users who want a simple distributed filesystem (yes, the moment we stop thinking of gluster as a "simple" distributed filesystem - even though it may be an oxymoron technically, but I guess you know what I mean :) it's a slippery slope towards it becoming "yet another" distributed filesystem.) The simplicity is what "makes" gluster to a large extent what it is. This makes the developer's life miserable to a fair degree, but it anyways always is, one way or another ;) I am not against adopting external projects. There are good reasons many times to do so. If there are external projects which are "compatible in personality" with gluster and helps us avoid reinventing the wheel, we must definitely do so. If they are not compatible, I'm sure there are lessons and ideas we can adopt, if not code. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Wed May 9 04:18:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 9 May 2012 04:18:46 +0000 Subject: [Gluster-devel] ZkFarmer In-Reply-To: References: <4FA7DFA3.1030300@redhat.com> Message-ID: <20120509041846.GB18684@homeworld.netbsd.org> On Mon, May 07, 2012 at 09:33:50PM -0700, Anand Babu Periasamy wrote: > I personally hate Java dependency. Me too. I know Java programs are supposed to have decent performances, but my experiences had always been terrible. Please do not add a dependency on Java. -- Emmanuel Dreyfus manu at netbsd.org From manu at netbsd.org Wed May 9 04:41:47 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 9 May 2012 04:41:47 +0000 Subject: [Gluster-devel] glusterfs-3.3.0qa40 released In-Reply-To: <20120507081553.5AA00100C5@build.gluster.com> References: <20120507081553.5AA00100C5@build.gluster.com> Message-ID: <20120509044147.GC18684@homeworld.netbsd.org> On Mon, May 07, 2012 at 01:15:50AM -0700, Vijay Bellur wrote: > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa40.tar.gz Hi There is a small issue with python: the machine that runs autoconf only has python 2.5 installed, and as a result, the generated configure script fails to detect an installed python 2.6 or higher. Here is an example at mine, where python 2.7 is installed: checking for a Python interpreter with version >= 2.4... none configure: error: no suitable Python interpreter found That can be fixed by patching configure, but it would be nice if gluster builds could contain the check with latest python. -- Emmanuel Dreyfus manu at netbsd.org From renqiang at 360buy.com Wed May 9 04:46:08 2012 From: renqiang at 360buy.com (=?gb2312?B?yM7Hvw==?=) Date: Wed, 9 May 2012 12:46:08 +0800 Subject: [Gluster-devel] How to repair a 1TB disk in 30 mins Message-ID: <000301cd2d9e$a6b07fc0$f4117f40$@com> Dear All: I have a question. 
When I have a large cluster, maybe more than 10PB of data, if a file has 3 copies and each disk has 1TB capacity, we need about 30,000 disks. All disks are very cheap and are easily damaged. We must repair a 1TB disk in 30 mins. As far as I know, in the gluster architecture, all data on the damaged disk will be repaired to the new disk which is used to replace the damaged disk. As a result of the writing speed of the disk, when we repair a 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 mins? -------------- next part -------------- An HTML attachment was scrubbed... URL:
From ian.latter at midnightcode.org Wed May 9 05:35:40 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 15:35:40 +1000 Subject: [Gluster-devel] automake Message-ID: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Hello, I have built a new module and I can't seem to get the changed makefiles to be built. I have not used "configure" in any of my projects and I'm not seeing an answer from my google searches. The error that I get is during the "make" where glusterfs-3.2.6/missing errors at line 52 "automake-1.9: command not found". This is a newer RedHat environment and it has automake 1.11 .. if I cp 1.11 to 1.9 I get other errors ... libtool is reporting that the automake version is 1.11.1. I believe that it is getting the 1.9 version from Gluster ... How do I get a new Makefile.am and Makefile.in to work in this structure? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/
From harsha at gluster.com Wed May 9 06:03:00 2012 From: harsha at gluster.com (Harshavardhana) Date: Tue, 8 May 2012 23:03:00 -0700 Subject: [Gluster-devel] automake In-Reply-To: <201205090535.q495Ze5E009996@singularity.tronunltd.com> References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: Ian, Please re-run the ./autogen.sh and use again. Make sure you have added entries in 'configure.ac' and 'Makefile.am' for the respective module name and directory. -Harsha On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > Hello, > > > I have built a new module and I can't seem to > get the changed makefiles to be built. I have not > used "configure" in any of my projects and I'm > not seeing an answer from my google searches. > > The error that I get is during the "make" where > glusterfs-3.2.6/missing errors at line 52 > "automake-1.9: command not found". > > This is a newer RedHat environment and it has > automake 1.11 .. if I cp 1.11 to 1.9 I get other > errors ... libtool is reporting that the automake > version is 1.11.1. I believe that it is getting the > 1.9 version from Gluster ... > > How do I get a new Makefile.am and Makefile.in > to work in this structure? > > > > Cheers, > > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel
From ian.latter at midnightcode.org Wed May 9 06:05:54 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Wed, 09 May 2012 16:05:54 +1000 Subject: [Gluster-devel] automake Message-ID: <201205090605.q4965sPn010223@singularity.tronunltd.com> You're a champion. Thanks Harsha. ----- Original Message ----- >From: "Harshavardhana" >To: "Ian Latter" >Subject: Re: [Gluster-devel] automake >Date: Tue, 08 May 2012 23:03:00 -0700 > > Ian, > > Please re-run the ./autogen.sh and use again.
> > Make sure you have added entries in 'configure.ac' and 'Makefile.am' > for the respective module name and directory. > > -Harsha > > On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > > Hello, > > > > > > ?I have built a new module and I can't seem to > > get the changed makefiles to be built. ?I have not > > used "configure" in any of my projects and I'm > > not seeing an answer from my google searches. > > > > ?The error that I get is during the "make" where > > glusterfs-3.2.6/missing errors at line 52 > > "automake-1.9: command not found". > > > > ?This is a newer RedHat environment and it has > > automake 1.11 .. if I cp 1.11 to 1.9 I get other > > errors ... libtool is reporting that the automake > > version is 1.11.1. ?I believe that it is getting the > > 1.9 version from Gluster ... > > > > ?How do I get a new Makefile.am and Makefile.in > > to work in this structure? > > > > > > > > Cheers, > > > > > > > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Wed May 9 06:08:41 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 8 May 2012 23:08:41 -0700 Subject: [Gluster-devel] automake In-Reply-To: <201205090535.q495Ze5E009996@singularity.tronunltd.com> References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: You might want to read autobook for the general theory behind autotools. Here's a quick summary - aclocal prepares the running of autotools. autoheader prepares autotools to generate a config.h to be consumed by C code configure.ac is the "source" to discover the build system and accept user parameters autoconf converts configure.ac to configure Makefile.am is the "source" to define what is to be built and how. automake converts Makefile.am to Makefile.in till here everything is scripted in ./autogen.sh running configure creates Makefile out of Makefile.in now run make :) Avati On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > Hello, > > > I have built a new module and I can't seem to > get the changed makefiles to be built. I have not > used "configure" in any of my projects and I'm > not seeing an answer from my google searches. > > The error that I get is during the "make" where > glusterfs-3.2.6/missing errors at line 52 > "automake-1.9: command not found". > > This is a newer RedHat environment and it has > automake 1.11 .. if I cp 1.11 to 1.9 I get other > errors ... libtool is reporting that the automake > version is 1.11.1. I believe that it is getting the > 1.9 version from Gluster ... > > How do I get a new Makefile.am and Makefile.in > to work in this structure? > > > > Cheers, > > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From abperiasamy at gmail.com Wed May 9 07:21:35 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Wed, 9 May 2012 00:21:35 -0700 Subject: [Gluster-devel] automake In-Reply-To: References: <201205090535.q495Ze5E009996@singularity.tronunltd.com> Message-ID: On Tue, May 8, 2012 at 11:08 PM, Anand Avati wrote: > You might want to read autobook for the general theory behind autotools. > Here's a quick summary - > > aclocal prepares the running of autotools. > autoheader prepares autotools to generate a config.h to be consumed by C > code > configure.ac is the "source" to discover the build system and accept user > parameters > autoconf converts configure.ac to configure > Makefile.am is the "source" to define what is to be built and how. > automake converts Makefile.am to Makefile.in > > till here everything is scripted in ./autogen.sh > > running configure creates Makefile out of Makefile.in > > now run make :) > Best way to learn autotools is copy-paste-customize. In general, if you are starting a new project, Debian has a nice little tool called "autoproject". It will auto generate autoconf and automake files. Then you start customizing it. GNU project should really merge all these tools into one simple coherent system. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein
From abperiasamy at gmail.com Wed May 9 07:54:43 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Wed, 9 May 2012 00:54:43 -0700 Subject: [Gluster-devel] How to repair a 1TB disk in 30 mins In-Reply-To: <000301cd2d9e$a6b07fc0$f4117f40$@com> References: <000301cd2d9e$a6b07fc0$f4117f40$@com> Message-ID: On Tue, May 8, 2012 at 9:46 PM, 任强 wrote: > Dear All: > > I have a question. When I have a large cluster, maybe more than 10PB of data, > if a file has 3 copies and each disk has 1TB capacity, we need about > 30,000 disks. All disks are very cheap and are easily damaged. We must > repair a 1TB disk in 30 mins. As far as I know, in the gluster architecture, all > data on the damaged disk will be repaired to the new disk which is used to > replace the damaged disk. As a result of the writing speed of the disk, when we > repair a 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 > mins? 5 hours is based on a SATA 1TB disk copying at ~50MB/s across small and large files + folders. This means you literally attach the disk to the system and manually transfer the data. I can't think of any other faster way to transfer data on 1TB 7200RPM SATA/SAS disks without bending space-time ;). Larger disks and RAID arrays only make this worse. This is exactly why we implemented passive self-heal in the first place. GlusterFS heals files on demand (as they are accessed), so applications have the least down time or disruption. There is plenty of time to heal the cold data in the background. All we should care about is minimal down time. Self-heal in 3.3 has some major improvements. It got significantly faster, because healing is performed on the server side entirely (server to server). It can perform granular healing on large files (previously checksum operations used to pause or timeout the VMs). Active healing (Replicate now remembers pending files and heals them when the failed node comes back. Previously you had to perform a name-space wide recursive directory listing). Most importantly, self-healing is no longer a black box. heal-info can show pending and currently-healing files.
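For reference, the arithmetic behind the "more than 5 hours" estimate and the 30-minute target in the exchange above, taking the ~50MB/s figure and decimal units (1TB = 10^12 bytes):

\[
\frac{10^{12}\ \mathrm{B}}{5\times 10^{7}\ \mathrm{B/s}} = 2\times 10^{4}\ \mathrm{s} \approx 5.6\ \mathrm{h},
\qquad
\frac{10^{12}\ \mathrm{B}}{1800\ \mathrm{s}} \approx 556\ \mathrm{MB/s}.
\]

So finishing a full 1TB rebuild in 30 minutes implies sustaining roughly 550MB/s of writes to the replacement disk, about an order of magnitude beyond a single 7200RPM spindle, which is why the answer above leans on on-demand and background healing rather than a faster full copy.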
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein
From renqiang at 360buy.com Wed May 9 09:29:34 2012 From: renqiang at 360buy.com (任强) Date: Wed, 9 May 2012 17:29:34 +0800 Subject: [Gluster-devel] Re: How to repair a 1TB disk in 30 mins In-Reply-To: References: <000301cd2d9e$a6b07fc0$f4117f40$@com> Message-ID: <002601cd2dc6$3f68f4f0$be3aded0$@com> Thank you very much! And I have some questions: 1. What's the capacity of the largest cluster online? And how many nodes are in it? And what is it used for? 2. When we execute 'ls' in a directory, it's very slow, if the cluster has too many bricks and too many nodes. Can we do it well? -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 2012-05-09 15:55 To: renqiang Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] How to repair a 1TB disk in 30 mins On Tue, May 8, 2012 at 9:46 PM, 任强 wrote: > Dear All: > > I have a question. When I have a large cluster, maybe more than 10PB of data, > if a file has 3 copies and each disk has 1TB capacity, we need about > 30,000 disks. All disks are very cheap and are easily damaged. We must > repair a 1TB disk in 30 mins. As far as I know, in the gluster architecture, all > data on the damaged disk will be repaired to the new disk which is used to > replace the damaged disk. As a result of the writing speed of the disk, when we > repair a 1TB disk in gluster, we need more than 5 hours. Can we do it in 30 > mins? 5 hours is based on a SATA 1TB disk copying at ~50MB/s across small and large files + folders. This means you literally attach the disk to the system and manually transfer the data. I can't think of any other faster way to transfer data on 1TB 7200RPM SATA/SAS disks without bending space-time ;). Larger disks and RAID arrays only make this worse. This is exactly why we implemented passive self-heal in the first place. GlusterFS heals files on demand (as they are accessed), so applications have the least down time or disruption. There is plenty of time to heal the cold data in the background. All we should care about is minimal down time. Self-heal in 3.3 has some major improvements. It got significantly faster, because healing is performed on the server side entirely (server to server). It can perform granular healing on large files (previously checksum operations used to pause or timeout the VMs). Active healing (Replicate now remembers pending files and heals them when the failed node comes back. Previously you had to perform a name-space wide recursive directory listing). Most importantly, self-healing is no longer a black box. heal-info can show pending and currently-healing files. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein
From ian.latter at midnightcode.org Thu May 10 05:47:06 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Thu, 10 May 2012 15:47:06 +1000 Subject: [Gluster-devel] Hide Feature Message-ID: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> Hello, I have published an untested "hide" module (compiled against glusterfs-3.2.6); A simple method for hiding an underlying directory structure from parent/up-stream bricks within GlusterFS. In 2012 this code was spawned from my incomplete 2009 dedupe brick code which used this method to protect its internal hash database from the user, above.
http://midnightcode.org/projects/saturn/code/hide-0.5.tgz I am serious when I mean untested - I've not even loaded the module under Gluster, it simply compiles. Let me know if there are tweaks that should be made or considered. Enjoy. -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 05:55:55 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Thu, 10 May 2012 15:55:55 +1000 Subject: [Gluster-devel] Fuse operations Message-ID: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> Hello, I published the Hide module in order to open a discussion around Fuse operations; http://fuse.sourceforge.net/doxygen/structfuse__operations.html In the dedupe module I want to secure the hash database from direct parent/use manipulation. The approach that I took was to find every GlusterFS file operation (fop) that took a loc_t parameter (as discovered via every xlator that is included in the tarball), in order to do path matching and then pass-through the call or return an error. The problem is that I can't find GlusterFS examples for all of the Fuse operators and, when I stray from the examples (like getattr and utiments), gluster tells me that there are no such xlator fops (at compile time - from the wind and unwind macros). So, I guess; 1) Are all Fuse/FS ops handled by Gluster? 2) Where can I find a complete list of the Gluster fops, and not just those that have been used in existing modules? 3) Is it safe to path match on loc_t? (i.e. is it fully resolved such that I won't find /etc/././././passwd)? This I could test .. Thanks, -- Ian Latter Late night coder .. http://midnightcode.org/ From jdarcy at redhat.com Thu May 10 13:39:21 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 10 May 2012 09:39:21 -0400 Subject: [Gluster-devel] Hide Feature In-Reply-To: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> References: <201205100547.q4A5l6eH015066@singularity.tronunltd.com> Message-ID: <20120510093921.4a9f581a@jdarcy-dt.usersys.redhat.com> On Thu, 10 May 2012 15:47:06 +1000 "Ian Latter" wrote: > I have published an untested "hide" module (compiled > against glusterfs-3.2.6); > > A simple method for hiding an underlying directory > structure from parent/up-stream bricks within > GlusterFS. In 2012 this code was spawned from > my incomplete 2009 dedupe brick code which used > this method to protect its internal hash database > from the user, above. > > http://midnightcode.org/projects/saturn/code/hide-0.5.tgz > > > I am serious when I mean untested - I've not even > loaded the module under Gluster, it simply compiles. > > > Let me know if there are tweaks that should be made > or considered. A couple of comments: * It should be sufficient to fail lookup for paths that match your pattern. If that fails, the caller will never get to any others. You can use the quota translator as an example for something like this. * If you want to continue supporting this yourself, then you can just leave the code as it is, though in that case you'll want to consider building it "out of tree" as I describe in my "Translator 101" post[1] or do for some of my own translators[2]. Otherwise you'll need to submit it as a patch through Gerrit according to our standard workflow[3]. You'll also need to fix some of the idiosyncratic indentation. I don't remember the current policy wrt copyright assignment, but that might be required too. 
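Jeff's first point is worth spelling out in code, since it collapses most of the per-fop path matching described earlier in the thread into a single function. The sketch below is written from memory against the 3.2-era translator API (the same vintage as the published module) and has not been compiled; init/fini and option handling are omitted and the pattern is a placeholder, so treat the signatures as approximate and check them against defaults.c and the quota translator before reusing anything.

/* Sketch: hide entries whose path matches a pattern by failing
 * lookup.  Every other fop needs a resolved loc/inode that can only
 * come from a successful lookup, so the hidden path never reaches
 * them.  "pattern" would normally be read from the volfile into
 * this->private by init(), which is omitted here. */
#include <fnmatch.h>
#include "xlator.h"
#include "defaults.h"

int32_t
hide_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc,
             dict_t *xattr_req)
{
        const char *pattern = this->private;    /* e.g. "*/.hidden*" */

        if (pattern && loc && loc->path &&
            fnmatch (pattern, loc->path, 0) == 0) {
                /* Pretend the entry does not exist. */
                STACK_UNWIND_STRICT (lookup, frame, -1, ENOENT,
                                     NULL, NULL, NULL, NULL);
                return 0;
        }

        /* Not hidden: pass straight through, reusing the stock
         * callback instead of writing a trivial one. */
        STACK_WIND (frame, default_lookup_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->lookup,
                    loc, xattr_req);
        return 0;
}

struct xlator_fops fops = {
        .lookup = hide_lookup,
};

The design point is that resolution is the chokepoint: everything else operates on what lookup returned, so one well-placed refusal is cheaper and safer than pattern-matching loc_t in every fop.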
[1] http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/ [2] https://github.com/jdarcy/negative-lookup [3] http://www.gluster.org/community/documentation/index.php/Development_Work_Flow From jdarcy at redhat.com Thu May 10 13:58:51 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 10 May 2012 09:58:51 -0400 Subject: [Gluster-devel] Fuse operations In-Reply-To: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> References: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> Message-ID: <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> On Thu, 10 May 2012 15:55:55 +1000 "Ian Latter" wrote: > So, I guess; > 1) Are all Fuse/FS ops handled by Gluster? > 2) Where can I find a complete list of the > Gluster fops, and not just those that have > been used in existing modules? GlusterFS operations for a translator are all defined in an xlator_fops structure. When building translators, it can also be convenient to look at the default_xxx and default_xxx_cbk functions for each fop you implement. Also, I forgot to mention in my comments on your "hide" translator that you can often use the default_xxx_cbk callback when you call STACK_WIND, instead of having to define your own trivial one. FUSE operations are listed by the fuse_opcode enum. You can check for yourself how closely this matches our list. They do have a few ops of their own, we have a few of their own, and a few of theirs actually map to our xlator_cbks instead of xlator_fops. The points of non-correspondence seem to be interrupt, bmap, poll and ioctl. Maybe Csaba can elaborate on what we do (or plan to do) about these. > 3) Is it safe to path match on loc_t? (i.e. is > it fully resolved such that I won't find > /etc/././././passwd)? This I could test .. Name/path resolution is an area that has changed pretty recently, so I'll let Avati or Amar field that one. From anand.avati at gmail.com Thu May 10 19:36:26 2012 From: anand.avati at gmail.com (Anand Avati) Date: Thu, 10 May 2012 12:36:26 -0700 Subject: [Gluster-devel] Fuse operations In-Reply-To: <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> References: <201205100555.q4A5tt7u015109@singularity.tronunltd.com> <20120510095851.2f034889@jdarcy-dt.usersys.redhat.com> Message-ID: On Thu, May 10, 2012 at 6:58 AM, Jeff Darcy wrote: > On Thu, 10 May 2012 15:55:55 +1000 > "Ian Latter" wrote: > > > So, I guess; > > 1) Are all Fuse/FS ops handled by Gluster? > > 2) Where can I find a complete list of the > > Gluster fops, and not just those that have > > been used in existing modules? > > GlusterFS operations for a translator are all defined in an xlator_fops > structure. When building translators, it can also be convenient to > look at the default_xxx and default_xxx_cbk functions for each fop you > implement. Also, I forgot to mention in my comments on your "hide" > translator that you can often use the default_xxx_cbk callback when you > call STACK_WIND, instead of having to define your own trivial one. > > FUSE operations are listed by the fuse_opcode enum. You can check for > yourself how closely this matches our list. They do have a few ops of > their own, we have a few of their own, and a few of theirs actually map > to our xlator_cbks instead of xlator_fops. The points of > non-correspondence seem to be interrupt, bmap, poll and ioctl. Maybe > Csaba can elaborate on what we do (or plan to do) about these. > > We might support interrupt sometime. Bmap - probably never. Poll, maybe. Ioctl - depeneds on what type of ioctl and requirement. 
> > 3) Is it safe to path match on loc_t? (i.e. is > > it fully resolved such that I won't find > > /etc/././././passwd)? This I could test .. > > Name/path resolution is an area that has changed pretty recently, so > I'll let Avati or Amar field that one. > The ".." interpretation is done by the client side VFS. Internal path construction does not use ".." and are always normalized. There are new situations where we now support non-absolute paths, but those are for GFID based addressing and ".." does not come into picture there. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From johnmark at redhat.com Thu May 10 21:41:08 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 10 May 2012 17:41:08 -0400 (EDT) Subject: [Gluster-devel] Bugzilla upgrade & planned outage - May 22 In-Reply-To: Message-ID: Pasting an email from bugzilla-announce: Red Hat Bugzilla (bugzilla.redhat.com) will be unavailable on May 22nd starting at 6 p.m. EDT [2200 UTC] to perform an upgrade from Bugzilla 3.6 to Bugzilla 4.2. We are hoping to be complete in no more than 3 hours barring any problems. Any services relying on bugzilla.redhat.com may not work properly during this time. Please be aware in case you need use of those services during the outage. Also *PLEASE* make sure any scripts or other external applications that rely on bugzilla.redhat.com are tested against our test server before the upgrade if you have not done so already. Let the Bugzilla Team know immediately of any issues found by reporting the bug in bugzilla.redhat.com against the Bugzilla product, version 4.2. A summary of the RPC changes is also included below. RPC changes from upstream Bugzilla 4.2: - Bug.* returns arrays for components, versions and aliases - Bug.* returns target_release array - Bug.* returns flag information (from Bugzilla 4.4) - Bug.search supports searching on keywords, dependancies, blocks - Bug.search supports quick searches, saved searches and advanced searches - Group.get has been added - Component.* and Flag.* have been added - Product.get has a component_names option to return just the component names. RPC changes from Red Hat Bugzilla 3.6: - This list may be incomplete. - This list excludes upstream changes from 3.6 that we inherited - Bug.update calls may use different column names. For example, in 3.6 you updated the 'short_desc' key if you wanted to change the summary. Now you must use the 'summary' key. This may be an inconeniance, but will make it much more maintainable in the long run. - Bug.search_new new becomes Bug.search. The 3.6 version of Bug.search is no longer available. - Product.* has been changed to match upstream code - Group.create has been added - RedHat.* and bugzilla.* calls that mirror official RPC calls are officially depreciated, and will be removed approximately two months after Red Hat Bugzilla 4.2 is released. To test against the new beta Bugzilla server, go to https://partner-bugzilla.redhat.com/ Thanks, JM From ian.latter at midnightcode.org Thu May 10 22:25:02 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:25:02 +1000 Subject: [Gluster-devel] automake Message-ID: <201205102225.q4AMP2X2018428@singularity.tronunltd.com> Thanks Avati, Yes, when I said that I hadn't use "configure" I meant "autotools" (though I didn't know it :) I think almost every project I download and build from scratch uses configure .. the last time I looked at the autotools was a few years ago now, maybe its time for a re-look .. 
my libraries are getting big enough to warrant it I suppose. Hadn't seen autogen before .. thanks for your help. Cheers, ----- Original Message ----- >From: "Anand Avati" >To: "Ian Latter" >Subject: Re: [Gluster-devel] automake >Date: Tue, 08 May 2012 23:08:41 -0700 > > You might want to read autobook for the general theory behind autotools. > Here's a quick summary - > > aclocal prepares the running of autotools. > autoheader prepares autotools to generate a config.h to be consumed by C > code > configure.ac is the "source" to discover the build system and accept user > parameters > autoconf converts configure.ac to configure > Makefile.am is the "source" to define what is to be built and how. > automake converts Makefile.am to Makefile.in > > till here everything is scripted in ./autogen.sh > > running configure creates Makefile out of Makefile.in > > now run make :) > > Avati > > On Tue, May 8, 2012 at 10:35 PM, Ian Latter wrote: > > > Hello, > > > > > > I have built a new module and I can't seem to > > get the changed makefiles to be built. I have not > > used "configure" in any of my projects and I'm > > not seeing an answer from my google searches. > > > > The error that I get is during the "make" where > > glusterfs-3.2.6/missing errors at line 52 > > "automake-1.9: command not found". > > > > This is a newer RedHat environment and it has > > automake 1.11 .. if I cp 1.11 to 1.9 I get other > > errors ... libtool is reporting that the automake > > version is 1.11.1. I believe that it is getting the > > 1.9 version from Gluster ... > > > > How do I get a new Makefile.am and Makefile.in > > to work in this structure? > > > > > > > > Cheers, > > > > > > > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > > -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 22:26:22 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:26:22 +1000 Subject: [Gluster-devel] automake Message-ID: <201205102226.q4AMQMEC018461@singularity.tronunltd.com> > > You might want to read autobook for the general theory behind autotools. > > Here's a quick summary - > > > > aclocal prepares the running of autotools. > > autoheader prepares autotools to generate a config.h to be consumed by C > > code > > configure.ac is the "source" to discover the build system and accept user > > parameters > > autoconf converts configure.ac to configure > > Makefile.am is the "source" to define what is to be built and how. > > automake converts Makefile.am to Makefile.in > > > > till here everything is scripted in ./autogen.sh > > > > running configure creates Makefile out of Makefile.in > > > > now run make :) > > > > Best way to learn autotools is copy-paste-customize. In general, if > you are starting a new project, Debian has a nice little tool called > "autoproject". It will auto generate autoconf and automake files. Then > you start customizing it. > > GNU project should really merge all these tools in to one simple > coherent system. My build environment is Fedora but I'm assuming its there too .. if I get some time I'll have a poke around .. Thanks for the info, appreciate it. -- Ian Latter Late night coder .. 
http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 22:44:32 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 08:44:32 +1000 Subject: [Gluster-devel] Hide Feature Message-ID: <201205102244.q4AMiW2Z018543@singularity.tronunltd.com> Sorry for the re-send Jeff, I managed to screw up the CC so the list didn't get it; > > Let me know if there are tweaks that should be made > > or considered. > > A couple of comments: > > * It should be sufficient to fail lookup for paths that > match your pattern. If that fails, the caller will > never get to any others. You can use the quota > translator as an example for something like this. Ok, this is interesting. So if someone calls another fop .. say "open" ... against my brick/module, something (Fuse?) will make another, dependent, call to lookup first? If that's true then I can cut this all down to size. > * If you want to continue supporting this yourself, > then you can just leave the code as it is, though in > that case you'll want to consider building it "out of > tree" as I describe in my "Translator 101" post[1] > or do for some of my own translators[2]. > Otherwise you'll need to submit it as a patch > through Gerrit according to our standard workflow[3]. Thanks for the Translator articles/posts, I hadn't seen those. Per my previous patches, I'll publish code on my site under the GPL and you guys (Gluster/RedHat) can run them through whatever processes you choose. If it gets included in the GlusterFS package, then that's fine. If it gets ignored by the GlusterFS package, then that's fine also. > You'll also need to fix some of the idiosyncratic > indentation. I don't remember the current policy wrt > copyright assignment, but that might be required too. The weird indentation style used is not mine .. its what I gathered from the Gluster code that I read through. > [1] > http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/ > > [2] https://github.com/jdarcy/negative-lookup > > [3] > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 23:39:58 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 09:39:58 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205102339.q4ANdwg8018739@singularity.tronunltd.com> > > Sure, I have my own vol files that do (did) what I wanted > > and I was supporting myself (and users); the question > > (and the point) is what is the GlusterFS *intent*? > > > The "intent" (more or less - I hate to use the word as it can imply a > commitment to what I am about to say, but there isn't one) is to keep the > bricks (server process) dumb and have the intelligence on the client side. > This is a "rough goal". There are cases where replication on the server > side is inevitable (in the case of NFS access) but we keep the software > architecture undisturbed by running a client process on the server machine > to achieve it. [There's a difference between intent and plan/roadmap] Okay. Unfortunately I am unable to leverage this - I tried to serve a Fuse->GlusterFS client mount point (of a Distribute volume) as a GlusterFS posix brick (for a Replicate volume) and it wouldn't play ball .. > We do plan to support "replication on the server" in the future while still > retaining the existing software architecture as much as possible. 
This is > particularly useful in Hadoop environment where the jobs expect write > performance of a single copy and expect copy to happen in the background. > We have the proactive self-heal daemon running on the server machines now > (which again is a client process which happens to be physically placed on > the server) which gives us many interesting possibilities - i.e, with > simple changes where we fool the client side replicate translator at the > time of transaction initiation that only the closest server is up at that > point of time and write to it alone, and have the proactive self-heal > daemon perform the extra copies in the background. This would be consistent > with other readers as they get directed to the "right" version of the file > by inspecting the changelogs while the background replication is in > progress. > > The intention of the above example is to give a general sense of how we > want to evolve the architecture (i.e, the "intention" you were referring > to) - keep the clients intelligent and servers dumb. If some intelligence > needs to be built on the physical server, tackle it by loading a client > process there (there are also "pathinfo xattr" kind of internal techniques > to figure out locality of the clients in a generic way without bringing > "server sidedness" into them in a harsh way) Okay .. But what happened to the "brick" architecture of stacking anything on anything? I think you point that out here ... > I'll > > write an rsyncd wrapper myself, to run on top of Gluster, > > if the intent is not allow the configuration I'm after > > (arbitrary number of disks in one multi-host environment > > replicated to an arbitrary number of disks in another > > multi-host environment, where ideally each environment > > need not sum to the same data capacity, presented in a > > single contiguous consumable storage layer to an > > arbitrary number of unintelligent clients, that is as fault > > tolerant as I choose it to be including the ability to add > > and offline/online and remove storage as I so choose) .. > > or switch out the whole solution if Gluster is heading > > away from my needs. I just need to know what the > > direction is .. I may even be able to help get you there if > > you tell me :) > > > > > There are good and bad in both styles (distribute on top v/s replicate on > top). Replicate on top gives you much better flexibility of configuration. > Distribute on top is easier for us developers. As a user I would like > replicate on top as well. But the problem today is that replicate (and > self-heal) does not understand "partial failure" of its subvolumes. If one > of the subvolume of replicate is a distribute, then today's replicate only > understands complete failure of the distribute set or it assumes everything > is completely fine. An example is self-healing of directory entries. If a > file is "missing" in one subvolume because a distribute node is temporarily > down, replicate has no clue why it is missing (or that it should keep away > from attempting to self-heal). Along the same lines, it does not know that > once a server is taken off from its distribute subvolume for good that it > needs to start recreating missing files. Hmm. I loved the brick idea. I don't like perverting it by trying to "see through" layers. 
In that context I can see two or three expected outcomes from someone building this type of stack (heh: a quick trick brick stack) - when a distribute child disappears; At the Distribute layer; 1) The distribute name space / stat space remains in tact, though the content is obviously not avail. 2) The distribute presentation is pure and true of its constituents, showing only the names / stats that are online/avail. In its standalone case, 2 is probably preferable as it allows clean add/start/stop/ remove capacity. At the Replicate layer; 3) replication occurs only where the name / stat space shows a gap 4) the replication occurs at any delta I don't think there's a real choice here, even if 3 were sensible, what would replicate do if there was a local name and even just a remote file size change, when there's no local content to update; it must be 4. In which case, I would expect that a replicate on top of a distribute with a missing child would suddenly see a delta that it would immediately set about repairing. > The effort to fix this seems to be big enough to disturb the inertia of > status quo. If this is fixed, we can definitely adopt a replicate-on-top > mode in glusterd. I'm not sure why there needs to be a "fix" .. wasn't the previous behaviour sensible? Or, if there is something to "change", then bolstering the distribute module might be enough - a combination of 1 and 2 above. Try this out: what if the Distribute layer maintained a full name space on each child, and didn't allow "recreation"? Say 3 children, one is broken/offline, so that /path/to/child/3/file is missing but is known to be missing (internally to Distribute). Then the Distribute brick can both not show the name space to the parent layers, but can also actively prevent manipulation of those files (the parent can neither stat /path/to/child/3/file nor unlink, nor create/write to it). If this change is meant to be permanent, then the administrative act of removing the child from distribute will then truncate the locked name space, allowing parents (be they users or other bricks, like Replicate) to act as they please (such as recreating the missing files). If you adhere to the principles that I thought I understood from 2009 or so then you should be able to let the users create unforeseen Gluster architectures without fear or impact. I.e. i) each brick is fully self contained * ii) physical bricks are the bread of a brick stack sandwich ** iii) any logical brick can appear above/below any other logical brick in a brick stack * Not mandating a 1:1 file mapping from layer to layer ** Eg: the Posix (bottom), Client (bottom), Server (top) and NFS (top) are all regarded as physical bricks. Thus it was my expectation that a dedupe brick (being logical) could either go above or below a distribute brick (also logical), for example. Or that an encryption brick could go on top of replicate which was on top of encryption which was on top of distribute which was on top of encryption on top of posix, for example. Or .. am I over simplifying the problem space? -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Thu May 10 23:52:43 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Fri, 11 May 2012 09:52:43 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205102352.q4ANqhc6018790@singularity.tronunltd.com> Actually, I want to clarify this point; > But the problem today is that replicate (and > self-heal) does not understand "partial failure" > of its subvolumes. 
If one of the subvolume of > replicate is a distribute, then today's replicate > only understands complete failure of the > distribute set or it assumes everything is > completely fine. I haven't seen this in practice .. I have seen replicate attempt to repair anything that was "missing" and that both the replicate and the underlying bricks were still viable storage layers in that process ... ----- Original Message ----- >From: "Ian Latter" >To: "Anand Avati" >Subject: Re: [Gluster-devel] ZkFarmer >Date: Fri, 11 May 2012 09:39:58 +1000 > > > > Sure, I have my own vol files that do (did) what I wanted > > > and I was supporting myself (and users); the question > > > (and the point) is what is the GlusterFS *intent*? > > > > > > The "intent" (more or less - I hate to use the word as it > can imply a > > commitment to what I am about to say, but there isn't one) > is to keep the > > bricks (server process) dumb and have the intelligence on > the client side. > > This is a "rough goal". There are cases where replication > on the server > > side is inevitable (in the case of NFS access) but we keep > the software > > architecture undisturbed by running a client process on > the server machine > > to achieve it. > > [There's a difference between intent and plan/roadmap] > > Okay. Unfortunately I am unable to leverage this - I tried > to serve a Fuse->GlusterFS client mount point (of a > Distribute volume) as a GlusterFS posix brick (for a > Replicate volume) and it wouldn't play ball .. > > > We do plan to support "replication on the server" in the > future while still > > retaining the existing software architecture as much as > possible. This is > > particularly useful in Hadoop environment where the jobs > expect write > > performance of a single copy and expect copy to happen in > the background. > > We have the proactive self-heal daemon running on the > server machines now > > (which again is a client process which happens to be > physically placed on > > the server) which gives us many interesting possibilities > - i.e, with > > simple changes where we fool the client side replicate > translator at the > > time of transaction initiation that only the closest > server is up at that > > point of time and write to it alone, and have the > proactive self-heal > > daemon perform the extra copies in the background. This > would be consistent > > with other readers as they get directed to the "right" > version of the file > > by inspecting the changelogs while the background > replication is in > > progress. > > > > The intention of the above example is to give a general > sense of how we > > want to evolve the architecture (i.e, the "intention" you > were referring > > to) - keep the clients intelligent and servers dumb. If > some intelligence > > needs to be built on the physical server, tackle it by > loading a client > > process there (there are also "pathinfo xattr" kind of > internal techniques > > to figure out locality of the clients in a generic way > without bringing > > "server sidedness" into them in a harsh way) > > Okay .. But what happened to the "brick" architecture > of stacking anything on anything? I think you point > that out here ... 
> > > > I'll > > > write an rsyncd wrapper myself, to run on top of Gluster, > > > if the intent is not allow the configuration I'm after > > > (arbitrary number of disks in one multi-host environment > > > replicated to an arbitrary number of disks in another > > > multi-host environment, where ideally each environment > > > need not sum to the same data capacity, presented in a > > > single contiguous consumable storage layer to an > > > arbitrary number of unintelligent clients, that is as fault > > > tolerant as I choose it to be including the ability to add > > > and offline/online and remove storage as I so choose) .. > > > or switch out the whole solution if Gluster is heading > > > away from my needs. I just need to know what the > > > direction is .. I may even be able to help get you there if > > > you tell me :) > > > > > > > > There are good and bad in both styles (distribute on top > v/s replicate on > > top). Replicate on top gives you much better flexibility > of configuration. > > Distribute on top is easier for us developers. As a user I > would like > > replicate on top as well. But the problem today is that > replicate (and > > self-heal) does not understand "partial failure" of its > subvolumes. If one > > of the subvolume of replicate is a distribute, then > today's replicate only > > understands complete failure of the distribute set or it > assumes everything > > is completely fine. An example is self-healing of > directory entries. If a > > file is "missing" in one subvolume because a distribute > node is temporarily > > down, replicate has no clue why it is missing (or that it > should keep away > > from attempting to self-heal). Along the same lines, it > does not know that > > once a server is taken off from its distribute subvolume > for good that it > > needs to start recreating missing files. > > Hmm. I loved the brick idea. I don't like perverting it by > trying to "see through" layers. In that context I can see > two or three expected outcomes from someone building > this type of stack (heh: a quick trick brick stack) - when > a distribute child disappears; > > At the Distribute layer; > 1) The distribute name space / stat space > remains in tact, though the content is > obviously not avail. > 2) The distribute presentation is pure and true > of its constituents, showing only the names > / stats that are online/avail. > > In its standalone case, 2 is probably > preferable as it allows clean add/start/stop/ > remove capacity. > > At the Replicate layer; > 3) replication occurs only where the name / > stat space shows a gap > 4) the replication occurs at any delta > > I don't think there's a real choice here, even > if 3 were sensible, what would replicate do if > there was a local name and even just a remote > file size change, when there's no local content > to update; it must be 4. > > In which case, I would expect that a replicate > on top of a distribute with a missing child would > suddenly see a delta that it would immediately > set about repairing. > > > > The effort to fix this seems to be big enough to disturb > the inertia of > > status quo. If this is fixed, we can definitely adopt a > replicate-on-top > > mode in glusterd. > > I'm not sure why there needs to be a "fix" .. wasn't > the previous behaviour sensible? > > Or, if there is something to "change", then > bolstering the distribute module might be enough - > a combination of 1 and 2 above. 
> > Try this out: what if the Distribute layer maintained > a full name space on each child, and didn't allow > "recreation"? Say 3 children, one is broken/offline, > so that /path/to/child/3/file is missing but is known > to be missing (internally to Distribute). Then the > Distribute brick can both not show the name > space to the parent layers, but can also actively > prevent manipulation of those files (the parent > can neither stat /path/to/child/3/file nor unlink, nor > create/write to it). If this change is meant to be > permanent, then the administrative act of > removing the child from distribute will then > truncate the locked name space, allowing parents > (be they users or other bricks, like Replicate) to > act as they please (such as recreating the > missing files). > > If you adhere to the principles that I thought I > understood from 2009 or so then you should be > able to let the users create unforeseen Gluster > architectures without fear or impact. I.e. > > i) each brick is fully self contained * > ii) physical bricks are the bread of a brick > stack sandwich ** > iii) any logical brick can appear above/below > any other logical brick in a brick stack > > * Not mandating a 1:1 file mapping from layer > to layer > > ** Eg: the Posix (bottom), Client (bottom), > Server (top) and NFS (top) are all > regarded as physical bricks. > > Thus it was my expectation that a dedupe brick > (being logical) could either go above or below > a distribute brick (also logical), for example. > > Or that an encryption brick could go on top > of replicate which was on top of encryption > which was on top of distribute which was on > top of encryption on top of posix, for example. > > > Or .. am I over simplifying the problem space? > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From vbellur at redhat.com Fri May 11 07:06:38 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Fri, 11 May 2012 12:36:38 +0530 Subject: [Gluster-devel] release-3.3 branched out Message-ID: <4FACBA7E.6090801@redhat.com> A new branch release-3.3 has been created. You can checkout the branch via: $git checkout -b release-3.3 origin/release-3.3 rfc.sh has been updated to send patches to the appropriate branch. The plan is to have all 3.3.x releases happen off this branch. If you need any fix to be part of a 3.3.x release, please send out a backport of the same from master to release-3.3 after it has been accepted in master. Thanks, Vijay From manu at netbsd.org Fri May 11 07:29:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 11 May 2012 07:29:20 +0000 Subject: [Gluster-devel] release-3.3 branched out In-Reply-To: <4FACBA7E.6090801@redhat.com> References: <4FACBA7E.6090801@redhat.com> Message-ID: <20120511072920.GG18684@homeworld.netbsd.org> On Fri, May 11, 2012 at 12:36:38PM +0530, Vijay Bellur wrote: > A new branch release-3.3 has been created. You can checkout the branch via: Any chance someone merge my build fixes so that I can pullup to the new branch? 
http://review.gluster.com/3238 -- Emmanuel Dreyfus manu at netbsd.org From vbellur at redhat.com Fri May 11 07:43:13 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Fri, 11 May 2012 13:13:13 +0530 Subject: [Gluster-devel] release-3.3 branched out In-Reply-To: <20120511072920.GG18684@homeworld.netbsd.org> References: <4FACBA7E.6090801@redhat.com> <20120511072920.GG18684@homeworld.netbsd.org> Message-ID: <4FACC311.5020708@redhat.com> On 05/11/2012 12:59 PM, Emmanuel Dreyfus wrote: > On Fri, May 11, 2012 at 12:36:38PM +0530, Vijay Bellur wrote: >> A new branch release-3.3 has been created. You can checkout the branch via: > Any chance someone merge my build fixes so that I can pullup to the > new branch? > http://review.gluster.com/3238 Merged to master. Vijay From vijay at build.gluster.com Fri May 11 10:35:24 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Fri, 11 May 2012 03:35:24 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa41 released Message-ID: <20120511103527.5809B18009D@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa41/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa41.tar.gz This release is made off v3.3.0qa41 From 7220022 at gmail.com Sat May 12 15:22:57 2012 From: 7220022 at gmail.com (7220022) Date: Sat, 12 May 2012 19:22:57 +0400 Subject: [Gluster-devel] Gluster VSA for VMware ESX Message-ID: <012701cd3053$1d2e6110$578b2330$@gmail.com> Would love to test performance of Gluster Virtual Storage Appliance for VMware, but cannot get the demo. Emails and calls to Red Hat went unanswered. We've built a nice test system for the cluster at our lab, 8 modern servers running ESX4.1 and connected via 40gb InfiniBand fabric. Each server has 24 2.5" drives, SLC SSD and 10K SAS HDD-s connected to 6 LSI controllers with CacheCade (Pro 2.0 with write cache enabled,) 4 drives per controller. The plan is to test performance using bricks made of HDD-s cached with SSD-s, as well as HDD-s and SSD-s separately. Can anyone help getting the demo version of VSA? It's fine if it's a beta version, we just wanted to check the performance and scalability. -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Sun May 13 08:27:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 13 May 2012 10:27:20 +0200 Subject: [Gluster-devel] buffer corruption in io-stats Message-ID: <1kk12tm.1awqq7kf1joseM%manu@netbsd.org> I get a reproductible SIGSEGV with sources from latest git. iosfd is overwritten by the file path, it seems there is a confusion somewhere between iosfd->filename pointer value and pointed buffer (gdb) bt #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb37a7 in __gf_free (free_ptr=0x74656e2f) at mem-pool.c:258 #3 0xb9a85378 in io_stats_release (this=0xba3e3000, fd=0xb8f9d098) at io-stats.c:2420 #4 0xbbbafcc0 in fd_destroy (fd=0xb8f9d098) at fd.c:507 #5 0xbbbafdf8 in fd_unref (fd=0xb8f9d098) at fd.c:543 #6 0xbbbaf7cf in gf_fdptr_put (fdtable=0xbb77d070, fd=0xb8f9d098) at fd.c:393 #7 0xbb821147 in fuse_release () from /usr/local/lib/glusterfs/3git/xlator/mount/fuse.so #8 0xbb82a2e1 in fuse_thread_proc () from /usr/local/lib/glusterfs/3git/xlator/mount/fuse.so (gdb) frame 3 #3 0xb9a85378 in io_stats_release (this=0xba3e3000, fd=0xb8f9d098) at io-stats.c:2420 2420 GF_FREE (iosfd->filename); (gdb) print *iosfd $2 = {filename = 0x74656e2f
, data_written = 3418922014271107938, data_read = 7813586423313035891, block_count_write = {4788563690262784356, 3330756270057407571, 7074933154630937908, 28265, 0 }, block_count_read = { 0 }, opened_at = {tv_sec = 1336897011, tv_usec = 145734}} (gdb) x/10s iosfd 0xbb70f800: "/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin" -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 13 14:42:45 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 13 May 2012 16:42:45 +0200 Subject: [Gluster-devel] python version Message-ID: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> Hi There is a problem with python version detection in the configure script. The machine on which autotools is ran prior releasing glusterfs expands AM_PATH_PYTHON into a script that fails to accept python > 2.4. As I understand, a solution is to concatenate latest automake-1.12/m4/python.m4 into glusterfs' aclocal.m4. That way python up to 3.1 shoul be accepted. Opinions? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From renqiang at 360buy.com Mon May 14 01:20:32 2012 From: renqiang at 360buy.com (=?gb2312?B?yM7Hvw==?=) Date: Mon, 14 May 2012 09:20:32 +0800 Subject: [Gluster-devel] balance stoped Message-ID: <018001cd316f$c25a6f90$470f4eb0$@com> Hi,All! May I ask you a question? When we do balance on a volume, it stopped when moving the 505th?s file 0f 1006 files. Now we cannot restart it and also cannot cancel it. How can I do, please? -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.latter at midnightcode.org Mon May 14 01:22:43 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 11:22:43 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> Hello, I'm looking for a seek (lseek) implementation in one of the modules and I can't see one. Do I need to care about seeking if my module changes the file size (i.e. compresses) in Gluster? I would have thought that I did except that I believe that what I'm reading is that Gluster returns a NONSEEKABLE flag on file open (fuse_kernel.h at line 149). Does this mitigate the need to correct the user seeks? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 07:48:17 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 09:48:17 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> References: <201205140122.q4E1MhU8000317@singularity.tronunltd.com> Message-ID: <4FB0B8C1.4020908@datalab.es> Hello Ian, there is no such thing as an explicit seek in glusterfs. Each readv, writev, (f)truncate and rchecksum have an offset parameter that tells you the position where the operation must be performed. If you make something that changes the size of the file you must make it in a way that it is transparent to upper translators. This means that all offsets you will receive are "real" (in your case, offsets in the uncompressed version of the file). You should calculate in some way the equivalent offset in the compressed version of the file and send it to the correspoding fop of the lower translators. In the same way, you must return in all iatt structures the real size of the file (not the compressed size). I'm not sure what is the intended use of NONSEEKABLE, but I think it is for special file types, like devices or similar that are sequential in nature. 
Anyway, this is a fuse flag that you can't return from a regular translator open fop. Xavi On 05/14/2012 03:22 AM, Ian Latter wrote: > Hello, > > > I'm looking for a seek (lseek) implementation in > one of the modules and I can't see one. > > Do I need to care about seeking if my module > changes the file size (i.e. compresses) in Gluster? > I would have thought that I did except that I believe > that what I'm reading is that Gluster returns a > NONSEEKABLE flag on file open (fuse_kernel.h at > line 149). Does this mitigate the need to correct > the user seeks? > > > Cheers, > > > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Mon May 14 09:51:59 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 19:51:59 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205140951.q4E9px5H001754@singularity.tronunltd.com> Hello Xavi, Ok - thanks. I was hoping that this was how read and write were working (i.e. with absolute offsets and not just getting relative offsets from the current seek point), however what of the raw seek command? len = lseek(fd, 0, SEEK_END); Upon successful completion, lseek() returns the resulting offset location as measured in bytes from the beginning of the file. Any idea on where the return value comes from? I will need to fake up a file size for this command .. ----- Original Message ----- >From: "Xavier Hernandez" >To: >Subject: Re: [Gluster-devel] lseek >Date: Mon, 14 May 2012 09:48:17 +0200 > > Hello Ian, > > there is no such thing as an explicit seek in glusterfs. Each readv, > writev, (f)truncate and rchecksum have an offset parameter that tells > you the position where the operation must be performed. > > If you make something that changes the size of the file you must make it > in a way that it is transparent to upper translators. This means that > all offsets you will receive are "real" (in your case, offsets in the > uncompressed version of the file). You should calculate in some way the > equivalent offset in the compressed version of the file and send it to > the correspoding fop of the lower translators. > > In the same way, you must return in all iatt structures the real size of > the file (not the compressed size). > > I'm not sure what is the intended use of NONSEEKABLE, but I think it is > for special file types, like devices or similar that are sequential in > nature. Anyway, this is a fuse flag that you can't return from a regular > translator open fop. > > Xavi > > On 05/14/2012 03:22 AM, Ian Latter wrote: > > Hello, > > > > > > I'm looking for a seek (lseek) implementation in > > one of the modules and I can't see one. > > > > Do I need to care about seeking if my module > > changes the file size (i.e. compresses) in Gluster? > > I would have thought that I did except that I believe > > that what I'm reading is that Gluster returns a > > NONSEEKABLE flag on file open (fuse_kernel.h at > > line 149). Does this mitigate the need to correct > > the user seeks? > > > > > > Cheers, > > > > > > > > -- > > Ian Latter > > Late night coder .. 
> > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 10:29:54 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 12:29:54 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205140951.q4E9px5H001754@singularity.tronunltd.com> References: <201205140951.q4E9px5H001754@singularity.tronunltd.com> Message-ID: <4FB0DEA2.3030805@datalab.es> Hello Ian, lseek calls are handled internally by the kernel and they never reach the user land for fuse calls. lseek only updates the current file offset that is stored inside the kernel file's structure. This value is what is passed to read/write fuse calls as an absolute offset. There isn't any problem in this behavior as long as you hide all size manipulations from fuse. If you write a translator that compresses a file, you should do so in a transparent manner. This means, basically, that: 1. Whenever you are asked to return the file size, you must return the size of the uncompressed file 2. Whenever you receive an offset, you must translate that offset to the corresponding offset in the compressed file and work with that 3. Whenever you are asked to read or write data, you must return the number of uncompressed bytes read or written (even if you have compressed the chunk of data to a smaller size and you have physically written less bytes). 4. All read requests must return uncompressed data (this seems obvious though) This guarantees that your manipulations are not seen in any way by any upper translator or even fuse, thus everything should work smoothly. If you respect these rules, lseek (and your translator) will work as expected. In particular, when a user calls lseek with SEEK_END, the kernel takes the size of the file from the internal kernel inode's structure. This size is obtained through a previous call to lookup or updated using the result of write operations. If you respect points 1 and 3, this value will be correct. In gluster there are a lot of fops that return a iatt structure. You must guarantee that all these functions return the correct size of the file in the field ia_size to be sure that everything works as expected. Xavi On 05/14/2012 11:51 AM, Ian Latter wrote: > Hello Xavi, > > > Ok - thanks. I was hoping that this was how read > and write were working (i.e. with absolute offsets > and not just getting relative offsets from the current > seek point), however what of the raw seek > command? > > len = lseek(fd, 0, SEEK_END); > > Upon successful completion, lseek() returns > the resulting offset location as measured in > bytes from the beginning of the file. > > Any idea on where the return value comes from? > I will need to fake up a file size for this command .. > > > > ----- Original Message ----- >> From: "Xavier Hernandez" >> To: >> Subject: Re: [Gluster-devel] lseek >> Date: Mon, 14 May 2012 09:48:17 +0200 >> >> Hello Ian, >> >> there is no such thing as an explicit seek in glusterfs. > Each readv, >> writev, (f)truncate and rchecksum have an offset parameter > that tells >> you the position where the operation must be performed. 
>> >> If you make something that changes the size of the file > you must make it >> in a way that it is transparent to upper translators. This > means that >> all offsets you will receive are "real" (in your case, > offsets in the >> uncompressed version of the file). You should calculate in > some way the >> equivalent offset in the compressed version of the file > and send it to >> the correspoding fop of the lower translators. >> >> In the same way, you must return in all iatt structures > the real size of >> the file (not the compressed size). >> >> I'm not sure what is the intended use of NONSEEKABLE, but > I think it is >> for special file types, like devices or similar that are > sequential in >> nature. Anyway, this is a fuse flag that you can't return > from a regular >> translator open fop. >> >> Xavi >> >> On 05/14/2012 03:22 AM, Ian Latter wrote: >>> Hello, >>> >>> >>> I'm looking for a seek (lseek) implementation in >>> one of the modules and I can't see one. >>> >>> Do I need to care about seeking if my module >>> changes the file size (i.e. compresses) in Gluster? >>> I would have thought that I did except that I believe >>> that what I'm reading is that Gluster returns a >>> NONSEEKABLE flag on file open (fuse_kernel.h at >>> line 149). Does this mitigate the need to correct >>> the user seeks? >>> >>> >>> Cheers, >>> >>> >>> >>> -- >>> Ian Latter >>> Late night coder .. >>> http://midnightcode.org/ >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at nongnu.org >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at nongnu.org >> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Mon May 14 11:18:22 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Mon, 14 May 2012 21:18:22 +1000 Subject: [Gluster-devel] lseek Message-ID: <201205141118.q4EBIMku002113@singularity.tronunltd.com> Hello Xavier, I don't have a problem with the principles, these were effectively how I was traveling (the notable difference is statfs which I want to pass-through unaffected, reporting the true file system capacity such that a du [stat] may sum to a greater value than a df [statfs]). In 2009 I had a mostly- functional hashing write function and a dubious read function (I stumbled when I had to open a file from within a fop). But I think what you're telling/showing me is that I have no deep understanding of the mapping of the system calls to their Fuse->Gluster fops - which is expected :) And, this is a better outcome than learning that Gluster has gaps in its framework with regard to my objective. I.e. I didn't know that lseek mapped to lookup. And the examples aren't comprehensive enough (rot-13 is the only one that really manipulates content, and it only plays with read and write, obviously because it has a 1:1 relationship with the data). This is the key, and not something that I was expecting; > In gluster there are a lot of fops that return a iatt > structure. You must guarantee that all these > functions return the correct size of the file in > the field ia_size to be sure that everything works > as expected. 
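If I've understood that correctly, then each of those fops needs a callback
along these lines. This is only a sketch of what I think is required: the
translator name ("shrink"), the shrink_logical_size() helper and even the
exact fop/cbk signatures are my own assumptions from a quick read of the
3.3-era headers, not working code:

/* Sketch only: a hypothetical "shrink" translator fixing up ia_size in
 * its stat callback.  shrink_logical_size() is an assumed helper that
 * would look up the uncompressed size the translator tracks for this
 * inode (say, from an xattr or the inode context); here it is a stub. */

#include "xlator.h"

static uint64_t
shrink_logical_size (xlator_t *this, struct iatt *buf)
{
        /* A real implementation would map the on-disk (compressed) size
         * to the logical (uncompressed) one; this stub changes nothing. */
        return buf->ia_size;
}

int32_t
shrink_stat_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 int32_t op_ret, int32_t op_errno, struct iatt *buf,
                 dict_t *xdata)
{
        /* Report the logical size upwards, never the stored size. */
        if (op_ret == 0 && buf)
                buf->ia_size = shrink_logical_size (this, buf);

        STACK_UNWIND_STRICT (stat, frame, op_ret, op_errno, buf, xdata);
        return 0;
}

int32_t
shrink_stat (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
{
        STACK_WIND (frame, shrink_stat_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->stat,
                    loc, xdata);
        return 0;
}

Presumably the same pattern has to be repeated for lookup, fstat, writev,
(f)truncate, (f)setattr and every other fop whose callback carries a
struct iatt.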
I'll do my best to build a comprehensive list of iatt returning fops from the examples ... but I'd say it'll take a solid peer review to get this hammered out properly. Thanks for steering me straight Xavi, appreciate it. ----- Original Message ----- >From: "Xavier Hernandez" >To: "Ian Latter" >Subject: Re: [Gluster-devel] lseek >Date: Mon, 14 May 2012 12:29:54 +0200 > > Hello Ian, > > lseek calls are handled internally by the kernel and they never reach > the user land for fuse calls. lseek only updates the current file offset > that is stored inside the kernel file's structure. This value is what is > passed to read/write fuse calls as an absolute offset. > > There isn't any problem in this behavior as long as you hide all size > manipulations from fuse. If you write a translator that compresses a > file, you should do so in a transparent manner. This means, basically, that: > > 1. Whenever you are asked to return the file size, you must return the > size of the uncompressed file > 2. Whenever you receive an offset, you must translate that offset to the > corresponding offset in the compressed file and work with that > 3. Whenever you are asked to read or write data, you must return the > number of uncompressed bytes read or written (even if you have > compressed the chunk of data to a smaller size and you have physically > written less bytes). > 4. All read requests must return uncompressed data (this seems obvious > though) > > This guarantees that your manipulations are not seen in any way by any > upper translator or even fuse, thus everything should work smoothly. > > If you respect these rules, lseek (and your translator) will work as > expected. > > In particular, when a user calls lseek with SEEK_END, the kernel takes > the size of the file from the internal kernel inode's structure. This > size is obtained through a previous call to lookup or updated using the > result of write operations. If you respect points 1 and 3, this value > will be correct. > > In gluster there are a lot of fops that return a iatt structure. You > must guarantee that all these functions return the correct size of the > file in the field ia_size to be sure that everything works as expected. > > Xavi > > On 05/14/2012 11:51 AM, Ian Latter wrote: > > Hello Xavi, > > > > > > Ok - thanks. I was hoping that this was how read > > and write were working (i.e. with absolute offsets > > and not just getting relative offsets from the current > > seek point), however what of the raw seek > > command? > > > > len = lseek(fd, 0, SEEK_END); > > > > Upon successful completion, lseek() returns > > the resulting offset location as measured in > > bytes from the beginning of the file. > > > > Any idea on where the return value comes from? > > I will need to fake up a file size for this command .. > > > > > > > > ----- Original Message ----- > >> From: "Xavier Hernandez" > >> To: > >> Subject: Re: [Gluster-devel] lseek > >> Date: Mon, 14 May 2012 09:48:17 +0200 > >> > >> Hello Ian, > >> > >> there is no such thing as an explicit seek in glusterfs. > > Each readv, > >> writev, (f)truncate and rchecksum have an offset parameter > > that tells > >> you the position where the operation must be performed. > >> > >> If you make something that changes the size of the file > > you must make it > >> in a way that it is transparent to upper translators. This > > means that > >> all offsets you will receive are "real" (in your case, > > offsets in the > >> uncompressed version of the file). 
You should calculate in > > some way the > >> equivalent offset in the compressed version of the file > > and send it to > >> the correspoding fop of the lower translators. > >> > >> In the same way, you must return in all iatt structures > > the real size of > >> the file (not the compressed size). > >> > >> I'm not sure what is the intended use of NONSEEKABLE, but > > I think it is > >> for special file types, like devices or similar that are > > sequential in > >> nature. Anyway, this is a fuse flag that you can't return > > from a regular > >> translator open fop. > >> > >> Xavi > >> > >> On 05/14/2012 03:22 AM, Ian Latter wrote: > >>> Hello, > >>> > >>> > >>> I'm looking for a seek (lseek) implementation in > >>> one of the modules and I can't see one. > >>> > >>> Do I need to care about seeking if my module > >>> changes the file size (i.e. compresses) in Gluster? > >>> I would have thought that I did except that I believe > >>> that what I'm reading is that Gluster returns a > >>> NONSEEKABLE flag on file open (fuse_kernel.h at > >>> line 149). Does this mitigate the need to correct > >>> the user seeks? > >>> > >>> > >>> Cheers, > >>> > >>> > >>> > >>> -- > >>> Ian Latter > >>> Late night coder .. > >>> http://midnightcode.org/ > >>> > >>> _______________________________________________ > >>> Gluster-devel mailing list > >>> Gluster-devel at nongnu.org > >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel > >> _______________________________________________ > >> Gluster-devel mailing list > >> Gluster-devel at nongnu.org > >> https://lists.nongnu.org/mailman/listinfo/gluster-devel > >> > > > > -- > > Ian Latter > > Late night coder .. > > http://midnightcode.org/ > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Mon May 14 11:47:10 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Mon, 14 May 2012 13:47:10 +0200 Subject: [Gluster-devel] lseek In-Reply-To: <201205141118.q4EBIMku002113@singularity.tronunltd.com> References: <201205141118.q4EBIMku002113@singularity.tronunltd.com> Message-ID: <4FB0F0BE.9030009@datalab.es> Hello Ian, I didn't thought in statfs. In this special case things are a bit harder for a compression translator. I think it's impossible to return accurate data without a considerable amount of work. Maybe some estimation of the available space based on the current achieved mean compression ratio would be sufficient, but never accurate. With more work you could even be able to say exactly how much space have been used, but the best you can do with the remaining space is an estimation. Regarding lseek, there isn't a map with lookup. Probably I haven't explained it as well as I wanted. There are basically two kinds of user mode calls. Those that use a string containing a filename to operate with (stat, unlink, open, creat, ...), and those that use a file descriptor (fstat, read, write, ...). The kernel does not work with names to handle files, so it has to translate the names to inodes to work with them. This means that any call that uses a string will need to make a "lookup" to get the associated inode (the only exception is creat, that creates a new inode without using lookup). This means that every filename based operation can generate a lookup request (although some caching mechanism may reduce the number of calls). 
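A tiny user space example may make the distinction clearer (the mount
point and file name below are just placeholders):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main (void)
{
        struct stat st;
        off_t       end;
        int         fd;

        /* Name based call: the kernel has to resolve the path first, and
         * that resolution is what reaches glusterfs as a lookup request. */
        if (stat ("/mnt/gluster/file.txt", &st) == 0)
                printf ("size from stat():  %lld\n", (long long) st.st_size);

        /* Descriptor based calls: open() resolved the name once (one
         * lookup); after that the descriptor is bound to the inode, so
         * fstat() and lseek() do not generate further lookups.  lseek
         * with SEEK_END is answered by the kernel from the size it
         * already holds for the inode. */
        fd = open ("/mnt/gluster/file.txt", O_RDONLY);
        if (fd >= 0) {
                end = lseek (fd, 0, SEEK_END);
                printf ("size from lseek(): %lld\n", (long long) end);
                close (fd);
        }

        return 0;
}

Both values agree only if the size the kernel has cached for the inode is
the real one, which is again why returning the correct ia_size from lookup
and from the write callbacks is so important.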
All operations that work with a file descriptor do not generate a lookup request, because the file descriptor is already bound to an inode. In your particular case, to do an lseek you must have made a previous call to open (that would have generated a lookup request) or creat. Hope this better explains how kernel and gluster are bound... Xavi On 05/14/2012 01:18 PM, Ian Latter wrote: > Hello Xavier, > > > I don't have a problem with the principles, these > were effectively how I was traveling (the notable > difference is statfs which I want to pass-through > unaffected, reporting the true file system capacity > such that a du [stat] may sum to a greater value > than a df [statfs]). In 2009 I had a mostly- > functional hashing write function and a dubious > read function (I stumbled when I had to open a > file from within a fop). > > But I think what you're telling/showing me is that > I have no deep understanding of the mapping of > the system calls to their Fuse->Gluster fops - > which is expected :) And, this is a better outcome > than learning that Gluster has gaps in its > framework with regard to my objective. I.e. I > didn't know that lseek mapped to lookup. And > the examples aren't comprehensive enough > (rot-13 is the only one that really manipulates > content, and it only plays with read and write, > obviously because it has a 1:1 relationship with > the data). > > This is the key, and not something that I was > expecting; > >> In gluster there are a lot of fops that return a iatt >> structure. You must guarantee that all these >> functions return the correct size of the file in >> the field ia_size to be sure that everything works >> as expected. > I'll do my best to build a comprehensive list of iatt > returning fops from the examples ... but I'd say it'll > take a solid peer review to get this hammered out > properly. > > Thanks for steering me straight Xavi, appreciate > it. > > > > ----- Original Message ----- >> From: "Xavier Hernandez" >> To: "Ian Latter" >> Subject: Re: [Gluster-devel] lseek >> Date: Mon, 14 May 2012 12:29:54 +0200 >> >> Hello Ian, >> >> lseek calls are handled internally by the kernel and they > never reach >> the user land for fuse calls. lseek only updates the > current file offset >> that is stored inside the kernel file's structure. This > value is what is >> passed to read/write fuse calls as an absolute offset. >> >> There isn't any problem in this behavior as long as you > hide all size >> manipulations from fuse. If you write a translator that > compresses a >> file, you should do so in a transparent manner. This > means, basically, that: >> 1. Whenever you are asked to return the file size, you > must return the >> size of the uncompressed file >> 2. Whenever you receive an offset, you must translate that > offset to the >> corresponding offset in the compressed file and work with that >> 3. Whenever you are asked to read or write data, you must > return the >> number of uncompressed bytes read or written (even if you > have >> compressed the chunk of data to a smaller size and you > have physically >> written less bytes). >> 4. All read requests must return uncompressed data (this > seems obvious >> though) >> >> This guarantees that your manipulations are not seen in > any way by any >> upper translator or even fuse, thus everything should work > smoothly. >> If you respect these rules, lseek (and your translator) > will work as >> expected. 
>> >> In particular, when a user calls lseek with SEEK_END, the > kernel takes >> the size of the file from the internal kernel inode's > structure. This >> size is obtained through a previous call to lookup or > updated using the >> result of write operations. If you respect points 1 and 3, > this value >> will be correct. >> >> In gluster there are a lot of fops that return a iatt > structure. You >> must guarantee that all these functions return the correct > size of the >> file in the field ia_size to be sure that everything works > as expected. >> Xavi >> >> On 05/14/2012 11:51 AM, Ian Latter wrote: >>> Hello Xavi, >>> >>> >>> Ok - thanks. I was hoping that this was how read >>> and write were working (i.e. with absolute offsets >>> and not just getting relative offsets from the current >>> seek point), however what of the raw seek >>> command? >>> >>> len = lseek(fd, 0, SEEK_END); >>> >>> Upon successful completion, lseek() returns >>> the resulting offset location as measured in >>> bytes from the beginning of the file. >>> >>> Any idea on where the return value comes from? >>> I will need to fake up a file size for this command .. >>> >>> >>> >>> ----- Original Message ----- >>>> From: "Xavier Hernandez" >>>> To: >>>> Subject: Re: [Gluster-devel] lseek >>>> Date: Mon, 14 May 2012 09:48:17 +0200 >>>> >>>> Hello Ian, >>>> >>>> there is no such thing as an explicit seek in glusterfs. >>> Each readv, >>>> writev, (f)truncate and rchecksum have an offset parameter >>> that tells >>>> you the position where the operation must be performed. >>>> >>>> If you make something that changes the size of the file >>> you must make it >>>> in a way that it is transparent to upper translators. This >>> means that >>>> all offsets you will receive are "real" (in your case, >>> offsets in the >>>> uncompressed version of the file). You should calculate in >>> some way the >>>> equivalent offset in the compressed version of the file >>> and send it to >>>> the correspoding fop of the lower translators. >>>> >>>> In the same way, you must return in all iatt structures >>> the real size of >>>> the file (not the compressed size). >>>> >>>> I'm not sure what is the intended use of NONSEEKABLE, but >>> I think it is >>>> for special file types, like devices or similar that are >>> sequential in >>>> nature. Anyway, this is a fuse flag that you can't return >>> from a regular >>>> translator open fop. >>>> >>>> Xavi >>>> >>>> On 05/14/2012 03:22 AM, Ian Latter wrote: >>>>> Hello, >>>>> >>>>> >>>>> I'm looking for a seek (lseek) implementation in >>>>> one of the modules and I can't see one. >>>>> >>>>> Do I need to care about seeking if my module >>>>> changes the file size (i.e. compresses) in Gluster? >>>>> I would have thought that I did except that I believe >>>>> that what I'm reading is that Gluster returns a >>>>> NONSEEKABLE flag on file open (fuse_kernel.h at >>>>> line 149). Does this mitigate the need to correct >>>>> the user seeks? >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> >>>>> >>>>> -- >>>>> Ian Latter >>>>> Late night coder .. >>>>> http://midnightcode.org/ >>>>> >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at nongnu.org >>>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at nongnu.org >>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >>>> >>> -- >>> Ian Latter >>> Late night coder .. 
>>> http://midnightcode.org/ >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at nongnu.org >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel >> > > -- > Ian Latter > Late night coder .. > http://midnightcode.org/ > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel From kkeithle at redhat.com Mon May 14 14:17:12 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 10:17:12 -0400 Subject: [Gluster-devel] python version In-Reply-To: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> Message-ID: <4FB113E8.0@redhat.com> On 05/13/2012 10:42 AM, Emmanuel Dreyfus wrote: > Hi > > There is a problem with python version detection in the configure > script. The machine on which autotools is ran prior releasing glusterfs > expands AM_PATH_PYTHON into a script that fails to accept python> 2.4. > > As I understand, a solution is to concatenate latest > automake-1.12/m4/python.m4 into glusterfs' aclocal.m4. That way python > up to 3.1 should be accepted. Opinions? The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked by ./autogen.sh file in preparation for building gluster. (You have to run autogen.sh to produce the ./configure file.) aclocal uses whatever python.m4 file you have on your system, e.g. /usr/share/aclocal-1.11/python.m4, which is also from the automake package. I presume whoever packages automake for a particular system is taking into consideration what other packages and versions are standard for the system and picks right version of automake. IOW picks the version of automake that has all the (hard-coded) versions of python to match the python they have on their system. If someone has installed a later version of python and not also updated to a compatible version of automake, that's not a problem that gluster should have to solve, or even try to solve. I don't believe we want to require our build process to download the latest-and-greatest version of automake. As a side note, I sampled a few currently shipping systems and see that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the appearances of supporting python 2.5 (and 3.0). Finally, after all that, note that the configure.ac file appears to be hard-coded to require python 2.x, so if anyone is trying to use python 3.x, that's doomed to fail until configure.ac is "fixed." Do we even know why python 2.x is required and why python 3.x can't be used? -- Kaleb From manu at netbsd.org Mon May 14 14:23:47 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 14:23:47 +0000 Subject: [Gluster-devel] python version In-Reply-To: <4FB113E8.0@redhat.com> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: <20120514142347.GA3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 10:17:12AM -0400, Kaleb S. KEITHLEY wrote: > The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked > by ./autogen.sh file in preparation for building gluster. (You have > to run autogen.sh to produce the ./configure file.) Right, then my plan will not work, and the only way to fix the problem is to upgrade automake on the machine that produces the gluterfs releases. 
> As a side note, I sampled a few currently shipping systems and see > that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and > 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the > appearances of supporting python 2.5 (and 3.0). You seem to take for granted that people building a glusterfs release will run autotools before running configure. This is not the way it should work: a released tarball should contain a configure script that works anywhere. The tarballs released up to at least 3.3.0qa40 have a configure script that cannot detect python > 2.4 -- Emmanuel Dreyfus manu at netbsd.org From kkeithle at redhat.com Mon May 14 14:31:32 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 10:31:32 -0400 Subject: [Gluster-devel] python version In-Reply-To: <20120514142347.GA3985@homeworld.netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> <20120514142347.GA3985@homeworld.netbsd.org> Message-ID: <4FB11744.1040907@redhat.com> On 05/14/2012 10:23 AM, Emmanuel Dreyfus wrote: > On Mon, May 14, 2012 at 10:17:12AM -0400, Kaleb S. KEITHLEY wrote: >> The aclocal.m4 file is produced when (/usr/bin/)aclocal is invoked >> by ./autogen.sh file in preparation for building gluster. (You have >> to run autogen.sh to produce the ./configure file.) > > Right, then my plan will not work, and the only way to fix the problem > is to upgrade automake on the machine that produces the glusterfs > releases. > >> As a side note, I sampled a few currently shipping systems and see >> that the automake shipped with/for Fedora 16 and 17, FreeBSD 8.2 and >> 8.3, and NetBSD 5.1.2, is automake-1.11, which has all the >> appearances of supporting python 2.5 (and 3.0). > > You seem to take for granted that people building a glusterfs > release will run autotools before running configure. This is not > the way it should work: a released tarball should contain a > configure script that works anywhere. The tarballs released up to > at least 3.3.0qa40 have a configure script that cannot detect python> 2.4 > I looked at what I get when I checkout the source from the git repo and what I have to do to build from a freshly checked out source tree. And yes, we need to upgrade the build machines were we package the release tarballs. Right now is not a good time to do that. -- Kaleb From yknev.shankar at gmail.com Mon May 14 15:31:56 2012 From: yknev.shankar at gmail.com (Venky Shankar) Date: Mon, 14 May 2012 21:01:56 +0530 Subject: [Gluster-devel] python version In-Reply-To: <4FB113E8.0@redhat.com> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: [snip] > Finally, after all that, note that the configure.ac file appears to be > hard-coded to require python 2.x, so if anyone is trying to use python 3.x, > that's doomed to fail until configure.ac is "fixed." Do we even know why > python 2.x is required and why python 3.x can't be used? > python 2.x is required by geo-replication. Although geo-replication is code ready for python 3.x, it's not functionally tested with it. That's the reason configure.ac has 2.x hard-coded. > > -- > > Kaleb > > > ______________________________**_________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/**mailman/listinfo/gluster-devel > Thanks, -Venky -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From manu at netbsd.org Mon May 14 15:45:48 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 15:45:48 +0000 Subject: [Gluster-devel] python version In-Reply-To: References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> Message-ID: <20120514154548.GB3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 09:01:56PM +0530, Venky Shankar wrote: > python 2.x is required by geo-replication. Although geo-replication is code > ready for python 3.x, it's not functionally tested with it. That's the > reason configure.ac has 2.x hard-coded. Well, my problem is that python 2.5, python 2.6 and python 2.7 are not detected by configure. One need to patch configure in order to build with python 2.x (x > 4) installed. -- Emmanuel Dreyfus manu at netbsd.org From kkeithle at redhat.com Mon May 14 16:30:12 2012 From: kkeithle at redhat.com (Kaleb S. KEITHLEY) Date: Mon, 14 May 2012 12:30:12 -0400 Subject: [Gluster-devel] python version In-Reply-To: <20120514154548.GB3985@homeworld.netbsd.org> References: <1kk1kjd.1h7jc221px95fwM%manu@netbsd.org> <4FB113E8.0@redhat.com> <20120514154548.GB3985@homeworld.netbsd.org> Message-ID: <4FB13314.3060708@redhat.com> On 05/14/2012 11:45 AM, Emmanuel Dreyfus wrote: > On Mon, May 14, 2012 at 09:01:56PM +0530, Venky Shankar wrote: >> python 2.x is required by geo-replication. Although geo-replication is code >> ready for python 3.x, it's not functionally tested with it. That's the >> reason configure.ac has 2.x hard-coded. > > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > detected by configure. One need to patch configure in order to build > with python 2.x (x> 4) installed. > Seems like it would be easier to get autoconf and automake from the NetBSD packages and just run `./autogen.sh && ./configure` (Which, FWIW, is how glusterfs RPMs are built for the Fedora distributions. I'd wager for much the same reason.) -- Kaleb From manu at netbsd.org Mon May 14 18:46:07 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 14 May 2012 20:46:07 +0200 Subject: [Gluster-devel] python version In-Reply-To: <4FB13314.3060708@redhat.com> Message-ID: <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Kaleb S. KEITHLEY wrote: > > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > > detected by configure. One need to patch configure in order to build > > with python 2.x (x> 4) installed. > > Seems like it would be easier to get autoconf and automake from the > NetBSD packages and just run `./autogen.sh && ./configure` I prefer patching the configure script. Running autogen introduce build dependencies on perl just to substitute a string on a single line: that's overkill. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From abperiasamy at gmail.com Mon May 14 19:25:20 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Mon, 14 May 2012 12:25:20 -0700 Subject: [Gluster-devel] python version In-Reply-To: <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: On Mon, May 14, 2012 at 11:46 AM, Emmanuel Dreyfus wrote: > Kaleb S. KEITHLEY wrote: > >> > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not >> > detected by configure. One need to patch configure in order to build >> > with python 2.x (x> ?4) installed. 
>> >> Seems like it would be easier to get autoconf and automake from the >> NetBSD packages and just run `./autogen.sh && ./configure` > > I prefer patching the configure script. Running autogen introduce build > dependencies on perl just to substitute a string on a single line: > that's overkill. > Who ever builds from source is required to run autogen.sh to produce env specific configure and build files. "configure" script should not be checked into git repository. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Mon May 14 23:58:18 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 14 May 2012 16:58:18 -0700 Subject: [Gluster-devel] python version In-Reply-To: References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: On Mon, May 14, 2012 at 12:25 PM, Anand Babu Periasamy < abperiasamy at gmail.com> wrote: > On Mon, May 14, 2012 at 11:46 AM, Emmanuel Dreyfus > wrote: > > Kaleb S. KEITHLEY wrote: > > > >> > Well, my problem is that python 2.5, python 2.6 and python 2.7 are not > >> > detected by configure. One need to patch configure in order to build > >> > with python 2.x (x> 4) installed. > >> > >> Seems like it would be easier to get autoconf and automake from the > >> NetBSD packages and just run `./autogen.sh && ./configure` > > > > I prefer patching the configure script. Running autogen introduce build > > dependencies on perl just to substitute a string on a single line: > > that's overkill. > > > > Who ever builds from source is required to run autogen.sh to produce > env specific configure and build files. Not quite. That's the whole point of having a configure script in the first place - to detect the environment at build time. One who builds from source should not require to run autogen.sh, just configure should be sufficient. Since configure itself is a generated script, and can possibly have mistakes and requirements change (like the one being discussed), that's when autogen.sh must be used to re-generate configure script. In this case however, the simplest approach would actually be to run autogen.sh till either: a) we upgrade the release build machine to use newer aclocal macros b) qualify geo-replication to work on python 3 and remove the check. Emmanuel, since the problem is not going to be a long lasting one (either of the two should fix your problem), I suggest you find a solution local to you in the interim. Even better, if someone can actually test and qualify geo-replication to work on python 3 it would ease solution "b" sooner. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 15 01:30:21 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 03:30:21 +0200 Subject: [Gluster-devel] python version In-Reply-To: Message-ID: <1kk4971.wh86xo1gypeoiM%manu@netbsd.org> Anand Avati wrote: > a) we upgrade the release build machine to use newer aclocal macros > > b) qualify geo-replication to work on python 3 and remove the check. Solution b is not enough: even if the configure script does not claim a specific version of python, it will still be unable to detect an installed python > 2.4 because it contains that: for am_cv_pathless_PYTHON in python python2 python2.4 python2.3 python2.2 python2.1 python2.0 none; do What about solution c? 
c) Tweak autogen.sh so that it patches generated configure and add the checks for python > 2.4 if they are missing: --- autogen.sh.orig 2012-05-15 03:22:48.000000000 +0200 +++ autogen.sh 2012-05-15 03:24:28.000000000 +0200 @@ -5,4 +5,6 @@ (libtoolize --automake --copy --force || glibtoolize --automake --copy --force) autoconf automake --add-missing --copy --foreign cd argp-standalone;./autogen.sh + +sed 's/for am_cv_pathless_PYTHON in python python2 python2.4/for am_cv_pathless_PYTHON in python python2 python3 python3.2 python3.1 python3.0 python2.7 2.6 python2.5 python2.4/' configure > configure.new && mv configure.new configure -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 15 04:20:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 06:20:29 +0200 Subject: [Gluster-devel] Fixing Address family mess In-Reply-To: Message-ID: <1kk4hl3.1qjswd01knbbvqM%manu@netbsd.org> Anand Babu Periasamy wrote: > AF_UNSPEC is should be be taken as IPv4/IPv6. It is named > appropriately. Default should be ipv4. > > I have not tested the patch. I did test it and it fixed the problem at mine. Here it is in gerrit: http://review.gluster.com/#change,3319 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 15 04:27:26 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 06:27:26 +0200 Subject: [Gluster-devel] NetBSD support in 3.3? Message-ID: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> Hi I still have a few pending submissions for NetBSD support in latest sources: http://review.gluster.com/3319 Use inet as default transport http://review.gluster.com/3320 Add missing (base|dir)name_r http://review.gluster.com/3321 NetBSD build fixes I would like to have 3.3 building without too many unintegrated patches on NetBSD. Is it worth working on pushing the changes above or is release-3.3 too close to release to expect such changes to get into it now? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From amarts at redhat.com Tue May 15 05:51:55 2012 From: amarts at redhat.com (Amar Tumballi) Date: Tue, 15 May 2012 11:21:55 +0530 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> Message-ID: <4FB1EEFB.2020509@redhat.com> On 05/15/2012 09:57 AM, Emmanuel Dreyfus wrote: > Hi > > I still have a few pending submissions for NetBSD support in latest > sources: > http://review.gluster.com/3319 Use inet as default transport > http://review.gluster.com/3320 Add missing (base|dir)name_r > http://review.gluster.com/3321 NetBSD build fixes > > I would like to have 3.3 building without too many unintegrated patches > on NetBSD. Is it worth working on pushing the changes above or is > release-3.3 too close to release to expect such changes to get into it > now? > Emmanuel, I understand your concerns, but I suspect we are very close to 3.3.0 release at this point of time, and hence it may be tight for taking these patches in. What we are planing is for a quicker 3.3.1 depending on the community feedback of 3.3.0 release, which should surely have your patches included. Hope that makes sense. Regards, Amar From manu at netbsd.org Tue May 15 10:13:07 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 10:13:07 +0000 Subject: [Gluster-devel] NetBSD support in 3.3? 
In-Reply-To: <4FB1EEFB.2020509@redhat.com> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> <4FB1EEFB.2020509@redhat.com> Message-ID: <20120515101307.GD3985@homeworld.netbsd.org> On Tue, May 15, 2012 at 11:21:55AM +0530, Amar Tumballi wrote: > I understand your concerns, but I suspect we are very close to 3.3.0 > release at this point of time, and hence it may be tight for taking > these patches in. Riht, I will therefore not request pullups to release-3.3 for theses changes, but I would appreciate if people could review them so that they have a chance to go in master. Will 3.3.1 be based on release-3.3, or will a new branch be forked? -- Emmanuel Dreyfus manu at netbsd.org From vbellur at redhat.com Tue May 15 10:14:38 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 15 May 2012 15:44:38 +0530 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <20120515101307.GD3985@homeworld.netbsd.org> References: <1kk4hmn.t9qjk71rmdx55M%manu@netbsd.org> <4FB1EEFB.2020509@redhat.com> <20120515101307.GD3985@homeworld.netbsd.org> Message-ID: <4FB22C8E.1@redhat.com> On 05/15/2012 03:43 PM, Emmanuel Dreyfus wrote: > Riht, I will therefore not request pullups to release-3.3 for theses > changes, but I would appreciate if people could review them so that they > have a chance to go in master. > > Will 3.3.1 be based on release-3.3, or will a new branch be forked? All 3.3.x releases will be based on release-3.3. It might be a good idea to rebase these changes to release-3.3 after they have been accepted in master. Vijay From manu at netbsd.org Tue May 15 11:51:36 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 15 May 2012 13:51:36 +0200 Subject: [Gluster-devel] NetBSD support in 3.3? In-Reply-To: <4FB22C8E.1@redhat.com> Message-ID: <1kk51xf.8p0t3l1viyp1mM%manu@netbsd.org> Vijay Bellur wrote: > All 3.3.x releases will be based on release-3.3. It might be a good idea > to rebase these changes to release-3.3 after they have been accepted in > master. But after 3.3 release, as I understand. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ej1515.park at samsung.com Wed May 16 12:23:12 2012 From: ej1515.park at samsung.com (=?euc-kr?B?udrAusHY?=) Date: Wed, 16 May 2012 12:23:12 +0000 (GMT) Subject: [Gluster-devel] Asking about Gluster Performance Factors Message-ID: <0M44007MX7QO1Z40@mailout1.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201205162123598_1LI1H0JV.jpg Type: image/jpeg Size: 72722 bytes Desc: not available URL: From johnmark at redhat.com Wed May 16 14:38:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 10:38:50 -0400 (EDT) Subject: [Gluster-devel] Asking about Gluster Performance Factors In-Reply-To: <0M44007MX7QO1Z40@mailout1.samsung.com> Message-ID: <931185f2-f1b7-431f-96a0-1e7cb476b7d7@zmail01.collab.prod.int.phx2.redhat.com> Hi Ethan, ----- Original Message ----- > Dear Gluster Dev Team : > I'm Ethan, Assistant engineer in Samsung electronics. Reviewing your > paper, I have some questions of performance factors in gluster. Which paper? Can you provide a link? Also, please note that this is a community mailing list, and we cannot guarantee quick response times here - if you need a fast response, I'm happy to put you through to the right people. Thanks, John Mark Walker Gluster Community Guy > First, what does it mean the option "performance.cache-*"? Does it > mean read cache? 
If does, what's difference between the options > "prformance.cache-max-file-size" and "performance.cache-size" ? > I read your another paper("performance in a gluster system, versions > 3.1.x") and it says as below on Page 12, > (Gluster Native protocol does not implement write caching, as we > believe that the modest performance improvements from rite caching > do not justify the risk of cache coherency issues.) > Second, how much is the read throughput improved as configuring 2-way > replication? we need any statistics or something like that. > ("performance in a gluster system, versions 3.1.x") and it says as > below on Page 12, > (However, read throughput is generally improved by replication, as > reads can be delivered from either storage node) > I would ask you to return ASAP. From johnmark at redhat.com Wed May 16 15:56:32 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 11:56:32 -0400 (EDT) Subject: [Gluster-devel] Reminder: community.gluster.org In-Reply-To: <4b117086-34aa-4d8b-aede-ffae2e3abfbd@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1bb98699-b028-4f92-b8fd-603056aef57c@zmail01.collab.prod.int.phx2.redhat.com> Greetings all, Just a friendly reminder that we could use your help on community.gluster.org (hereafter 'c.g.o'). Someday in the near future, we will have 2-way synchronization between our mailing lists and c.g.o, but as of now, there are 2 places to ask and answer questions. I ask that for things with definite answers, even if they start out here on the mailing lists, please provide the question and answer on c.g.o. For lengthy conversations about using or developing GlusterFS, including ideas for new ideas, roadmaps, etc., the mailing lists are ideal for that. Why do we prefer c.g.o? Because it's Google-friendly :) So, if you see any existing questions over there that you are qualified to answer, please do weigh in with an answer. And as always, for quick "real-time" help, you're best served by visiting #gluster on the freenode IRC network. This has been a public service announcement from your friendly community guy. -JM From ndevos at redhat.com Wed May 16 19:56:04 2012 From: ndevos at redhat.com (Niels de Vos) Date: Wed, 16 May 2012 21:56:04 +0200 Subject: [Gluster-devel] Updated Wireshark packages for RHEL-6 and Fedora-17 available for testing Message-ID: <4FB40654.60703@redhat.com> Hi all, today I have merged support for GlusterFS 3.2 and 3.3 into one Wireshark 'dissector'. The packages with date 20120516 in the version support both the current stable 3.2.x version, and the latest 3.3.0qa41. Older 3.3.0 versions will likely have issues due to some changes in the RPC-AUTH protocol used. Updating to the latest qa41 release (or newer) is recommended anyway. I do not expect that we'll add support for earlier 3.3.0 releases. My repository with packages for RHEL-6 and Fedora-17 contains a .repo file for yum (save it in /etc/yum.repos.d): - http://repos.fedorapeople.org/repos/devos/wireshark-gluster/ RPMs for other Fedora or RHEL versions can be provided on request. Let me know if you need an other version (or architecture). Single patches for some different Wireshark versions are available from https://github.com/nixpanic/gluster-wireshark. A full history of commits can be found here: - https://github.com/nixpanic/gluster-wireshark-1.4/commits/master/ (Support for GlusterFS 3.3 was added by Akhila and Shree, thanks!) 
Please test and report success and problems, file a issues on github: https://github.com/nixpanic/gluster-wireshark-1.4/issues Some functionality is still missing, but with the current status, it should be good for most analysing already. With more issues filed, it makes it easier to track what items are important. Of course, you can also respond to this email and give feedback :-) After some more cleanup of the code, this dissector will be passed on for review and inclusion in the upstream Wireshark project. Some more testing results is therefore much appreciated. Thanks, Niels From johnmark at redhat.com Wed May 16 21:12:41 2012 From: johnmark at redhat.com (John Mark Walker) Date: Wed, 16 May 2012 17:12:41 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! For GlusterFS 3.3 Beta 4 In-Reply-To: Message-ID: <5456de9c-6c8b-4995-ad1e-720c9c52c74f@zmail01.collab.prod.int.phx2.redhat.com> Greetings, We are planning to have one more beta release tomorrow. If all goes as planned, this will be the release candidate. In conjunction with the beta, I thought we should have a 24-hour GlusterFest, starting tomorrow at 8pm - http://www.gluster.org/community/documentation/index.php/GlusterFest 'What's a GlusterFest?' you may be asking. Well, it's all of the below: - Testing the software. Install the new beta (when it's released tomorrow) and put it through its paces. We will put some basic testing procedures on the GlusterFest page here - http://www.gluster.org/community/documentation/index.php/GlusterFest - Feel free to create your own testing procedures and link to it from the GlusterFest page - Finding bugs. See the current list of bugs targeted for this release: http://bit.ly/beta4bugs - Fixing bugs. If you're the kind of person who wants to submit patches, see our development workflow doc: http://www.gluster.org/community/documentation/index.php/Development_Work_Flow - and then get to know Gerritt: http://review.gluster.com/ The GlusterFest page will be updated with some basic testing procedures tomorrow, and GlusterFest will officially begin at 8pm PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. If you need assistance, see #gluster on Freenode for "real-time" questions, gluster-users and community.gluster.org for general usage questions, and gluster-devel for anything related to building, patching, and bug-fixing. To keep up with GlusterFest activity, I'll be sending updates from the @glusterorg account on Twitter, and I'm sure there will be traffic on the mailing lists, as well. Happy testing and bug-hunting! -JM From ej1515.park at samsung.com Thu May 17 01:08:50 2012 From: ej1515.park at samsung.com (=?euc-kr?B?udrAusHY?=) Date: Thu, 17 May 2012 01:08:50 +0000 (GMT) Subject: [Gluster-devel] Asking about Gluster Performance Factors Message-ID: <0M4500FX676Q1150@mailout4.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201205171008201_QKNMBDIF.jpg Type: image/jpeg Size: 72722 bytes Desc: not available URL: From johnmark at redhat.com Thu May 17 04:28:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 00:28:50 -0400 (EDT) Subject: [Gluster-devel] Fwd: Asking about Gluster Performance Factors In-Reply-To: Message-ID: <153525d7-fe8c-4f5c-aa06-097fcb4b0980@zmail01.collab.prod.int.phx2.redhat.com> See response below from Ben England. Also, note that this question should probably go in gluster-users. 
-JM ----- Forwarded Message ----- From: "Ben England" To: "John Mark Walker" Sent: Wednesday, May 16, 2012 8:23:30 AM Subject: Re: [Gluster-devel] Asking about Gluster Performance Factors JM, see comments marked with ben>>> below. ----- Original Message ----- From: "???" To: gluster-devel at nongnu.org Sent: Wednesday, May 16, 2012 5:23:12 AM Subject: [Gluster-devel] Asking about Gluster Performance Factors Samsung Enterprise Portal mySingle May 16, 2012 Dear Gluster Dev Team : I'm Ethan, Assistant engineer in Samsung electronics. Reviewing your paper, I have some questions of performance factors in gluster. First, what does it mean the option "performance.cache-*"? Does it mean read cache? If does, what's difference between the options "prformance.cache-max-file-size" and "performance.cache-size" ? I read your another paper("performance in a gluster system, versions 3.1.x") and it says as below on Page 12, (Gluster Native protocol does not implement write caching, as we believe that the modest performance improvements from rite caching do not justify the risk of cache coherency issues.) ben>>> While gluster processes do not implement write caching internally, there are at least 3 ways to improve write performance in a Gluster system. - If you use a RAID controller with a non-volatile writeback cache, the RAID controller can buffer writes on behalf of the Gluster server and thereby reduce latency. - XFS or any other local filesystem used within the server "bricks" can do "write-thru" caching, meaning that the writes can be aggregated and can be kept in the Linux buffer cache so that subsequent read requests can be satisfied from this cache, transparent to Gluster processes. - there is a "write-behind" translator in the native client that will aggregate small sequential write requests at the FUSE layer into larger network-level write requests. If the smallest possible application I/O size is a requirement, sequential writes can also be efficiently aggregated by an NFS client. Second, how much is the read throughput improved as configuring 2-way replication? we need any statistics or something like that. ("performance in a gluster system, versions 3.1.x") and it says as below on Page 12, (However, read throughput is generally improved by replication, as reads can be delivered from either storage node) ben>>> Yes, reads can be satisfied by either server in a replication pair. Since the gluster native client only reads one of the two replicas, read performance should be approximately the same for 2-replica file system as it would be for a 1-replica file system. The difference in performance is with writes, as you would expect. Sincerely yours, Ethan Eunjun Park Assistant Engineer, Solution Development Team, Media Solution Center 416, Maetan 3-dong, Yeongtong-gu, Suwon-si, Gyeonggi-do 443-742, Korea Mobile : 010-8609-9532 E-mail : ej1515.park at samsung.com http://www.samsung.com/sec _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From johnmark at redhat.com Thu May 17 06:35:10 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 02:35:10 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: Message-ID: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? -JM ----- Forwarded Message ----- From: "Kaushal M (Code Review)" Sent: Wednesday, May 16, 2012 11:32:26 PM Subject: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() Kaushal M has uploaded a new change for review. Change subject: client/protocol : Changes in client3_1_getxattr() ...................................................................... client/protocol : Changes in client3_1_getxattr() Backporting change 1d02db63ae from master. Copy args->loc to local->loc in client3_1_getxattr(). This prevents logs with "(null) (--)" in client3_1_getxattr_cbk(). Also save args->name in local->name and print it in the log as well. BUG: 812199 Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Signed-off-by: Kaushal M --- M xlators/protocol/client/src/client-helpers.c M xlators/protocol/client/src/client.h M xlators/protocol/client/src/client3_1-fops.c 3 files changed, 11 insertions(+), 2 deletions(-) git pull ssh://*/glusterfs refs/changes/50/3350/1 -- To view, visit http://review.gluster.com/3350 To unsubscribe, visit http://review.gluster.com/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Gerrit-PatchSet: 1 Gerrit-Project: glusterfs Gerrit-Branch: release-3.3 Gerrit-Owner: Kaushal M From rajesh at redhat.com Thu May 17 06:42:56 2012 From: rajesh at redhat.com (Rajesh Amaravathi) Date: Thu, 17 May 2012 02:42:56 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: +1 Regards, Rajesh Amaravathi, Software Engineer, GlusterFS RedHat Inc. ----- Original Message ----- From: "John Mark Walker" To: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 12:05:10 PM Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? -JM ----- Forwarded Message ----- From: "Kaushal M (Code Review)" Sent: Wednesday, May 16, 2012 11:32:26 PM Subject: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() Kaushal M has uploaded a new change for review. Change subject: client/protocol : Changes in client3_1_getxattr() ...................................................................... client/protocol : Changes in client3_1_getxattr() Backporting change 1d02db63ae from master. Copy args->loc to local->loc in client3_1_getxattr(). This prevents logs with "(null) (--)" in client3_1_getxattr_cbk(). Also save args->name in local->name and print it in the log as well. 
BUG: 812199 Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Signed-off-by: Kaushal M --- M xlators/protocol/client/src/client-helpers.c M xlators/protocol/client/src/client.h M xlators/protocol/client/src/client3_1-fops.c 3 files changed, 11 insertions(+), 2 deletions(-) git pull ssh://*/glusterfs refs/changes/50/3350/1 -- To view, visit http://review.gluster.com/3350 To unsubscribe, visit http://review.gluster.com/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I5419f6a244de93dd1a96ac8e229be3ecdc9f456e Gerrit-PatchSet: 1 Gerrit-Project: glusterfs Gerrit-Branch: release-3.3 Gerrit-Owner: Kaushal M _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From vijay at gluster.com Thu May 17 06:55:42 2012 From: vijay at gluster.com (Vijay Bellur) Date: Thu, 17 May 2012 12:25:42 +0530 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> References: <3926b14e-cc21-4f4f-b160-a046518fef1d@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FB4A0EE.40102@gluster.com> On 05/17/2012 12:05 PM, John Mark Walker wrote: > I was thinking about sending these gerritt notifications to gluster-devel by default - what do y'all think? Gerrit automatically sends out a notification to all registered users who are watching the project. Do we need an additional notification to gluster-devel if there's a considerable overlap between registered users of gluster-devel and gerrit? -Vijay From johnmark at redhat.com Thu May 17 07:26:23 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 03:26:23 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4A0EE.40102@gluster.com> Message-ID: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> There are close to 600 people now subscribed to gluster-devel - how many of them actually have an account on Gerritt? I honestly have no idea. Another thing this would do is send a subtle message to subscribers that this is not the place to discuss user issues, but perhaps there are better ways to do that. I've seen many projects do this - as well as send all bugzilla and github notifications, but I could also see some people getting annoyed. -JM ----- Original Message ----- > On 05/17/2012 12:05 PM, John Mark Walker wrote: > > I was thinking about sending these gerritt notifications to > > gluster-devel by default - what do y'all think? > > Gerrit automatically sends out a notification to all registered users > who are watching the project. Do we need an additional notification > to > gluster-devel if there's a considerable overlap between registered > users > of gluster-devel and gerrit? 
> > > -Vijay > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From ashetty at redhat.com Thu May 17 07:35:27 2012 From: ashetty at redhat.com (Anush Shetty) Date: Thu, 17 May 2012 13:05:27 +0530 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> References: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FB4AA3F.1090700@redhat.com> On 05/17/2012 12:56 PM, John Mark Walker wrote: > There are close to 600 people now subscribed to gluster-devel - how many of them actually have an account on Gerritt? I honestly have no idea. Another thing this would do is send a subtle message to subscribers that this is not the place to discuss user issues, but perhaps there are better ways to do that. > > I've seen many projects do this - as well as send all bugzilla and github notifications, but I could also see some people getting annoyed. > How about a weekly digest of the same. - Anush From manu at netbsd.org Thu May 17 09:02:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 17 May 2012 11:02:32 +0200 Subject: [Gluster-devel] Crashes with latest git code Message-ID: <1kk8js0.b6kp732ejixeM%manu@netbsd.org> Hi I get a lot of crashes on NetBSD with latest git code. looking at core dumps, it is obvious I get memory corruption, as I find various structure overwritten by texts (file path or content). Linking with electric fence produces a much earlier crash, always at the same place. Here is how it looks: Program terminated with signal 11, Segmentation fault. #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 (gdb) bt #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 #1 0xbaa5ec1e in gf_strdup (src=0x0) at ../../../../libglusterfs/src/mem-pool.h:119 #2 0xbaa76dbf in client3_1_getxattr (frame=0xbb77f5c0, this=0xba3cd000, data=0xbfbfe18c) at client3_1-fops.c:4641 #3 0xbaa59ab8 in client_getxattr (frame=0xbb77f5c0, this=0xba3cd000, loc=0xb9402dd0, name=0x0, xdata=0x0) at client.c:1452 #4 0xb9ac3c7d in afr_sh_metadata_sync_prepare (frame=0xba8026bc, this=0xba3ce000) at afr-self-heal-metadata.c:419 #5 0xb9ac428b in afr_sh_metadata_fix (frame=0xba8026bc, this=0xba3ce000, op_ret=0, op_errno=0) at afr-self-heal-metadata.c:522 #6 0xb9abeb2b in afr_sh_common_lookup_cbk (frame=0xba8026bc, cookie=0x1, this=0xba3ce000, op_ret=0, op_errno=0, inode=0xb8b001a0, buf=0xbfbfe424, xattr=0xba401394, postparent=0xbfbfe3bc) at afr-self-heal-common.c:1311 #7 0xbaa6dc10 in client3_1_lookup_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77f550) at client3_1-fops.c:2636 Frame 4 is this: STACK_WIND (frame, afr_sh_metadata_getxattr_cbk, priv->children[source], priv->children[source]->fops->getxattr, &local->loc, NULL, NULL); Then in frame 3, I get args.name = NULL client_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc, const char *name, dict_t *xdata) (...) args.name = name; (...) ret = proc->fn (frame, this, &args); In frame 2, args->name = NULL client3_1_getxattr (call_frame_t *frame, xlator_t *this, void *data) (...) args = data; (...) local->name = gf_strdup (args->name); And there we will crash in gf_strdup(). The root cause is afr_sh_metadata_sync_prepare() calling client_getxattr with NULL arguments. 
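For context, gf_strdup() is a thin inline wrapper in libglusterfs/src/mem-pool.h that calls strlen() on its argument unconditionally; from memory it looks roughly like this (paraphrased, not a verbatim copy of the tree):

static inline char *
gf_strdup (const char *src)
{
        char   *dup_str = NULL;
        size_t  len = 0;

        len = strlen (src) + 1;   /* this is where it crashes when src is NULL */

        dup_str = GF_CALLOC (1, len, gf_common_mt_strdup);
        if (!dup_str)
                return NULL;

        memcpy (dup_str, src, len);

        return dup_str;
}

so any caller that passes a NULL name ends up in strlen(NULL).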
The fix is beyond my knowledge of glusterfs internals, but I am sure that some folks here will be able to comment. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From rabhat at redhat.com Thu May 17 09:11:55 2012 From: rabhat at redhat.com (Raghavendra Bhat) Date: Thu, 17 May 2012 05:11:55 -0400 (EDT) Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <1kk8js0.b6kp732ejixeM%manu@netbsd.org> Message-ID: <3044c1d3-8b15-4a9d-9d18-7343cf8a33f4@zmail01.collab.prod.int.phx2.redhat.com> Hi Emmanuel, A bug has already been filed for this (822385) and patch has been sent for the review (http://review.gluster.com/#change,3353). Regards, Raghavendra Bhat ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 2:32:32 PM Subject: [Gluster-devel] Crashes with latest git code Hi I get a lot of crashes on NetBSD with latest git code. looking at core dumps, it is obvious I get memory corruption, as I find various structure overwritten by texts (file path or content). Linking with electric fence produces a much earlier crash, always at the same place. Here is how it looks: Program terminated with signal 11, Segmentation fault. #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 (gdb) bt #0 0xbb8aab70 in strlen () from /usr/lib/libc.so.12 #1 0xbaa5ec1e in gf_strdup (src=0x0) at ../../../../libglusterfs/src/mem-pool.h:119 #2 0xbaa76dbf in client3_1_getxattr (frame=0xbb77f5c0, this=0xba3cd000, data=0xbfbfe18c) at client3_1-fops.c:4641 #3 0xbaa59ab8 in client_getxattr (frame=0xbb77f5c0, this=0xba3cd000, loc=0xb9402dd0, name=0x0, xdata=0x0) at client.c:1452 #4 0xb9ac3c7d in afr_sh_metadata_sync_prepare (frame=0xba8026bc, this=0xba3ce000) at afr-self-heal-metadata.c:419 #5 0xb9ac428b in afr_sh_metadata_fix (frame=0xba8026bc, this=0xba3ce000, op_ret=0, op_errno=0) at afr-self-heal-metadata.c:522 #6 0xb9abeb2b in afr_sh_common_lookup_cbk (frame=0xba8026bc, cookie=0x1, this=0xba3ce000, op_ret=0, op_errno=0, inode=0xb8b001a0, buf=0xbfbfe424, xattr=0xba401394, postparent=0xbfbfe3bc) at afr-self-heal-common.c:1311 #7 0xbaa6dc10 in client3_1_lookup_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77f550) at client3_1-fops.c:2636 Frame 4 is this: STACK_WIND (frame, afr_sh_metadata_getxattr_cbk, priv->children[source], priv->children[source]->fops->getxattr, &local->loc, NULL, NULL); Then in frame 3, I get args.name = NULL client_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc, const char *name, dict_t *xdata) (...) args.name = name; (...) ret = proc->fn (frame, this, &args); In frame 2, args->name = NULL client3_1_getxattr (call_frame_t *frame, xlator_t *this, void *data) (...) args = data; (...) local->name = gf_strdup (args->name); And there we will crash in gf_strdup(). The root cause is afr_sh_metadata_sync_prepare() calling client_getxattr with NULL arguments. The fix is beyond my knowledge of glusterfs internals, but I am sure that some folks here will be able to comment. 
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From manu at netbsd.org Thu May 17 09:18:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 17 May 2012 11:18:29 +0200 Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <3044c1d3-8b15-4a9d-9d18-7343cf8a33f4@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kk8kq6.1m2qnzf1so7qgfM%manu@netbsd.org> Raghavendra Bhat wrote: > A bug has already been filed for this (822385) and patch has been sent for > the review (http://review.gluster.com/#change,3353). I looked at the patch, it does not fix the problem I reported: args->name is still NULL. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From rabhat at redhat.com Thu May 17 09:46:20 2012 From: rabhat at redhat.com (Raghavendra Bhat) Date: Thu, 17 May 2012 05:46:20 -0400 (EDT) Subject: [Gluster-devel] Crashes with latest git code In-Reply-To: <1kk8kq6.1m2qnzf1so7qgfM%manu@netbsd.org> Message-ID: In getxattr name is NULL means its equivalent listxattr. So args->name being NULL is ok. Process was crashing because it tried to do strdup (actually strlen in the gf_strdup) of the NULL pointer to a string. On wire we will send it as a null string with namelen set to 0 and protocol/server will understand it. On client side: req.name = (char *)args->name; if (!req.name) { req.name = ""; req.namelen = 0; } On server side: if (args.namelen) state->name = gf_strdup (args.name); ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Raghavendra Bhat" Cc: gluster-devel at nongnu.org Sent: Thursday, May 17, 2012 2:48:29 PM Subject: Re: [Gluster-devel] Crashes with latest git code Raghavendra Bhat wrote: > A bug has already been filed for this (822385) and patch has been sent for > the review (http://review.gluster.com/#change,3353). I looked at the patch, it does not fix the problem I reported: args->name is still NULL. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From jdarcy at redhat.com Thu May 17 11:47:52 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 17 May 2012 07:47:52 -0400 Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4AA3F.1090700@redhat.com> References: <9e2c5f73-794f-46c2-8202-38be4c3a2ed7@zmail01.collab.prod.int.phx2.redhat.com> <4FB4AA3F.1090700@redhat.com> Message-ID: <4FB4E568.8050601@redhat.com> On 05/17/2012 03:35 AM, Anush Shetty wrote: > > On 05/17/2012 12:56 PM, John Mark Walker wrote: >> There are close to 600 people now subscribed to gluster-devel - how many >> of them actually have an account on Gerritt? I honestly have no idea. >> Another thing this would do is send a subtle message to subscribers that >> this is not the place to discuss user issues, but perhaps there are better >> ways to do that. >> >> I've seen many projects do this - as well as send all bugzilla and github >> notifications, but I could also see some people getting annoyed. > > How about a weekly digest of the same. Excellent idea. 
From johnmark at redhat.com Thu May 17 16:15:59 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 12:15:59 -0400 (EDT) Subject: [Gluster-devel] Fwd: Change in glusterfs[release-3.3]: client/protocol : Changes in client3_1_getxattr() In-Reply-To: <4FB4E568.8050601@redhat.com> Message-ID: ----- Original Message ----- > On 05/17/2012 03:35 AM, Anush Shetty wrote: > > > > How about a weekly digest of the same. Sounds reasonable. Now we just have to figure out how to implement :) -JM From vijay at build.gluster.com Thu May 17 16:51:43 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Thu, 17 May 2012 09:51:43 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0beta4 released Message-ID: <20120517165144.1BB041803EB@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0beta4/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0beta4.tar.gz This release is made off From johnmark at redhat.com Thu May 17 18:08:01 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 14:08:01 -0400 (EDT) Subject: [Gluster-devel] glusterfs-3.3.0beta4 released In-Reply-To: <20120517165144.1BB041803EB@build.gluster.com> Message-ID: <864fe250-bfd3-49ca-9310-2fc601411b83@zmail01.collab.prod.int.phx2.redhat.com> Reminder: GlusterFS 3.3 has been branched on GitHub, so you can pull the latest code from this branch if you want to test new fixes after the beta was released: https://github.com/gluster/glusterfs/tree/release-3.3 Also, note that this release features a license change in some files. We noted that some developers could not contribute code to the project because of compatibility issues around GPLv3. So, as a compromise, we changed the licensing in files that we deemed client-specific to allow for more contributors and a stronger developer community. Those files are now dual-licensed under the LGPLv3 and the GPLv2. For text of both of these license, see these URLs: http://www.gnu.org/licenses/lgpl.html http://www.gnu.org/licenses/old-licenses/gpl-2.0.html To see the list of files we modified with the new licensing, see this patchset from Kaleb: http://review.gluster.com/#change,3304 If you have questions or comments about this change, please do reach out to me. Thanks, John Mark ----- Original Message ----- > > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0beta4/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0beta4.tar.gz > > This release is made off > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From johnmark at redhat.com Thu May 17 20:34:56 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 17 May 2012 16:34:56 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! For GlusterFS 3.3 Beta 4 In-Reply-To: <5456de9c-6c8b-4995-ad1e-720c9c52c74f@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <88ecc073-688f-4edc-8ff3-ccba3b6142a3@zmail01.collab.prod.int.phx2.redhat.com> An update: Kaleb was kind enough to port his HekaFS testing page for Fedora to GlusterFS. If you're looking for a series of things to test, see this URL: http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests By tonight, I'll have a handy form for reporting your results. We are at T-6:30 hours and counting until GlusterFest begins in earnest. 
For all updates related to GlusterFest, see this page: http://www.gluster.org/community/documentation/index.php/GlusterFest Please do post any series of tests that you would like to run. In particular, we're looking to test some of the new features of GlusterFS 3.3: - Object storage - HDFS compatibility library - Granular locking - More proactive self-heal Happy hacking, JM ----- Original Message ----- > Greetings, > > We are planning to have one more beta release tomorrow. If all goes > as planned, this will be the release candidate. In conjunction with > the beta, I thought we should have a 24-hour GlusterFest, starting > tomorrow at 8pm - > http://www.gluster.org/community/documentation/index.php/GlusterFest > > 'What's a GlusterFest?' you may be asking. Well, it's all of the > below: > > > - Testing the software. Install the new beta (when it's released > tomorrow) and put it through its paces. We will put some basic > testing procedures on the GlusterFest page here - > http://www.gluster.org/community/documentation/index.php/GlusterFest > > - Feel free to create your own testing procedures and link to it > from the GlusterFest page > > > - Finding bugs. See the current list of bugs targeted for this > release: http://bit.ly/beta4bugs > > > - Fixing bugs. If you're the kind of person who wants to submit > patches, see our development workflow doc: > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > > - and then get to know Gerritt: http://review.gluster.com/ > > > The GlusterFest page will be updated with some basic testing > procedures tomorrow, and GlusterFest will officially begin at 8pm > PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup > tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. > > > If you need assistance, see #gluster on Freenode for "real-time" > questions, gluster-users and community.gluster.org for general usage > questions, and gluster-devel for anything related to building, > patching, and bug-fixing. > > > To keep up with GlusterFest activity, I'll be sending updates from > the @glusterorg account on Twitter, and I'm sure there will be > traffic on the mailing lists, as well. > > > Happy testing and bug-hunting! > > -JM > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From manu at netbsd.org Fri May 18 07:49:29 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 18 May 2012 07:49:29 +0000 Subject: [Gluster-devel] python version In-Reply-To: References: <4FB13314.3060708@redhat.com> <1kk3qy7.41zpkmegdsm4M%manu@netbsd.org> Message-ID: <20120518074929.GJ3985@homeworld.netbsd.org> On Mon, May 14, 2012 at 04:58:18PM -0700, Anand Avati wrote: > Emmanuel, since the problem is not going to be a long lasting one (either > of the two should fix your problem), I suggest you find a solution local to > you in the interim. I submitted a tiny hack that solves the problem for everyone until automake is upgraded on glusterfs build system: http://review.gluster.com/3360 -- Emmanuel Dreyfus manu at netbsd.org From johnmark at redhat.com Fri May 18 15:02:50 2012 From: johnmark at redhat.com (John Mark Walker) Date: Fri, 18 May 2012 11:02:50 -0400 (EDT) Subject: [Gluster-devel] GlusterFest! 
For GlusterFS 3.3 Beta 4 In-Reply-To: <88ecc073-688f-4edc-8ff3-ccba3b6142a3@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: Looks like we have a few testers who have reported their results already: http://www.gluster.org/community/documentation/index.php/GlusterFest 12 more hours! -JM ----- Original Message ----- > An update: > > Kaleb was kind enough to port his HekaFS testing page for Fedora to > GlusterFS. If you're looking for a series of things to test, see > this URL: > http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests > > > By tonight, I'll have a handy form for reporting your results. We are > at T-6:30 hours and counting until GlusterFest begins in earnest. > For all updates related to GlusterFest, see this page: > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > Please do post any series of tests that you would like to run. In > particular, we're looking to test some of the new features of > GlusterFS 3.3: > > - Object storage > - HDFS compatibility library > - Granular locking > - More proactive self-heal > > > Happy hacking, > JM > > > ----- Original Message ----- > > Greetings, > > > > We are planning to have one more beta release tomorrow. If all goes > > as planned, this will be the release candidate. In conjunction with > > the beta, I thought we should have a 24-hour GlusterFest, starting > > tomorrow at 8pm - > > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > > 'What's a GlusterFest?' you may be asking. Well, it's all of the > > below: > > > > > > - Testing the software. Install the new beta (when it's released > > tomorrow) and put it through its paces. We will put some basic > > testing procedures on the GlusterFest page here - > > http://www.gluster.org/community/documentation/index.php/GlusterFest > > > > - Feel free to create your own testing procedures and link to it > > from the GlusterFest page > > > > > > - Finding bugs. See the current list of bugs targeted for this > > release: http://bit.ly/beta4bugs > > > > > > - Fixing bugs. If you're the kind of person who wants to submit > > patches, see our development workflow doc: > > http://www.gluster.org/community/documentation/index.php/Development_Work_Flow > > > > - and then get to know Gerritt: http://review.gluster.com/ > > > > > > The GlusterFest page will be updated with some basic testing > > procedures tomorrow, and GlusterFest will officially begin at 8pm > > PDT May 17/03:00 UTC May 18 (coinciding with the end of our meetup > > tomorrow), and ending at 8pm PDT May 18/03:00 UTC May 19. > > > > > > If you need assistance, see #gluster on Freenode for "real-time" > > questions, gluster-users and community.gluster.org for general > > usage > > questions, and gluster-devel for anything related to building, > > patching, and bug-fixing. > > > > > > To keep up with GlusterFest activity, I'll be sending updates from > > the @glusterorg account on Twitter, and I'm sure there will be > > traffic on the mailing lists, as well. > > > > > > Happy testing and bug-hunting! 
> > > > -JM > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at nongnu.org > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > > From manu at netbsd.org Fri May 18 16:15:20 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 18 May 2012 16:15:20 +0000 Subject: [Gluster-devel] memory corruption in release-3.3 Message-ID: <20120518161520.GL3985@homeworld.netbsd.org> Hi I still get crashes caused by memory corruption with latest release-3.3. My test case is a rm -Rf on a large tree. It seems I crash in two places: First crash flavor (trav is sometimes unmapped memory, sometimes NULL) #0 0xbbbb60ad in __iobuf_select_arena (iobuf_pool=0xbb70d400, page_size=128) at iobuf.c:453 453 if (trav->passive_cnt) { (gdb) print trav $1 = (struct iobuf_arena *) 0x414d202c (gdb) bt #0 0xbbbb60ad in __iobuf_select_arena (iobuf_pool=0xbb70d400, page_size=128) at iobuf.c:453 #1 0xbbbb655a in iobuf_get2 (iobuf_pool=0xbb70d400, page_size=24) at iobuf.c:604 #2 0xbaa549c7 in client_submit_request () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #3 0xbaa732c5 in client3_1_open () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #4 0xbaa574e6 in client_open () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #5 0xb9abac10 in afr_sh_data_open () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #6 0xb9abacb9 in afr_self_heal_data () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #7 0xb9ac2751 in afr_sh_metadata_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #8 0xb9ac457a in afr_self_heal_metadata () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #9 0xb9abd93f in afr_sh_missing_entries_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #10 0xb9ac169b in afr_self_heal () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #11 0xb9ae2e5b in afr_launch_self_heal () #12 0xb9ae3de9 in afr_lookup_perform_self_heal () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #13 0xb9ae4804 in afr_lookup_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #14 0xb9ae4fab in afr_lookup_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #15 0xbaa6dc10 in client3_1_lookup_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #16 0xbbb69716 in rpc_clnt_handle_reply () from /usr/local/lib/libgfrpc.so.0 #17 0xbbb699b3 in rpc_clnt_notify () from /usr/local/lib/libgfrpc.so.0 #18 0xbbb65989 in rpc_transport_notify () from /usr/local/lib/libgfrpc.so.0 #19 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #20 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #21 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #22 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #23 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #24 0x08050078 in main () Second crash flavor (it looks more like a double free) Program terminated with signal 11, Segmentation fault. #0 0xbb92661e in ?? () from /lib/libc.so.12 (gdb) bt #0 0xbb92661e in ?? 
() from /lib/libc.so.12 #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb376f in __gf_free (free_ptr=0xbb70d160) at mem-pool.c:258 #3 0xbbb7e17d in data_destroy (data=0xba301d4c) at dict.c:135 #4 0xbbb7ee18 in data_unref (this=0xba301d4c) at dict.c:470 #5 0xbbb7eb6b in dict_destroy (this=0xba4022d0) at dict.c:395 #6 0xbbb7ecab in dict_unref (this=0xba4022d0) at dict.c:432 #7 0xbaa164ba in __qr_inode_free () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #8 0xbaa27164 in qr_forget () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #9 0xbbb9b221 in __inode_destroy (inode=0xb8b017e4) at inode.c:320 #10 0xbbb9d0a5 in inode_table_prune (table=0xba3cc160) at inode.c:1235 #11 0xbbb9b64e in inode_unref (inode=0xb8b017e4) at inode.c:445 #12 0xbbb85249 in loc_wipe (loc=0xb9402dd0) at xlator.c:530 #13 0xb9ae126e in afr_local_cleanup () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #14 0xb9a9c66b in afr_unlink_done () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #15 0xb9ad2d5b in afr_unlock_common_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so #16 0xb9ad38a2 in afr_unlock_entrylk_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/cluster/replicate.so ---Type to continue, or q to quit--- #17 0xbaa68370 in client3_1_entrylk_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/protocol/client.so #18 0xbbb69716 in rpc_clnt_handle_reply () from /usr/local/lib/libgfrpc.so.0 #19 0xbbb699b3 in rpc_clnt_notify () from /usr/local/lib/libgfrpc.so.0 #20 0xbbb65989 in rpc_transport_notify () from /usr/local/lib/libgfrpc.so.0 #21 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #22 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #23 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #24 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #25 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #26 0x08050078 in main () (gdb) frame 2 #2 0xbbbb376f in __gf_free (free_ptr=0xbb70d160) at mem-pool.c:258 258 FREE (free_ptr); (gdb) x/1w free_ptr 0xbb70d160: 538978863 -- Emmanuel Dreyfus manu at netbsd.org From amarts at redhat.com Sat May 19 06:15:09 2012 From: amarts at redhat.com (Amar Tumballi) Date: Sat, 19 May 2012 11:45:09 +0530 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <20120518161520.GL3985@homeworld.netbsd.org> References: <20120518161520.GL3985@homeworld.netbsd.org> Message-ID: <4FB73A6D.9050601@redhat.com> On 05/18/2012 09:45 PM, Emmanuel Dreyfus wrote: > Hi > > I still get crashes caused by memory corruption with latest release-3.3. > My test case is a rm -Rf on a large tree. It seems I crash in two places: > Emmanuel, Can you please file bug report? different bugs corresponding to different crash dumps will help us. That helps in tracking development internally. Regards, Amar From manu at netbsd.org Sat May 19 10:29:55 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 12:29:55 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <20120518161520.GL3985@homeworld.netbsd.org> Message-ID: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Second crash flavor (it looks more like a double free) Here it is again at a different place. This is in loc_wipe, where loc->path is free'ed. 
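For reference, the relevant part of loc_wipe() is just this (paraphrased from xlator.c, so line numbers and details may differ slightly):

        if (loc->path) {
                GF_FREE ((char *)loc->path);
                loc->path = NULL;
        }

which means whatever gets stored in loc->path must be heap memory owned by that loc, allocated exactly once with the GF_MALLOC/gf_strdup family.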
Looking at the code, I see that there are places where loc->path is allocated by gf_strdup(). I see other places where it is copied from another buffer. Since this is done without reference counts, it seems likely that there is a double free somewhere. Opinions? (gdb) bt #0 0xbb92652a in ?? () from /lib/libc.so.12 #1 0xbb92891b in free () from /lib/libc.so.12 #2 0xbbbb376f in __gf_free (free_ptr=0xb8250040) at mem-pool.c:258 #3 0xbbb85269 in loc_wipe (loc=0xba4cd010) at xlator.c:534 #4 0xbaa5e68a in client_local_wipe (local=0xba4cd010) at client-helpers.c:125 #5 0xbaa614d5 in client3_1_open_cbk (req=0xb92010d8, iov=0xb92010f8, count=1, myframe=0xbb77fa20) at client3_1-fops.c:421 #6 0xbbb69716 in rpc_clnt_handle_reply (clnt=0xba3c51c0, pollin=0xbb77d220) at rpc-clnt.c:788 #7 0xbbb699b3 in rpc_clnt_notify (trans=0xbb70ec00, mydata=0xba3c51e0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-clnt.c:907 #8 0xbbb65989 in rpc_transport_notify (this=0xbb70ec00, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-transport.c:489 #9 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #10 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #11 0xbbbb270f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=3) at event.c:357 #12 0xbbbb297b in event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #13 0xbbbb2ca7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #14 0x08050078 in main () -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vijay at build.gluster.com Sat May 19 12:35:21 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Sat, 19 May 2012 05:35:21 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa42 released Message-ID: <20120519123524.842501803FC@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa42/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa42.tar.gz This release is made off v3.3.0qa42 From manu at netbsd.org Sat May 19 13:50:25 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 15:50:25 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Message-ID: <1kkcml0.c7hab41bl4auaM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Looking at the code, I see that there are places where loc->path is > allocated by gf_strdup(). I see other places where it is copied from > another buffer. Since this is done without reference counts, it seems > likely that there is a double free somewhere. Opinions? I added a second argument to gf_strdup() so that the calling function can pass __func__, and I started logging gf_strdup() allocations to track a possible double free. ANd the result is... the offending free() is done on a loc->path that was not allocated by gf_strdup(). Can it be allocated by another function? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 19 15:07:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 17:07:53 +0200 Subject: [Gluster-devel] memory corruption in release-3.3 In-Reply-To: <1kkccac.1lkm3tq166jzftM%manu@netbsd.org> Message-ID: <1kkcpny.16h3fbd1pfhutzM%manu@netbsd.org> Emmanuel Dreyfus wrote: > Looking at the code, I see that there are places where loc->path is > allocated by gf_strdup(). I see other places where it is copied from > another buffer. 
Since this is done without reference counts, it seems > likely that there is a double free somewhere. Opinions? I found a bug: Thou shalt not free(3) memory dirname(3) returned On Linux basename() and dirname() return a pointer with the string passed as argument. On BSD flavors, basename() and dirname() return static storage, or pthread specific storage. Both behaviour are compliant, but calling free on the result in the second case is a bug. --- xlators/cluster/afr/src/afr-dir-write.c.orig 2012-05-19 16:45:30.000000000 +0200 +++ xlators/cluster/afr/src/afr-dir-write.c 2012-05-19 17:03:17.000000000 +0200 @@ -55,14 +55,22 @@ if (op_errno) *op_errno = ENOMEM; goto out; } - parent->path = dirname (child_path); + parent->path = gf_strdup( dirname (child_path) ); + if (!parent->path) { + if (op_errno) + *op_errno = ENOMEM; + goto out; + } parent->inode = inode_ref (child->parent); uuid_copy (parent->gfid, child->pargfid); ret = 0; out: + if (child_path) + GF_FREE(child_path); + return ret; } /* {{{ create */-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 19 17:34:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 19 May 2012 19:34:51 +0200 Subject: [Gluster-devel] mkdir race condition Message-ID: <1kkcuhi.1vqgbxy1lxb8w2M%manu@netbsd.org> On a replicated volume, mkdir quickly followed by the rename of a new directory child fails. # rm -Rf test && mkdir test && touch test/a && mv test/a test/b mv: rename test/a to test/b: No such file or directory # rm -Rf test && mkdir test && sleep 1 && touch test/a && mv test/a test/b (it works) Client log: [2012-05-19 18:49:43.933090] W [client3_1-fops.c:327:client3_1_mkdir_cbk] 0-pfs-client-0: remote operation failed: No such file or directory. Path: /test (00000000-0000-0000-0000-000000000000) [2012-05-19 18:49:43.944883] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.946265] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.961028] W [client3_1-fops.c:1595:client3_1_entrylk_cbk] 0-pfs-client-0: remote operation failed: No such file or directory [2012-05-19 18:49:43.961528] W [fuse-bridge.c:1515:fuse_rename_cbk] 0-glusterfs-fuse: 27: /test/a -> /test/b => -1 (No such file or directory) Server log: [2012-05-19 18:49:58.455280] E [posix-handle.c:412:posix_handle_mkdir_hashes] 0-pfs-posix: error mkdir hash-1 /export/wd3a/.glusterfs/f6/8b (No such file or directory) [2012-05-19 18:49:58.455384] W [posix-handle.c:521:posix_handle_soft] 0-pfs-posix: mkdir /export/wd3a/.glusterfs/f6/8b/f68b2a33-a649-4705-9dfd-40a15f22589a failed (No such file or directory) [2012-05-19 18:49:58.455425] E [posix.c:968:posix_mkdir] 0-pfs-posix: setting gfid on /export/wd3a/test failed [2012-05-19 18:49:58.455558] E [posix.c:1010:posix_mkdir] 0-pfs-posix: post-operation lstat on parent of /export/wd3a/test failed: No such file or directory [2012-05-19 18:49:58.455664] I [server3_1-fops.c:529:server_mkdir_cbk] 0-pfs-server: 41: MKDIR /test (00000000-0000-0000-0000-000000000000) ==> -1 (No such file or directory) [2012-05-19 18:49:58.467548] I [server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 46: ENTRYLK (null) (--) ==> -1 (No such file or directory) [2012-05-19 18:49:58.468990] I [server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 47: ENTRYLK (null) (--) ==> -1 (No such file or directory) [2012-05-19 18:49:58.483726] I 
[server3_1-fops.c:346:server_entrylk_cbk] 0-pfs-server: 51: ENTRYLK (null) (--) ==> -1 (No such file or directory) It says it fails, but it seems it succeeded: silo# getextattr -x trusted.gfid /export/wd3a/test /export/wd3a/test 000 f6 8b 2a 33 a6 49 47 05 9d fd 40 a1 5f 22 58 9a ..*3.IG...@._"X. Client is release-3.3 from yesterday. Server is master branch from May 14th. Is it a known problem? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 05:36:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 07:36:02 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / Message-ID: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> On release-3.3, glusterd_is_path_in_use() in xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has extended attributes enabled, and makes it impossible to create a volume with bricks from other filesystems (with EA enabled), if / does not support extended attributes. Is it on purpose? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 05:53:35 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 01:53:35 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> Message-ID: <1d6e3018-e614-4273-883c-1cca9efaf0b8@zmail01.collab.prod.int.phx2.redhat.com> Emmanuel, The assumption of EA being enabled in the / filesystem, or in any prefix of the brick path, is an accidental side-effect of the way glusterd_is_path_in_use() is used in glusterd_brick_create_path(). The error handling should accommodate ENOTSUP. In short, it is a bug. Will send out a patch immediately. thanks, krish ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 11:06:02 AM Subject: [Gluster-devel] 3.3 requires extended attribute on / On release-3.3, glusterd_is_path_in_use() in xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has extended attributes enabled, and makes it impossible to create a volume with bricks from other filesystems (with EA enabled), if / does not support extended attributes. Is it on purpose? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From manu at netbsd.org Sun May 20 05:56:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 07:56:53 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdu4r.aq5gouehux9cM%manu@netbsd.org> Message-ID: <1kkdul5.4vmrbe1owph67M%manu@netbsd.org> Emmanuel Dreyfus wrote: > On release-3.3, glusterd_is_path_in_use() in > xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has > extended attributes enabled, and makes it impossible to create a volume > with bricks from other filesystems (with EA enabled), if / does not > support extended attributes. And even with EA enabled on root, creating a volume loops forever reading nonexistent trusted.gfid and trusted.glusterfs.volume-id attributes on the brick's parent directory. It gets ENODATA and retries forever. If I patch the function to just set in_use = 0 and return 0, I can create a volume. 
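For the record, the local hack I used to get past it just stubs the check out. It is a debugging aid only, obviously not a proposed fix, and I am quoting the prototype from memory so the argument list may not match the tree exactly:

static int
glusterd_is_path_in_use (char *path, gf_boolean_t *in_use, char **op_errstr)
{
        /* short-circuit: pretend the path is never in use */
        if (in_use)
                *in_use = _gf_false;

        return 0;
}

With that in place volume create works again, which at least confirms the loop is inside this check.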
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ian.latter at midnightcode.org Sun May 20 06:12:39 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 16:12:39 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Hello, Couple of questions that might help make my module a little more sane; 0) Is there any developer docco? I've just done another quick search and I can't see any. Let me know if there is and I'll try and answer the below myself. 1) What is the difference between STACK_WIND and STACK_WIND_COOKIE? I.e. I've only ever used STACK_WIND, when should I use it versus the other? 2) Is there a way to write linearly within a single function within Gluster (or is there a reason why I wouldn't want to do that)? RE 2: This may stem from my lack of understanding of the broader Gluster internals. I am performing multiple fops per fop, which is creating structural inelegances in the code that make me think I'm heading down the wrong rabbit hole. I want to say; read() { // pull in other content while(want more) { _lookup() _open() _read() _close() } return iovec } But the way I've understood the Gluster internal structure is that I need to operate in a chain of related functions; _read_lookup_cbk_open_cbk_read_cbk() { wind _close() } _read_lookup_cbk_open_cbk() { wind _read() add to local->iovec } _lookup_cbk() { wind _open() } read() { while(want more) { wind _lookup() } return local->iovec } Am I missing something - or is there a nicer way of doing this? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From manu at netbsd.org Sun May 20 06:13:32 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 08:13:32 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1d6e3018-e614-4273-883c-1cca9efaf0b8@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kkdvl3.1p663u6iyul1oM%manu@netbsd.org> Krishnan Parthasarathi wrote: > Will send out a patch immediately. Great :-) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 06:13:33 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 08:13:33 +0200 Subject: [Gluster-devel] mkdir race condition In-Reply-To: <1kkcuhi.1vqgbxy1lxb8w2M%manu@netbsd.org> Message-ID: <1kkdvma.10s8o2rtrmcvpM%manu@netbsd.org> Emmanuel Dreyfus wrote: > On a replicated volume, mkdir quickly followed by the rename of a new > directory child fails. > > # rm -Rf test && mkdir test && touch test/a && mv test/a test/b > mv: rename test/a to test/b: No such file or directory > # rm -Rf test && mkdir test && sleep 1 && touch test/a && mv test/a test/b > (it works) I just reinstalled the server from release-3.3 and now things make more sense. Any directory creation reports a failure but actually succeeds: bacasel# mkdir /gfs/manu mkdir: /gfs/manu: No such file or directory bacasel# cd /gfs bacasel# ls manu Server log reports it fails because: [2012-05-20 07:59:23.775789] E [posix-handle.c:412:posix_handle_mkdir_hashes] 0-pfs-posix: error mkdir hash-1 /export/wd3a/.glusterfs/ec/e2 (No such file or directory) It seems posix_handle_mkdir_hashes() attempts to mkdir two directories at once: ec/e2. How is it supposed to work? Should the parent directory be created somewhere else? 
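To illustrate what I mean, creating the two hash levels one at a time, each call tolerating EEXIST, would look something like the sketch below. This is only a standalone illustration of the idea, not a patch against posix-handle.c:

#include <sys/stat.h>
#include <errno.h>
#include <stdio.h>
#include <limits.h>

/* create <base>/<h1> and then <base>/<h1>/<h2>,
   e.g. .glusterfs/ec and then .glusterfs/ec/e2 */
static int
make_hash_dirs (const char *base, const char *h1, const char *h2)
{
        char path[PATH_MAX];

        snprintf (path, sizeof (path), "%s/%s", base, h1);
        if (mkdir (path, 0700) != 0 && errno != EEXIST)
                return -1;

        snprintf (path, sizeof (path), "%s/%s/%s", base, h1, h2);
        if (mkdir (path, 0700) != 0 && errno != EEXIST)
                return -1;

        return 0;
}

A single mkdir() of .../ec/e2 fails with ENOENT as long as .../ec does not exist yet, which is consistent with the error in the log above.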
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 06:36:44 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 02:36:44 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kkdul5.4vmrbe1owph67M%manu@netbsd.org> Message-ID: <80c2c170-133e-4509-9ac5-062293a199ad@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- From: "Emmanuel Dreyfus" To: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 11:26:53 AM Subject: Re: [Gluster-devel] 3.3 requires extended attribute on / Emmanuel Dreyfus wrote: > On release-3.3, glusterd_is_path_in_use() in > xlators/mgmt/glusterd/src/glusterd-utils.c seems to assume that / has > extended attribute enabled, and makes it impossible to create a volume > with bricks from other filesystems (with EA enabled), if / does not > support extended attributes. > And even with EA enabled on root, creating a volume loops forever on > reading unexistant trusted.gfid and trusted.glusterfs.volume-id on > brick's parent directory. It gets ENODATA and retry forever. If I patch > the function to just set in_use = 0 and return 0, I can create a volume. It is strange that the you see glusterd_path_in_use() loop forever. If I am not wrong, the inner loop checks for presence of trusted.gfid and trusted.glusterfs.volume-id and should exit after that, and the outer loop performs dirname on the path repeatedly and dirname(3) guarantees such an operation should return "/" eventually, which we check. It would be great if you could provide values of local variables, "used" and "curdir" when you see the looping forever. I dont have a setup to check this immediately. thanks, krish -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Sun May 20 06:47:57 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 16:47:57 +1000 Subject: [Gluster-devel] ZkFarmer Message-ID: <201205200647.q4K6lvdN009529@singularity.tronunltd.com> > > And I am sick of the word-wrap on this client .. I think > > you've finally convinced me to fix it ... what's normal > > these days - still 80 chars? > > I used to line-wrap (gnus and cool emacs extensions). It doesn't make > sense to line wrap any more. Let the email client handle it depending > on the screen size of the device (mobile / tablet / desktop). FYI found this; an hour of code parsing in the mail software and it turns out that it had no wrapping .. it came from the stupid textarea tag in the browser (wrap="hard"). Same principle (server side coded, non client savvy) - now set to "soft". So hopefully fixed :) Cheers. -- Ian Latter Late night coder .. http://midnightcode.org/ From kparthas at redhat.com Sun May 20 06:54:54 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 02:54:54 -0400 (EDT) Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Message-ID: Couple of questions that might help make my module a little more sane; 0) Is there any developer docco? I've just done another quick search and I can't see any. Let me know if there is and I'll try and answer the below myself. 1) What is the difference between STACK_WIND and STACK_WIND_COOKIE? I.e. 
I've only ever used STACK_WIND, when should I use it versus the other? STACK_WIND_COOKIE is used when we need to 'tie' the call wound with its corresponding callback. You can see this variant being used extensively in cluster xlators where it is used to identify the callback with the subvolume no. it is coming from. 2) Is there a way to write linearly within a single function within Gluster (or is there a reason why I wouldn't want to do that)? RE 2: This may stem from my lack of understanding of the broader Gluster internals. I am performing multiple fops per fop, which is creating structural inelegances in the code that make me think I'm heading down the wrong rabbit hole. I want to say; read() { // pull in other content while(want more) { _lookup() _open() _read() _close() } return iovec } But the way I've understood the Gluster internal structure is that I need to operate in a chain of related functions; _read_lookup_cbk_open_cbk_read_cbk() { wind _close() } _read_lookup_cbk_open_cbk() { wind _read() add to local->iovec } _lookup_cbk() { wind _open() } read() { while(want more) { wind _lookup() } return local->iovec } Am I missing something - or is there a nicer way of doing this? The above method you are trying to use is the "continuation passing style" that is extensively used in afr-inode-read.c and afr-transaction.c to perform multiple internal fops on the trigger of a single fop from the application. cluster/afr may give you some ideas on how you could structure it if you like that more. The other method I can think of (not sure if it would suit your needs) is to use the syncop framework (see libglusterfs/src/syncop.c). This allows one to make a 'synchronous' glusterfs fop. inside a xlator. The downside is that you can only make one call at a time. This may not be acceptable for cluster xlators (ie, xlator with more than one child xlator). Hope that helps, krish _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From ian.latter at midnightcode.org Sun May 20 07:23:12 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 17:23:12 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200723.q4K7NCO3009706@singularity.tronunltd.com> > 1) What is the difference between STACK_WIND > and STACK_WIND_COOKIE? I.e. I've only > ever used STACK_WIND, when should I use > it versus the other? > > STACK_WIND_COOKIE is used when we need to 'tie' the call > wound with its corresponding callback. You can see this > variant being used extensively in cluster xlators where it > is used to identify the callback with the subvolume no. it > is coming from. Ok - thanks. I will take a closer look at the examples for this .. this may help me ... > 2) Is there a way to write linearly within a single > function within Gluster (or is there a reason > why I wouldn't want to do that)? > > > RE 2: > > This may stem from my lack of understanding > of the broader Gluster internals. I am performing > multiple fops per fop, which is creating structural > inelegances in the code that make me think I'm > heading down the wrong rabbit hole. 
I want to > say; > > read() { > // pull in other content > while(want more) { > _lookup() > _open() > _read() > _close() > } > return iovec > } > > > But the way I've understood the Gluster internal > structure is that I need to operate in a chain of > related functions; > > _read_lookup_cbk_open_cbk_read_cbk() { > wind _close() > } > > _read_lookup_cbk_open_cbk() { > wind _read() > add to local->iovec > } > > _lookup_cbk() { > wind _open() > } > > read() { > while(want more) { > wind _lookup() > } > return local->iovec > } > > > > Am I missing something - or is there a nicer way of > doing this? > > The above method you are trying to use is the "continuation passing style" that > is extensively used in afr-inode-read.c and afr-transaction.c to perform multiple > internal fops on the trigger of a single fop from the application. cluster/afr may > give you some ideas on how you could structure it if you like that more. These may have been where I got that code style from originally .. I will go back to these two programs, thanks for the reference. I'm currently working my way through the afr-heal programs .. > The other method I can think of (not sure if it would suit your needs) > is to use the syncop framework (see libglusterfs/src/syncop.c). > This allows one to make a 'synchronous' glusterfs fop. inside a xlator. > The downside is that you can only make one call at a time. This may not > be acceptable for cluster xlators (ie, xlator with more than one child xlator). In the syncop framework, how much gets affected when I use it in my xlator. Does it mean that there's only one call at a time in the whole xlator (so the current write will stop all other reads) or is the scope only the fop (so that within this write, my child->fops are serial, but neighbouring reads on my xlator will continue in other threads)? And does that then restrict what can go above and below my xlator? I mean that my xlator isn't a cluster xlator but I would like it to be able to be used on top of (or underneath) a cluster xlator, will that no longer be possible? > Hope that helps, > krish Thanks Krish, every bit helps! -- Ian Latter Late night coder .. http://midnightcode.org/ From ian.latter at midnightcode.org Sun May 20 07:40:54 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Sun, 20 May 2012 17:40:54 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205200740.q4K7esfl009777@singularity.tronunltd.com> > > The other method I can think of (not sure if it would suit your needs) > > is to use the syncop framework (see libglusterfs/src/syncop.c). > > This allows one to make a 'synchronous' glusterfs fop. inside a xlator. > > The downside is that you can only make one call at a time. This may not > > be acceptable for cluster xlators (ie, xlator with more than one child xlator). > > In the syncop framework, how much gets affected when I > use it in my xlator. Does it mean that there's only one call > at a time in the whole xlator (so the current write will stop > all other reads) or is the scope only the fop (so that within > this write, my child->fops are serial, but neighbouring reads > on my xlator will continue in other threads)? And does that > then restrict what can go above and below my xlator? I > mean that my xlator isn't a cluster xlator but I would like it > to be able to be used on top of (or underneath) a cluster > xlator, will that no longer be possible? 
> I've just taken a look at xlators/cluster/afr/src/pump.c for some syncop usage examples and I really like what I see there. If syncop only serialises/syncs activity that I code within a given fop of my xlator and doesn't impose serial/ sync limits on the parents or children of my xlator then this looks like the right path. I want to be sure that it won't result in a globally syncronous outcome though (like ignoring a cache xlator under mine to get a true disk read) - I just need the internals of my calls to be linear. -- Ian Latter Late night coder .. http://midnightcode.org/ From manu at netbsd.org Sun May 20 08:11:04 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:11:04 +0200 Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <80c2c170-133e-4509-9ac5-062293a199ad@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <1kke141.ffp9fr1meqkgbM%manu@netbsd.org> Krishnan Parthasarathi wrote: > It is strange that the you see glusterd_path_in_use() loop forever. If I > am not wrong, the inner loop checks for presence of trusted.gfid and > trusted.glusterfs.volume-id and should exit after that, and the outer loop > performs dirname on the path repeatedly and dirname(3) guarantees such an > operation should return "/" eventually, which we check. Here is my setup when I tried that: / with EA enabled /export/wd3a ibrick with EA enabled But I may have been testing with an untintended patch in glusterd_path_in_use(). I will retry with the right fix once it will be available. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 08:30:53 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:30:53 +0200 Subject: [Gluster-devel] mkdir race condition In-Reply-To: <1kkdvma.10s8o2rtrmcvpM%manu@netbsd.org> Message-ID: <1kke28c.rugeav1w049sdM%manu@netbsd.org> Emmanuel Dreyfus wrote: > It seems posix_handle_mkdir_hashes() attempts to mkdir two directories > at once: ec/ec2. How is it supposed to work? Should parent directory be > created somewhere else? This fixes the problem. Any comment? --- xlators/storage/posix/src/posix-handle.c.orig +++ xlators/storage/posix/src/posix-handle.c @@ -405,8 +405,16 @@ parpath = dirname (duppath); parpath = dirname (duppath); ret = mkdir (parpath, 0700); + if (ret == -1 && errno == ENOENT) { + char *tmppath = NULL; + + tmppath = strdupa(parpath); + ret = mkdir (dirname (tmppath), 0700); + if (ret == 0) + ret = mkdir (parpath, 0700); + } if (ret == -1 && errno != EEXIST) { gf_log (this->name, GF_LOG_ERROR, "error mkdir hash-1 %s (%s)", parpath, strerror (errno)); -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 20 08:47:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 10:47:02 +0200 Subject: [Gluster-devel] rename(2) race condition Message-ID: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> After I patched to fix the mkdir issue, I now encounter a race in rename(2). 
Most of the time it works, but sometimes: 3548 1 tar CALL open(0xbb9010e0,0xa02,0x180) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET open 8 3548 1 tar CALL __fstat50(8,0xbfbfe69c) 3548 1 tar RET __fstat50 0 3548 1 tar CALL write(8,0x8067880,0x16) 3548 1 tar GIO fd 8 wrote 22 bytes "Nnetbsd-5-1-2-RELEASE\n" 3548 1 tar RET write 22/0x16 3548 1 tar CALL close(8) 3548 1 tar RET close 0 3548 1 tar CALL lchmod(0xbb9010e0,0x1a4) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET lchmod 0 3548 1 tar CALL __lutimes50(0xbb9010e0,0xbfbfe6d8) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET __lutimes50 0 3548 1 tar CALL rename(0xbb9010e0,0x8071584) 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" 3548 1 tar RET rename -1 errno 13 Permission denied I can reproduce it with the command below. It runs fine for a few seconds and then hits permission denied. It needs a level of hierarchy to exhibit the behavior: just install a b will not fail. mkdir test && echo "xxx" > test/a while [ 1 ] ; do rm -f test/b && install test/a test/b ; done -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From mihai at patchlog.com Sun May 20 09:19:34 2012 From: mihai at patchlog.com (Mihai Secasiu) Date: Sun, 20 May 2012 12:19:34 +0300 Subject: [Gluster-devel] glusterfs on MacOSX Message-ID: <4FB8B726.10500@patchlog.com> Hello, I am trying to get glusterfs ( 3.2.6, server ) to work on MacOSX ( Lion - I think , darwin kernel 11.3 ). So far I've been able to make it compile with a few patches and --disable-fuse-client. I want to create a volume on a MacMini that will be a replica of another volume stored on a linux server in a different location. The volume stored on the MacMini would also have to be mounted on the MacMini. Since the fuse client is broken because it's built to use macfuse and that doesn't work anymore on the latest MacOSX I want to mount the volume over nfs and I've been able to do that ( with a small patch to the xdr code ) but it's really really slow. It's so slow that mounting the volume through a remote node is a lot faster. Also mounting the same volume on a remote node is fast so the problem is definitely in the nfs server on the MacOSX. I did a strace ( dtruss ) on it and it seems like it's doing a lot of polling. Could this be the cause of the slowness ? If anyone wants to try this you can fetch it from https://github.com/mihaisecasiu/glusterfs/tree/release-3.2 Thanks From manu at netbsd.org Sun May 20 12:43:52 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 20 May 2012 14:43:52 +0200 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> Message-ID: <1kkee8d.8hdhfs177z5zdM%manu@netbsd.org> Emmanuel Dreyfus wrote: > After I patched to fix the mkdir issue, I now encounter a race in > rename(2). Most of the time it works, but sometimes: And the problem only happens when running as an unprivileged user. It works fine for root. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From kparthas at redhat.com Sun May 20 14:14:10 2012 From: kparthas at redhat.com (Krishnan Parthasarathi) Date: Sun, 20 May 2012 10:14:10 -0400 (EDT) Subject: [Gluster-devel] 3.3 requires extended attribute on / In-Reply-To: <1kke141.ffp9fr1meqkgbM%manu@netbsd.org> Message-ID: Emmanuel, I have submitted the fix for review: http://review.gluster.com/3380 I have not tested the fix with "/" having EA disabled. It would be great if you could confirm the looping forever doesn't happen with this fix.
thanks, krish ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Krishnan Parthasarathi" Cc: gluster-devel at nongnu.org Sent: Sunday, May 20, 2012 1:41:04 PM Subject: Re: [Gluster-devel] 3.3 requires extended attribute on / Krishnan Parthasarathi wrote: > It is strange that the you see glusterd_path_in_use() loop forever. If I > am not wrong, the inner loop checks for presence of trusted.gfid and > trusted.glusterfs.volume-id and should exit after that, and the outer loop > performs dirname on the path repeatedly and dirname(3) guarantees such an > operation should return "/" eventually, which we check. Here is my setup when I tried that: / with EA enabled /export/wd3a ibrick with EA enabled But I may have been testing with an untintended patch in glusterd_path_in_use(). I will retry with the right fix once it will be available. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 21 04:51:59 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 06:51:59 +0200 Subject: [Gluster-devel] zero'ed local data in mdc_setattr_cbk Message-ID: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> Hi Here is a backtrace for a SIGSEGV in md-cache code. Note inode = NULL when mdc_inode_iatt_set() is called by mdc_setattr_cbk(). It seems that local got corupted in the later function #0 0xbbb3a7c9 in pthread_spin_lock () from /usr/lib/libpthread.so.1 #1 0xbaa09d8c in mdc_inode_prep (this=0xba3e5000, inode=0x0) at md-cache.c:267 #2 0xbaa0a1bf in mdc_inode_iatt_set (this=0xba3e5000, inode=0x0, iatt=0xb9401d40) at md-cache.c:384 #3 0xbaa0ee16 in mdc_setattr_cbk (frame=0xbb77f400, cookie=0xbb77f470, this=0xba3e5000, op_ret=0, op_errno=0, prebuf=0xb9401cd8, postbuf=0xb9401d40, xdata=0x0) at md-cache.c:1423 #4 0xbaa1d0ec in qr_fsetattr_cbk () from /usr/local/lib/glusterfs/3.3git/xlator/performance/quick-read.so #5 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f0f0, cookie=0xbb77f160, this=0xba3e3000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #6 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f160, cookie=0xbb77f1d0, this=0xba3e2000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #7 0xbbb8ac72 in default_fsetattr_cbk (frame=0xbb77f1d0, cookie=0xbb77f240, this=0xba3e1000, op_ret=0, op_errno=0, statpre=0xb9401cd8, statpost=0xb9401d40, xdata=0x0) at defaults.c:452 #8 0xb9aa9d23 in afr_fsetattr_unwind (frame=0xba801ee8, this=0xba3d1000) at afr-inode-write.c:1160 #9 0xb9aa9f01 in afr_fsetattr_wind_cbk (frame=0xba801ee8, cookie=0x0, this=0xba3d1000, op_ret=0, op_errno=0, preop=0xbfbfe880, postop=0xbfbfe818, xdata=0x0) at afr-inode-write.c:1221 #10 0xbaa6a099 in client3_1_fsetattr_cbk (req=0xb90010d8, iov=0xb90010f8, count=1, myframe=0xbb77f010) at client3_1-fops.c:1897 #11 0xbbb6975e in rpc_clnt_handle_reply (clnt=0xba3c5270, pollin=0xbb77d220) at rpc-clnt.c:788 #12 0xbbb699fb in rpc_clnt_notify (trans=0xbb70f000, mydata=0xba3c5290, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-clnt.c:907 #13 0xbbb659c7 in rpc_transport_notify (this=0xbb70f000, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xbb77d220) at rpc-transport.c:489 #14 0xbaa9327e in socket_event_poll_in () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #15 0xbaa937f5 in socket_event_handler () from /usr/local/lib/glusterfs/3.3git/rpc-transport/socket.so #16 0xbbbb281f in event_dispatch_poll_handler (event_pool=0xbb73b080, ufds=0xbb77e6a0, i=2) at event.c:357 #17 0xbbbb2a8b in 
event_dispatch_poll (event_pool=0xbb73b080) at event.c:437 #18 0xbbbb2db7 in event_dispatch (event_pool=0xbb73b080) at event.c:947 #19 0x0805015e in main () (gdb) frame 3 #3 0xbaa0ee16 in mdc_setattr_cbk (frame=0xbb77f400, cookie=0xbb77f470, this=0xba3e5000, op_ret=0, op_errno=0, prebuf=0xb9401cd8, postbuf=0xb9401d40, xdata=0x0) at md-cache.c:1423 1423 mdc_inode_iatt_set (this, local->loc.inode, postbuf); (gdb) print *local $2 = {loc = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, loc2 = {path = 0x0, name = 0x0, inode = 0x0, parent = 0x0, gfid = '\000' , pargfid = '\000' }, fd = 0xb8f9d054, linkname = 0x0, xattr = 0x0} -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 21 10:14:24 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 10:14:24 +0000 Subject: [Gluster-devel] zero'ed local data in mdc_setattr_cbk In-Reply-To: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> References: <1kkfmt2.2blugrtqcea2M%manu@netbsd.org> Message-ID: <20120521101424.GA10504@homeworld.netbsd.org> On Mon, May 21, 2012 at 06:51:59AM +0200, Emmanuel Dreyfus wrote: > Here is a backtrace for a SIGSEGV in md-cache code. Note inode = NULL > when mdc_inode_iatt_set() is called by mdc_setattr_cbk(). I submitted a patch to fix it, please review http://review.gluster.com/3383 -- Emmanuel Dreyfus manu at netbsd.org From jdarcy at redhat.com Mon May 21 12:24:30 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Mon, 21 May 2012 08:24:30 -0400 Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> References: <201205200612.q4K6CdvW009139@singularity.tronunltd.com> Message-ID: <4FBA33FE.3050602@redhat.com> On 05/20/2012 02:12 AM, Ian Latter wrote: > Hello, > > > Couple of questions that might help make my > module a little more sane; > > 0) Is there any developer docco? I've just done > another quick search and I can't see any. Let > me know if there is and I'll try and answer the > below myself. Your best bet right now (if I may say so) is the stuff I've posted on hekafs.org - the "Translator 101" articles plus the API overview at http://hekafs.org/dist/xlator_api_2.html > 1) What is the difference between STACK_WIND > and STACK_WIND_COOKIE? I.e. I've only > ever used STACK_WIND, when should I use > it versus the other? I see Krishnan has already covered this. > 2) Is there a way to write linearly within a single > function within Gluster (or is there a reason > why I wouldn't want to do that)? Any blocking ops would have to be built on top of async ops plus semaphores etc. because (unlike e.g. an HTTP server) the underlying sockets etc. are shared/multiplexed between users and activities. Thus you'd get much more context switching that way than if you stay within the async/continuation style. Some day in the distant future, I'd like to work some more on a preprocessor that turns linear code into async code so that it's easier to write but retains the performance and resource-efficiency advantages of an essentially async style. I did some work (http://pl.atyp.us/ripper/UserGuide.html) in this area several years ago, but it has probably bit-rotted to hell since then. With more recent versions of gcc and LLVM it should be possible to overcome some of the limitations that version had. 
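For illustration, a generic sketch of the "blocking built on top of async ops plus semaphores" pattern in plain pthreads (sync_waiter, blocking_call and start_async are made-up names, not GlusterFS code): each call parks the calling thread until the completion callback fires from the event thread, and that extra park/wakeup per call is where the additional context switching comes from.

#include <pthread.h>

struct sync_waiter {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             done;
        int             op_ret;
};

/* completion callback, invoked from the event/poll thread */
static void
waiter_complete(struct sync_waiter *w, int op_ret)
{
        pthread_mutex_lock(&w->lock);
        w->op_ret = op_ret;
        w->done = 1;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->lock);
}

/* blocking wrapper: start_async() stands in for winding the fop and
 * must eventually cause waiter_complete() to be called */
static int
blocking_call(void (*start_async)(struct sync_waiter *))
{
        struct sync_waiter w;
        int ret;

        pthread_mutex_init(&w.lock, NULL);
        pthread_cond_init(&w.cond, NULL);
        w.done = 0;
        w.op_ret = 0;

        start_async(&w);

        pthread_mutex_lock(&w.lock);
        while (!w.done)
                pthread_cond_wait(&w.cond, &w.lock);
        pthread_mutex_unlock(&w.lock);

        ret = w.op_ret;
        pthread_cond_destroy(&w.cond);
        pthread_mutex_destroy(&w.lock);

        return ret;
}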
From manu at netbsd.org Mon May 21 16:27:21 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 21 May 2012 18:27:21 +0200 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> Message-ID: <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> Emmanuel Dreyfus wrote: > 3548 1 tar CALL rename(0xbb9010e0,0x8071584) > 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" > 3548 1 tar RET rename -1 errno 13 Permission denied I tracked this down to FUSE LOOKUP operations that do not set fuse_entry's attr.uid correctly (it is left set to 0). Here is the summary of my findings so far: - as an unprivileged user, I create and delete files like crazy - most of the time everything is fine - sometimes a LOOKUP for a file I created (as an unprivileged user) will return a fuse_entry with uid set to 0, which causes the kernel to raise EACCES when I try to delete the file. Here is an example of a FUSE trace, produced by the test case while [ 1 ] ; do cp /etc/fstab test/foo1 ; rm test/foo1 ; done > unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1) < unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1), error = -2 > unique = 1436, nodeid = 3098542296, opcode = CREATE (35) < unique = 1436, nodeid = 3098542296, opcode = CREATE (35), error = 0 > unique = 1437, nodeid = 3098542396, opcode = SETATTR (4) < unique = 1437, nodeid = 3098542396, opcode = SETATTR (4), error = 0 > unique = 1438, nodeid = 3098542396, opcode = WRITE (16) < unique = 1438, nodeid = 3098542396, opcode = WRITE (16), error = 0 > unique = 1439, nodeid = 3098542396, opcode = FSYNC (20) < unique = 1439, nodeid = 3098542396, opcode = FSYNC (20), error = 0 > unique = 1440, nodeid = 3098542396, opcode = RELEASE (18) < unique = 1440, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > unique = 1441, nodeid = 3098542396, opcode = GETATTR (3) < unique = 1441, nodeid = 3098542396, opcode = GETATTR (3), error = 0 > unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1) < unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 --> here I sometimes get fuse_entry's attr.uid incorrectly set to 0 --> When this happens, LOOKUP fails and returns EACCES. > unique = 1443, nodeid = 3098542296, opcode = UNLINK (10) < unique = 1443, nodeid = 3098542296, opcode = UNLINK (10), error = 0 > unique = 1444, nodeid = 3098542396, opcode = FORGET (2) Is it possible that metadata writes are now so asynchronous that a subsequent lookup cannot retrieve the up-to-date value? If that is the problem, how can I fix it? There is nothing telling the FUSE implementation that a CREATE or SETATTR has just partially completed and has metadata pending. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From ian.latter at midnightcode.org Mon May 21 23:02:44 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 09:02:44 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205212302.q4LN2idg017478@singularity.tronunltd.com> > > 0) Is there any developer docco? I've just done > > another quick search and I can't see any. Let > > me know if there is and I'll try and answer the > > below myself. > > Your best bet right now (if I may say so) is the stuff I've posted on > hekafs.org - the "Translator 101" articles plus the API overview at > > http://hekafs.org/dist/xlator_api_2.html You must say so - there is so little docco.
Actually before I posted I went and re-read your Translator 101 docs as you referred them to me on 10 May, but I hadn't found your API overview - thanks (for both)! > > 2) Is there a way to write linearly within a single > > function within Gluster (or is there a reason > > why I wouldn't want to do that)? > > Any blocking ops would have to be built on top of async ops plus semaphores > etc. because (unlike e.g. an HTTP server) the underlying sockets etc. are > shared/multiplexed between users and activities. Thus you'd get much more > context switching that way than if you stay within the async/continuation style. Interesting - I haven't ever done semaphore coding, but it may not be needed. The syncop framework that Krish referred too seems to do this via a mutex lock (synctask_yawn) and a context switch (synctask_yield). What's the drawback with increased context switching? After my email thread with Krish I decided against syncop, but the flow without was going to be horrific. The only way I could bring it back to anything even half as sane as the afr code (which can cleverly loop through its own _cbk's recursively - I like that, whoever put that together) was to have the last cbk in a chain (say the "close_cbk") call the original function with an index or stepper increment. But after sitting on the idea for a couple of days I actually came to the same conclusion as Manu did in the last message. I.e. without docco I have been writing to what seems to work, and in my 2009 code (I saw last night) a "mkdir" wind followed by "create" code in the same function - which I believe, now, is probably a race condition (because of the threaded/async structure forced through the wind/call macro model). In that case I *do* want a synchronous write - but only within my xlator (which, if I'm reading this right, *is* what syncop does) - as opposed to an end-to-end synchronous write (being sync'd through the full stack of xlators: ignoring caching, waiting for replication to be validated, etc). Although, the same synchronous outcome comes from the chained async calls ... but then we get back to the readability/ fixability of the code. > Some day in the distant future, I'd like to work some more on a preprocessor > that turns linear code into async code so that it's easier to write but retains > the performance and resource-efficiency advantages of an essentially async > style. I did some work (http://pl.atyp.us/ripper/UserGuide.html) in this area > several years ago, but it has probably bit-rotted to hell since then. With > more recent versions of gcc and LLVM it should be possible to overcome some of > the limitations that version had. Yes, I had a very similar thought - a C pre-processor isn't in my experience or time scale though; I considered writing up a script that would chain it out in C for me. I was going to borrow from a script that I wrote which builds one of the libMidnightCode header files but even that seemed impractical .. would anyone be able to debug it? Would I even understand in 2yrs from now - lol So I think the long and the short of it is that anything I do here won't be pretty .. or perhaps: one will look pretty and the other will run pretty :) -- Ian Latter Late night coder .. 
http://midnightcode.org/ From anand.avati at gmail.com Mon May 21 23:59:07 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 16:59:07 -0700 Subject: [Gluster-devel] rename(2) race condition In-Reply-To: <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> References: <1kke2xl.17aqgj1oar475M%manu@netbsd.org> <1kkgieh.28gd5j1k9erqhM%manu@netbsd.org> Message-ID: Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the chown() or chmod() syscall issued by the application strictly block till GlusterFS's fuse_setattr_cbk() is called? Avati On Mon, May 21, 2012 at 9:27 AM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > > > 3548 1 tar CALL rename(0xbb9010e0,0x8071584) > > 3548 1 tar NAMI "usr/src/gnu/CVS/Tag.03548f" > > 3548 1 tar RET rename -1 errno 13 Permission denied > > I tracked this down to FUSE LOOKUP operation that do not set > fuse_entry's attr.uid correctly (it is left set to 0). > > Here is the summary of my findings so far: > - as un unprivilegied user, I create and delete files like crazy > - most of the time everything is fine > - sometime a LOOKUP for a file I created (as an unprivilegied user) will > return a fuse_entry with uid set to 0, which cause the kernel to raise > EACCESS when I try to delete the file. > > Here is an example of a FUSE trace, produced by the test case > while [ 1 ] ; do cp /etc/fstab test/foo1 ; rm test/foo1 ; done > > > unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 1435, nodeid = 3098542296, opcode = LOOKUP (1), error = -2 > > unique = 1436, nodeid = 3098542296, opcode = CREATE (35) > < unique = 1436, nodeid = 3098542296, opcode = CREATE (35), error = 0 > > unique = 1437, nodeid = 3098542396, opcode = SETATTR (4) > < unique = 1437, nodeid = 3098542396, opcode = SETATTR (4), error = 0 > > unique = 1438, nodeid = 3098542396, opcode = WRITE (16) > < unique = 1438, nodeid = 3098542396, opcode = WRITE (16), error = 0 > > unique = 1439, nodeid = 3098542396, opcode = FSYNC (20) > < unique = 1439, nodeid = 3098542396, opcode = FSYNC (20), error = 0 > > unique = 1440, nodeid = 3098542396, opcode = RELEASE (18) > < unique = 1440, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > > unique = 1441, nodeid = 3098542396, opcode = GETATTR (3) > < unique = 1441, nodeid = 3098542396, opcode = GETATTR (3), error = 0 > > unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 1442, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 > > --> here I sometimes get fuse_entry's attr.uid incorrectly set to 0 > --> When this happens, LOOKUP fails and returns EACCESS. > > > unique = 1443, nodeid = 3098542296, opcode = UNLINK (10) > < unique = 1443, nodeid = 3098542296, opcode = UNLINK (10), error = 0 > > unique = 1444, nodeid = 3098542396, opcode = FORGET (2) > > > Is it possible that metadata writes are now so asynchronous that a > subsequent lookup cannot retreive the up to date value? If that is the > problem, how can I fix it? There is nothing telling the FUSE > implementation that a CREATE or SETATTR has just partially completed and > has metadata pending. > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From anand.avati at gmail.com Tue May 22 00:11:47 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 17:11:47 -0700 Subject: [Gluster-devel] A healing translator In-Reply-To: <4FA8E8AB.2040604@datalab.es> References: <4FA8E8AB.2040604@datalab.es> Message-ID: On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez wrote: > Hello developers, > > I would like to expose some ideas we are working on to create a new kind > of translator that should be able to unify and simplify to some extent the > healing procedures of complex translators. > > Currently, the only translator with complex healing capabilities that we > are aware of is AFR. We are developing another translator that will also > need healing capabilities, so we thought that it would be interesting to > create a new translator able to handle the common part of the healing > process and hence to simplify and avoid duplicated code in other > translators. > > The basic idea of the new translator is to handle healing tasks nearer the > storage translator on the server nodes instead to control everything from a > translator on the client nodes. Of course the heal translator is not able > to handle healing entirely by itself, it needs a client translator which > will coordinate all tasks. The heal translator is intended to be used by > translators that work with multiple subvolumes. > > I will try to explain how it works without entering into too much details. > > There is an important requisite for all client translators that use > healing: they must have exactly the same list of subvolumes and in the same > order. Currently, I think this is not a problem. > > The heal translator treats each file as an independent entity, and each > one can be in 3 modes: > > 1. Normal mode > > This is the normal mode for a copy or fragment of a file when it is > synchronized and consistent with the same file on other nodes (for example > with other replicas. It is the client translator who decides if it is > synchronized or not). > > 2. Healing mode > > This is the mode used when a client detects an inconsistency in the copy > or fragment of the file stored on this node and initiates the healing > procedures. > > 3. Provider mode (I don't like very much this name, though) > > This is the mode used by client translators when an inconsistency is > detected in this file, but the copy or fragment stored in this node is > considered good and it will be used as a source to repair the contents of > this file on other nodes. > > Initially, when a file is created, it is set in normal mode. Client > translators that make changes must guarantee that they send the > modification requests in the same order to all the servers. This should be > done using inodelk/entrylk. > > When a change is sent to a server, the client must include a bitmap mask > of the clients to which the request is being sent. Normally this is a > bitmap containing all the clients, however, when a server fails for some > reason some bits will be cleared. The heal translator uses this bitmap to > early detect failures on other nodes from the point of view of each client. > When this condition is detected, the request is aborted with an error and > the client is notified with the remaining list of valid nodes. If the > client considers the request can be successfully server with the remaining > list of nodes, it can resend the request with the updated bitmap. 
> > The heal translator also updates two file attributes for each change > request to mantain the "version" of the data and metadata contents of the > file. A similar task is currently made by AFR using xattrop. This would not > be needed anymore, speeding write requests. > > The version of data and metadata is returned to the client for each read > request, allowing it to detect inconsistent data. > > When a client detects an inconsistency, it initiates healing. First of > all, it must lock the entry and inode (when necessary). Then, from the data > collected from each node, it must decide which nodes have good data and > which ones have bad data and hence need to be healed. There are two > possible cases: > > 1. File is not a regular file > > In this case the reconstruction is very fast and requires few requests, so > it is done while the file is locked. In this case, the heal translator does > nothing relevant. > > 2. File is a regular file > > For regular files, the first step is to synchronize the metadata to the > bad nodes, including the version information. Once this is done, the file > is set in healing mode on bad nodes, and provider mode on good nodes. Then > the entry and inode are unlocked. > > When a file is in provider mode, it works as in normal mode, but refuses > to start another healing. Only one client can be healing a file. > > When a file is in healing mode, each normal write request from any client > are handled as if the file were in normal mode, updating the version > information and detecting possible inconsistencies with the bitmap. > Additionally, the healing translator marks the written region of the file > as "good". > > Each write request from the healing client intended to repair the file > must be marked with a special flag. In this case, the area that wants to be > written is filtered by the list of "good" ranges (if there are any > intersection with a good range, it is removed from the request). The > resulting set of ranges are propagated to the lower translator and added to > the list of "good" ranges but the version information is not updated. > > Read requests are only served if the range requested is entirely contained > into the "good" regions list. > > There are some additional details, but I think this is enough to have a > general idea of its purpose and how it works. > > The main advantages of this translator are: > > 1. Avoid duplicated code in client translators > 2. Simplify and unify healing methods in client translators > 3. xattrop is not needed anymore in client translators to keep track of > changes > 4. Full file contents are repaired without locking the file > 5. Better detection and prevention of some split brain situations as soon > as possible > > I think it would be very useful. It seems to me that it works correctly in > all situations, however I don't have all the experience that other > developers have with the healing functions of AFR, so I will be happy to > answer any question or suggestion to solve problems it may have or to > improve it. > > What do you think about it ? > > The goals you state above are all valid. What would really help (adoption) is if you can implement this as a modification of AFR by utilizing all the work already done, and you get brownie points if it is backward compatible with existing AFR. If you already have any code in a publishable state, please share it with us (github link?). Avati -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ian.latter at midnightcode.org Tue May 22 00:40:03 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 10:40:03 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> Actually, while we're at this level I'd like to bolt on another thought / query - these were my words; > But after sitting on the idea for a couple of days I actually came > to the same conclusion as Manu did in the last message. I.e. > without docco I have been writing to what seems to work, and > in my 2009 code (I saw last night) a "mkdir" wind followed by "create" > code in the same function - which I believe, now, is probably a > race condition (because of the threaded/async structure forced > through the wind/call macro model). But they include an assumption. The query is: are async writes and reads sequential? The two specific cases are; 1) Are all reads that are initiated in time after a write guaranteed to occur after that write has taken affect? 2) Are all writes that are initiated in time after a write (x) guaranteed to occur after that write (x) has taken affect? I could also appreciate that there may be a difference between the top/user layer view and the xlator internals .. if there is then can you please include that view in the explanation? Cheers, -- Ian Latter Late night coder .. http://midnightcode.org/ From anand.avati at gmail.com Tue May 22 01:27:41 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 18:27:41 -0700 Subject: [Gluster-devel] Gluster internals In-Reply-To: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> References: <201205220040.q4M0e3ah017846@singularity.tronunltd.com> Message-ID: On Mon, May 21, 2012 at 5:40 PM, Ian Latter wrote: > > But they include an assumption. > > The query is: are async writes and reads sequential? The > two specific cases are; > > 1) Are all reads that are initiated in time after a write > guaranteed to occur after that write has taken affect? > Yes > > 2) Are all writes that are initiated in time after a write (x) > guaranteed to occur after that write (x) has taken > affect? > Only overlapping offsets/regions retain causal ordering of completion. It is write-behind which acknowledges writes pre-maturely and therefore the layer which must maintain the 'effects' for further reads and writes by making the dependent IOs (overlapping offset/regions) wait for previous write's actual completion. Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 05:33:37 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 07:33:37 +0200 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: Message-ID: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Anand Avati wrote: > Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the > chown() or chmod() syscall issued by the application strictly block till > GlusterFS's fuse_setattr_cbk() is called? I have been able to narrow the test down to the code below, which does not even call chown(). 
#include #include #include #include #include #include int main(void) { int fd; (void)mkdir("subdir", 0755); do { if ((fd = open("subdir/bugc1.txt", O_CREAT|O_RDWR, 0644)) == -1) err(EX_OSERR, "open failed"); if (close(fd) == -1) err(EX_OSERR, "close failed"); if (unlink("subdir/bugc1.txt") == -1) err(EX_OSERR, "unlink failed"); } while (1 /*CONSTCOND */); /* NOTREACHED */ return EX_OK; } It produces a FUSE trace without SETATTR: > unique = 393, nodeid = 3098542496, opcode = LOOKUP (1) < unique = 393, nodeid = 3098542496, opcode = LOOKUP (1), error = -2 > unique = 394, nodeid = 3098542496, opcode = CREATE (35) < unique = 394, nodeid = 3098542496, opcode = CREATE (35), error = 0 -> I suspect (not yet checked) this is the place where I get fuse_entry_out with attr.uid = 0. This will be cached since attr_valid tells us to do so. > unique = 395, nodeid = 3098542396, opcode = RELEASE (18) < unique = 395, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > unique = 396, nodeid = 3098542296, opcode = LOOKUP (1) < unique = 396, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 >From other traces, I can tell that this last lookup is for the parent directory (subdir). The FUSE request for looking up bugc1.txt with the intent of deleting is not even sent: from cached uid we obtained from fuse_entry_out, we know that permissions shall be denied (I had a debug printf to check that). We do not even ask. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Tue May 22 05:44:30 2012 From: anand.avati at gmail.com (Anand Avati) Date: Mon, 21 May 2012 22:44:30 -0700 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: On Mon, May 21, 2012 at 10:33 PM, Emmanuel Dreyfus wrote: > Anand Avati wrote: > > > Is the FUSE SETATTR implementation in NetBSD synchronous? i.e, does the > > chown() or chmod() syscall issued by the application strictly block till > > GlusterFS's fuse_setattr_cbk() is called? > > I have been able to narrow the test down to the code below, which does not > even > call chown(). > > #include > #include > #include > #include > #include > #include > > int > main(void) > { > int fd; > > (void)mkdir("subdir", 0755); > > do { > if ((fd = open("subdir/bugc1.txt", O_CREAT|O_RDWR, 0644)) > == -1) > err(EX_OSERR, "open failed"); > > if (close(fd) == -1) > err(EX_OSERR, "close failed"); > > if (unlink("subdir/bugc1.txt") == -1) > err(EX_OSERR, "unlink failed"); > } while (1 /*CONSTCOND */); > > /* NOTREACHED */ > return EX_OK; > } > > It produces a FUSE trace without SETATTR: > > > unique = 393, nodeid = 3098542496, opcode = LOOKUP (1) > < unique = 393, nodeid = 3098542496, opcode = LOOKUP (1), error = -2 > > unique = 394, nodeid = 3098542496, opcode = CREATE (35) > < unique = 394, nodeid = 3098542496, opcode = CREATE (35), error = 0 > > -> I suspect (not yet checked) this is the place where I get > fuse_entry_out > with attr.uid = 0. This will be cached since attr_valid tells us to do so. > > > unique = 395, nodeid = 3098542396, opcode = RELEASE (18) > < unique = 395, nodeid = 3098542396, opcode = RELEASE (18), error = 0 > > unique = 396, nodeid = 3098542296, opcode = LOOKUP (1) > < unique = 396, nodeid = 3098542296, opcode = LOOKUP (1), error = 0 > > From other traces, I can tell that this last lookup is for the parent > directory > (subdir). 
The FUSE request for looking up bugc1.txt with the intent of > deleting > is not even sent: from cached uid we obtained from fuse_entry_out, we know > that > permissions shall be denied (I had a debug printf to check that). We do > not even > ask. > > Even in the case where bugc1.txt got a wrong uid returned (assuming so), it should not influence the permissibility of it getting deleted. The deletability of a file is based on the permissions on the parent directory and not the ownership of the file (unless +t sticky bit was set on the directory). Is there a way you can extend the trace code above to show the UIDs getting returned? Maybe it was the parent directory (subdir) that got a wrong UID returned? Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From aavati at redhat.com Tue May 22 07:11:36 2012 From: aavati at redhat.com (Anand Avati) Date: Tue, 22 May 2012 00:11:36 -0700 Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <7a290bd4-c833-4a35-af04-adb0052f6ff2@zmail01.collab.prod.int.phx2.redhat.com> References: <7a290bd4-c833-4a35-af04-adb0052f6ff2@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FBB3C28.2020106@redhat.com> The PARENT_DOWN_HANDLED approach will take us backwards from the current state where we are resiliant to frame losses and other class of bugs (i.e, if a frame loss happens on either server or client, it only results in prevented graph cleanup but the graph switch still happens). The root "cause" here is that we are giving up on a very important and fundamental principle of immutability on the fd object. The real solution here is to never modify fd->inode. Instead we must bring about a more native fd "migration" than just re-opening an existing fd on the new graph. Think of the inode migration analogy. The handle coming from FUSE (the address of the object) is a "hint". Usually the hint is right, if the object in the address belongs to the latest graph. If not, using the GFID we resolve a new inode on the latest graph and use it. In case of FD we can do something similar, except there are not GFIDs (which should not be a problem). We need to make the handle coming from FUSE (the address of fd_t) just a hint. If the fd->inode->table->xl->graph is the latest, then the hint was a HIT. If the graph was not the latest, we look for a previous migration attempt+result in the "base" (original) fd's context. If that does not exist or is not fresh (on the latest graph) then we do a new fd creation, open on new graph, fd_unref the old cached result in the fd context of the "base fd" and keep ref to this new result. All this must happen from fuse_resolve_fd(). The setting of the latest fd and updation of the latest fd pointer happens under the scope of the base_fd->lock() which gives it a very clear and unambiguous scope which was missing with the old scheme. [The next step will be to nuke the fd->inode swapping in fuse_create_cbk] Avati On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote: > > > ----- Original Message ----- >> From: "Pranith Kumar Karampuri" >> To: "Anand Avati" >> Cc: "Vijay Bellur", "Amar Tumballi", "Krishnan Parthasarathi" >> , "Raghavendra Gowdappa" >> Sent: Tuesday, May 22, 2012 8:42:58 AM >> Subject: Re: RFC on fix to bug #802414 >> >> Dude, >> We have already put logs yesterday in LOCK and UNLOCK and saw >> that the&fd->inode->lock address changed from LOCK to UNLOCK. > > Yes, even I too believe that the hang is because of fd->inode swap in fuse_migrate_fd and not the one in fuse_create_cbk. 
We could clearly see in the log files following race: > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this was a naive fix - hold lock on inode in old graph - to the race-condition caused by swapping fd->inode, which didn't work) > > poll-thr: tries to acquire fd->inode->lock (inode is old_inode present in old-graph) in afr_local_cleanup > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock > poll-thr: gets woken up from lock call on old_inode->lock. > poll-thr: does its work, but while unlocking, uses fd->inode where inode belongs to new graph. > > we had logs printing lock address before and after acquisition of lock and we could clearly see that lock address changed after acquiring lock in afr_local_cleanup. > >> >>>> "The hang in fuse_migrate_fd is _before_ the inode swap performed >>>> there." >> All the fds are opened on the same file. So all fds in the fd >> migration point to same inode. The race is hit by nth fd, (n+1)th fd >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and >> LOCK(fd->inode->lock) was done with one address then by the time >> UNLOCK(fd->inode->lock) is done the address changed. So the next fd >> that has to migrate hung because the prev inode lock is not >> unlocked. >> >> If after nth fd introduces the race a _cbk comes in epoll thread on >> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will >> hang. >> Which is my theory for the hang we observed on Saturday. >> >> Pranith. >> ----- Original Message ----- >> From: "Anand Avati" >> To: "Raghavendra Gowdappa" >> Cc: "Vijay Bellur", "Amar Tumballi" >> , "Krishnan Parthasarathi" >> , "Pranith Kumar Karampuri" >> >> Sent: Tuesday, May 22, 2012 2:09:33 AM >> Subject: Re: RFC on fix to bug #802414 >> >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote: >>> Avati, >>> >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new >>> inode to fd, once it looks up inode in new graph. But this >>> assignment can race with code that accesses fd->inode->lock >>> executing in poll-thread (pthr) as follows >>> >>> pthr: LOCK (fd->inode->lock); (inode in old graph) >>> rdthr: fd->inode = inode (resolved in new graph) >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph) >>> >> >> The way I see it (the backtrace output in the other mail), the swap >> happening in fuse_create_cbk() must be the one causing lock/unlock to >> land on different inode objects. The hang in fuse_migrate_fd is >> _before_ >> the inode swap performed there. Can you put some logs in >> fuse_create_cbk()'s inode swap code and confirm this? >> >> >>> Now, any lock operations on inode in old graph will block. Thanks >>> to pranith for pointing to this race-condition. >>> >>> The problem here is we don't have a single lock that can >>> synchronize assignment "fd->inode = inode" and other locking >>> attempts on fd->inode->lock. So, we are thinking that instead of >>> trying to synchronize, eliminate the parallel accesses altogether. >>> This can be done by splitting fd migration into two tasks. >>> >>> 1. Actions on old graph (like fsync to flush writes to disk) >>> 2. Actions in new graph (lookup, open) >>> >>> We can send PARENT_DOWN when, >>> 1. Task 1 is complete. >>> 2. No fop sent by fuse is pending. >>> >>> on receiving PARENT_DOWN, protocol/client will shutdown transports. >>> As part of transport cleanup, all pending frames are unwound and >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED >>> event. 
Each of the translator will pass this event to its parents >>> once it is convinced that there are no pending fops started by it >>> (like background self-heal, reads as part of read-ahead etc). Once >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there >>> will be no replies that will be racing with migration (note that >>> migration is done using syncops). At this point in time, it is >>> safe to start Task 2 (which associates fd with an inode in new >>> graph). >>> >>> Also note that reader thread will not do other operations till it >>> completes both tasks. >>> >>> As far as the implementation of this patch goes, major work is in >>> translators like read-ahead, afr, dht to provide the guarantee >>> required to send PARENT_DOWN_HANDLED event to their parents. >>> >>> Please let me know your thoughts on this. >>> >> >> All the above steps might not apply if it is caused by the swap in >> fuse_create_cbk(). Let's confirm that first. >> >> Avati >> From ian.latter at midnightcode.org Tue May 22 07:18:08 2012 From: ian.latter at midnightcode.org (Ian Latter) Date: Tue, 22 May 2012 17:18:08 +1000 Subject: [Gluster-devel] Gluster internals Message-ID: <201205220718.q4M7I8sJ019827@singularity.tronunltd.com> > > But they include an assumption. > > > > The query is: are async writes and reads sequential? The > > two specific cases are; > > > > 1) Are all reads that are initiated in time after a write > > guaranteed to occur after that write has taken affect? > > > > Yes > Excellent. > > > > 2) Are all writes that are initiated in time after a write (x) > > guaranteed to occur after that write (x) has taken > > affect? > > > > Only overlapping offsets/regions retain causal ordering of completion. It > is write-behind which acknowledges writes pre-maturely and therefore the > layer which must maintain the 'effects' for further reads and writes by > making the dependent IOs (overlapping offset/regions) wait for previous > write's actual completion. > Ok, that should do the trick. Let me mull over this for a while .. Thanks for that info. > Avati > -- Ian Latter Late night coder .. http://midnightcode.org/ From xhernandez at datalab.es Tue May 22 07:44:25 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 22 May 2012 09:44:25 +0200 Subject: [Gluster-devel] A healing translator In-Reply-To: References: <4FA8E8AB.2040604@datalab.es> Message-ID: <4FBB43D9.9070605@datalab.es> On 05/22/2012 02:11 AM, Anand Avati wrote: > > > On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez > > wrote: > > Hello developers, > > I would like to expose some ideas we are working on to create a > new kind of translator that should be able to unify and simplify > to some extent the healing procedures of complex translators. > > Currently, the only translator with complex healing capabilities > that we are aware of is AFR. We are developing another translator > that will also need healing capabilities, so we thought that it > would be interesting to create a new translator able to handle the > common part of the healing process and hence to simplify and avoid > duplicated code in other translators. > > The basic idea of the new translator is to handle healing tasks > nearer the storage translator on the server nodes instead to > control everything from a translator on the client nodes. Of > course the heal translator is not able to handle healing entirely > by itself, it needs a client translator which will coordinate all > tasks. 
The heal translator is intended to be used by translators > that work with multiple subvolumes. > > I will try to explain how it works without entering into too much > details. > > There is an important requisite for all client translators that > use healing: they must have exactly the same list of subvolumes > and in the same order. Currently, I think this is not a problem. > > The heal translator treats each file as an independent entity, and > each one can be in 3 modes: > > 1. Normal mode > > This is the normal mode for a copy or fragment of a file when > it is synchronized and consistent with the same file on other > nodes (for example with other replicas. It is the client > translator who decides if it is synchronized or not). > > 2. Healing mode > > This is the mode used when a client detects an inconsistency > in the copy or fragment of the file stored on this node and > initiates the healing procedures. > > 3. Provider mode (I don't like very much this name, though) > > This is the mode used by client translators when an > inconsistency is detected in this file, but the copy or > fragment stored in this node is considered good and it will be > used as a source to repair the contents of this file on other > nodes. > > Initially, when a file is created, it is set in normal mode. > Client translators that make changes must guarantee that they send > the modification requests in the same order to all the servers. > This should be done using inodelk/entrylk. > > When a change is sent to a server, the client must include a > bitmap mask of the clients to which the request is being sent. > Normally this is a bitmap containing all the clients, however, > when a server fails for some reason some bits will be cleared. The > heal translator uses this bitmap to early detect failures on other > nodes from the point of view of each client. When this condition > is detected, the request is aborted with an error and the client > is notified with the remaining list of valid nodes. If the client > considers the request can be successfully server with the > remaining list of nodes, it can resend the request with the > updated bitmap. > > The heal translator also updates two file attributes for each > change request to mantain the "version" of the data and metadata > contents of the file. A similar task is currently made by AFR > using xattrop. This would not be needed anymore, speeding write > requests. > > The version of data and metadata is returned to the client for > each read request, allowing it to detect inconsistent data. > > When a client detects an inconsistency, it initiates healing. > First of all, it must lock the entry and inode (when necessary). > Then, from the data collected from each node, it must decide which > nodes have good data and which ones have bad data and hence need > to be healed. There are two possible cases: > > 1. File is not a regular file > > In this case the reconstruction is very fast and requires few > requests, so it is done while the file is locked. In this > case, the heal translator does nothing relevant. > > 2. File is a regular file > > For regular files, the first step is to synchronize the > metadata to the bad nodes, including the version information. > Once this is done, the file is set in healing mode on bad > nodes, and provider mode on good nodes. Then the entry and > inode are unlocked. > > When a file is in provider mode, it works as in normal mode, but > refuses to start another healing. Only one client can be healing a > file. 
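To make the bitmap and version-counter idea described above concrete, here is a rough stand-alone sketch. Every name, the mask width and the accept/reject rule are invented for illustration; none of this is taken from the heal translator being described.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t alive_mask;    /* subvolumes this server last knew to be reachable */
    uint64_t data_version;  /* bumped on each applied data change     */
    uint64_t meta_version;  /* bumped on each applied metadata change */
} heal_file_state_t;

/* Accept the change only if the client's bitmap matches the server's
 * view of the file; otherwise abort and hand back the server's mask so
 * the client can decide whether to resend with an updated bitmap. */
static int
heal_check_and_apply(heal_file_state_t *st, uint32_t client_mask,
                     int is_data_change, uint32_t *server_mask_out)
{
    if (client_mask != st->alive_mask) {
        *server_mask_out = st->alive_mask;   /* early failure detection */
        return -1;
    }
    if (is_data_change)
        st->data_version++;
    else
        st->meta_version++;
    return 0;
}

int
main(void)
{
    heal_file_state_t st = { 0x7, 0, 0 };   /* three subvolumes up */
    uint32_t mask = 0;

    /* Client and server agree on the member set: write accepted. */
    printf("write 1 -> %d\n", heal_check_and_apply(&st, 0x7, 1, &mask));

    /* One subvolume dropped out on the server side: write rejected and
     * the client learns the surviving set before deciding to retry. */
    st.alive_mask = 0x3;
    printf("write 2 -> %d (server mask 0x%" PRIx32 ")\n",
           heal_check_and_apply(&st, 0x7, 1, &mask), mask);
    return 0;
}

The point of the toy is only the shape of the check: a mismatch between the client's view and the server's view is reported back before the write is applied, and successful changes bump separate data and metadata version counters instead of the xattrop-based bookkeeping.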
> > When a file is in healing mode, each normal write request from any > client are handled as if the file were in normal mode, updating > the version information and detecting possible inconsistencies > with the bitmap. Additionally, the healing translator marks the > written region of the file as "good". > > Each write request from the healing client intended to repair the > file must be marked with a special flag. In this case, the area > that wants to be written is filtered by the list of "good" ranges > (if there are any intersection with a good range, it is removed > from the request). The resulting set of ranges are propagated to > the lower translator and added to the list of "good" ranges but > the version information is not updated. > > Read requests are only served if the range requested is entirely > contained into the "good" regions list. > > There are some additional details, but I think this is enough to > have a general idea of its purpose and how it works. > > The main advantages of this translator are: > > 1. Avoid duplicated code in client translators > 2. Simplify and unify healing methods in client translators > 3. xattrop is not needed anymore in client translators to keep > track of changes > 4. Full file contents are repaired without locking the file > 5. Better detection and prevention of some split brain situations > as soon as possible > > I think it would be very useful. It seems to me that it works > correctly in all situations, however I don't have all the > experience that other developers have with the healing functions > of AFR, so I will be happy to answer any question or suggestion to > solve problems it may have or to improve it. > > What do you think about it ? > > > The goals you state above are all valid. What would really help > (adoption) is if you can implement this as a modification of AFR by > utilizing all the work already done, and you get brownie points if it > is backward compatible with existing AFR. If you already have any code > in a publishable state, please share it with us (github link?). > > Avati I've tried to understand how AFR works and, in some way, some of the ideas have been taken from it. However it is very complex and a lot of changes have been carried out in the master branch over the latest months. It's hard for me to follow them while actively working on my translator. Nevertheless, the main reason to take a separate path was that AFR is strongly bound to replication (at least from what I saw when I analyzed it more deeply. Maybe things have changed now, but haven't had time to review them). The requirements for my translator didn't fit very well with AFR, and the needed effort to understand and modify it to adapt it was too high. It also seems that there isn't any detailed developer info about internals of AFR that could have helped to be more confident to modify it (at least I haven't found it). I'm currenty working on it, but it's not ready yet. As soon as it is in a minimally stable state we will publish it, probably on github. I'll write the url to this list. Thank you -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From anand.avati at gmail.com Tue May 22 07:48:43 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 22 May 2012 00:48:43 -0700 Subject: [Gluster-devel] A healing translator In-Reply-To: <4FBB43D9.9070605@datalab.es> References: <4FA8E8AB.2040604@datalab.es> <4FBB43D9.9070605@datalab.es> Message-ID: > > > I've tried to understand how AFR works and, in some way, some of the > ideas have been taken from it. However it is very complex and a lot of > changes have been carried out in the master branch over the latest months. > It's hard for me to follow them while actively working on my translator. > Nevertheless, the main reason to take a separate path was that AFR is > strongly bound to replication (at least from what I saw when I analyzed it > more deeply. Maybe things have changed now, but haven't had time to review > them). > Have you reviewed the proactive self-heal daemon (+ changelog indexing translator) which is a potential functional replacement for what you might be attempting? Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 08:16:06 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 08:16:06 +0000 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: <20120522081606.GA3976@homeworld.netbsd.org> On Mon, May 21, 2012 at 10:44:30PM -0700, Anand Avati wrote: > Even in the case where bugc1.txt got a wrong uid returned (assuming so), it > should not influence the permissibility of it getting deleted. The > deletability of a file is based on the permissions on the parent directory > and not the ownership of the file (unless +t sticky bit was set on the > directory). This is interesting: I get the behavior you describe on Linux (ext2fs), but NetBSD (FFS) hehaves differently (these are native test, without glusterfs). Is it a grey area in standards? $ ls -la test/ total 16 drwxr-xr-x 2 root wheel 512 May 22 10:10 . drwxr-xr-x 19 manu wheel 5632 May 22 10:10 .. -rw-r--r-- 1 manu wheel 0 May 22 10:10 toto $ whoami manu $ rm -f test/toto rm: test/toto: Permission denied $ uname -sr NetBSD 5.1_STABLE -- Emmanuel Dreyfus manu at netbsd.org From rgowdapp at redhat.com Tue May 22 08:44:00 2012 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Tue, 22 May 2012 04:44:00 -0400 (EDT) Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <4FBB3C28.2020106@redhat.com> Message-ID: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > From: "Anand Avati" > To: "Raghavendra Gowdappa" > Cc: "Pranith Kumar Karampuri" , "Vijay Bellur" , "Amar Tumballi" > , "Krishnan Parthasarathi" , gluster-devel at nongnu.org > Sent: Tuesday, May 22, 2012 12:41:36 PM > Subject: Re: RFC on fix to bug #802414 > > > > The PARENT_DOWN_HANDLED approach will take us backwards from the > current > state where we are resiliant to frame losses and other class of bugs > (i.e, if a frame loss happens on either server or client, it only > results in prevented graph cleanup but the graph switch still > happens). > > The root "cause" here is that we are giving up on a very important > and > fundamental principle of immutability on the fd object. The real > solution here is to never modify fd->inode. Instead we must bring > about > a more native fd "migration" than just re-opening an existing fd on > the > new graph. > > Think of the inode migration analogy. 
The handle coming from FUSE > (the > address of the object) is a "hint". Usually the hint is right, if the > object in the address belongs to the latest graph. If not, using the > GFID we resolve a new inode on the latest graph and use it. > > In case of FD we can do something similar, except there are not GFIDs > (which should not be a problem). We need to make the handle coming > from > FUSE (the address of fd_t) just a hint. If the > fd->inode->table->xl->graph is the latest, then the hint was a HIT. > If > the graph was not the latest, we look for a previous migration > attempt+result in the "base" (original) fd's context. If that does > not > exist or is not fresh (on the latest graph) then we do a new fd > creation, open on new graph, fd_unref the old cached result in the fd > context of the "base fd" and keep ref to this new result. All this > must > happen from fuse_resolve_fd(). The setting of the latest fd and > updation > of the latest fd pointer happens under the scope of the > base_fd->lock() > which gives it a very clear and unambiguous scope which was missing > with > the old scheme. I remember discussing this solution during initial design. But, not sure why we dropped it. So, Can I go ahead with the implementation? Is this fix required post 3.3? > > [The next step will be to nuke the fd->inode swapping in > fuse_create_cbk] > > Avati > > On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote: > > > > > > ----- Original Message ----- > >> From: "Pranith Kumar Karampuri" > >> To: "Anand Avati" > >> Cc: "Vijay Bellur", "Amar > >> Tumballi", "Krishnan Parthasarathi" > >> , "Raghavendra Gowdappa" > >> Sent: Tuesday, May 22, 2012 8:42:58 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> Dude, > >> We have already put logs yesterday in LOCK and UNLOCK and saw > >> that the&fd->inode->lock address changed from LOCK to UNLOCK. > > > > Yes, even I too believe that the hang is because of fd->inode swap > > in fuse_migrate_fd and not the one in fuse_create_cbk. We could > > clearly see in the log files following race: > > fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this > > was a naive fix - hold lock on inode in old graph - to the > > race-condition caused by swapping fd->inode, which didn't work) > > > > poll-thr: tries to acquire fd->inode->lock (inode is old_inode > > present in old-graph) in afr_local_cleanup > > fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock > > poll-thr: gets woken up from lock call on old_inode->lock. > > poll-thr: does its work, but while unlocking, uses fd->inode where > > inode belongs to new graph. > > > > we had logs printing lock address before and after acquisition of > > lock and we could clearly see that lock address changed after > > acquiring lock in afr_local_cleanup. > > > >> > >>>> "The hang in fuse_migrate_fd is _before_ the inode swap > >>>> performed > >>>> there." > >> All the fds are opened on the same file. So all fds in the fd > >> migration point to same inode. The race is hit by nth fd, (n+1)th > >> fd > >> hangs. We have seen that afr_local_cleanup was doing fd_unref, and > >> LOCK(fd->inode->lock) was done with one address then by the time > >> UNLOCK(fd->inode->lock) is done the address changed. So the next > >> fd > >> that has to migrate hung because the prev inode lock is not > >> unlocked. > >> > >> If after nth fd introduces the race a _cbk comes in epoll thread > >> on > >> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will > >> hang. 
> >> Which is my theory for the hang we observed on Saturday. > >> > >> Pranith. > >> ----- Original Message ----- > >> From: "Anand Avati" > >> To: "Raghavendra Gowdappa" > >> Cc: "Vijay Bellur", "Amar Tumballi" > >> , "Krishnan Parthasarathi" > >> , "Pranith Kumar Karampuri" > >> > >> Sent: Tuesday, May 22, 2012 2:09:33 AM > >> Subject: Re: RFC on fix to bug #802414 > >> > >> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote: > >>> Avati, > >>> > >>> fuse_migrate_fd (running in reader thread - rdthr) assigns new > >>> inode to fd, once it looks up inode in new graph. But this > >>> assignment can race with code that accesses fd->inode->lock > >>> executing in poll-thread (pthr) as follows > >>> > >>> pthr: LOCK (fd->inode->lock); (inode in old graph) > >>> rdthr: fd->inode = inode (resolved in new graph) > >>> pthr: UNLOCK (fd->inode->lock); (inode in new graph) > >>> > >> > >> The way I see it (the backtrace output in the other mail), the > >> swap > >> happening in fuse_create_cbk() must be the one causing lock/unlock > >> to > >> land on different inode objects. The hang in fuse_migrate_fd is > >> _before_ > >> the inode swap performed there. Can you put some logs in > >> fuse_create_cbk()'s inode swap code and confirm this? > >> > >> > >>> Now, any lock operations on inode in old graph will block. Thanks > >>> to pranith for pointing to this race-condition. > >>> > >>> The problem here is we don't have a single lock that can > >>> synchronize assignment "fd->inode = inode" and other locking > >>> attempts on fd->inode->lock. So, we are thinking that instead of > >>> trying to synchronize, eliminate the parallel accesses > >>> altogether. > >>> This can be done by splitting fd migration into two tasks. > >>> > >>> 1. Actions on old graph (like fsync to flush writes to disk) > >>> 2. Actions in new graph (lookup, open) > >>> > >>> We can send PARENT_DOWN when, > >>> 1. Task 1 is complete. > >>> 2. No fop sent by fuse is pending. > >>> > >>> on receiving PARENT_DOWN, protocol/client will shutdown > >>> transports. > >>> As part of transport cleanup, all pending frames are unwound and > >>> protocol/client will notify its parents with PARENT_DOWN_HANDLED > >>> event. Each of the translator will pass this event to its parents > >>> once it is convinced that there are no pending fops started by it > >>> (like background self-heal, reads as part of read-ahead etc). > >>> Once > >>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there > >>> will be no replies that will be racing with migration (note that > >>> migration is done using syncops). At this point in time, it is > >>> safe to start Task 2 (which associates fd with an inode in new > >>> graph). > >>> > >>> Also note that reader thread will not do other operations till it > >>> completes both tasks. > >>> > >>> As far as the implementation of this patch goes, major work is in > >>> translators like read-ahead, afr, dht to provide the guarantee > >>> required to send PARENT_DOWN_HANDLED event to their parents. > >>> > >>> Please let me know your thoughts on this. > >>> > >> > >> All the above steps might not apply if it is caused by the swap in > >> fuse_create_cbk(). Let's confirm that first. 
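For readers following the LOCK/UNLOCK mismatch above, a small stand-alone program shows the hazard in isolation. The structures are toy stand-ins, not the glusterfs fd_t/inode_t, and the "safe" variant below only illustrates why both halves of the pair must use the same object; it is not the fd-migration fix that the thread goes on to discuss.

#include <pthread.h>
#include <stdio.h>

struct toy_inode {
    pthread_mutex_t lock;
};

struct toy_fd {
    struct toy_inode *inode;    /* may be swapped by another thread */
};

/* Buggy pattern: fd->inode is read once for the lock and again for the
 * unlock.  If another thread swaps fd->inode in between, the old inode
 * stays locked forever and the next locker hangs, which is the
 * behaviour described for afr_local_cleanup above. */
static void
unsafe_touch(struct toy_fd *fd)
{
    pthread_mutex_lock(&fd->inode->lock);
    /* ... fd->inode may be replaced here ... */
    pthread_mutex_unlock(&fd->inode->lock);
}

/* Safer pattern: capture the pointer once so lock and unlock always hit
 * the same object, whatever happens to fd->inode meanwhile. */
static void
safe_touch(struct toy_fd *fd)
{
    struct toy_inode *inode = fd->inode;

    pthread_mutex_lock(&inode->lock);
    /* ... critical section ... */
    pthread_mutex_unlock(&inode->lock);
}

int
main(void)
{
    struct toy_inode ino;
    struct toy_fd fd;

    pthread_mutex_init(&ino.lock, NULL);
    fd.inode = &ino;

    safe_touch(&fd);
    unsafe_touch(&fd);   /* harmless here only because nothing swaps fd.inode */
    printf("done\n");
    return 0;
}

Capturing the pointer narrows the window but does not remove the underlying problem (the captured inode can still be the one being retired), which is why the discussion below moves towards never mutating fd->inode at all.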
> >> > >> Avati > >> > > From xhernandez at datalab.es Tue May 22 08:51:22 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Tue, 22 May 2012 10:51:22 +0200 Subject: [Gluster-devel] A healing translator In-Reply-To: References: <4FA8E8AB.2040604@datalab.es> <4FBB43D9.9070605@datalab.es> Message-ID: <4FBB538A.70201@datalab.es> On 05/22/2012 09:48 AM, Anand Avati wrote: > >> > I've tried to understand how AFR works and, in some way, some of > the ideas have been taken from it. However it is very complex and > a lot of changes have been carried out in the master branch over > the latest months. It's hard for me to follow them while actively > working on my translator. Nevertheless, the main reason to take a > separate path was that AFR is strongly bound to replication (at > least from what I saw when I analyzed it more deeply. Maybe things > have changed now, but haven't had time to review them). > > > Have you reviewed the proactive self-heal daemon (+ changelog indexing > translator) which is a potential functional replacement for what you > might be attempting? > > Avati I must admit that I've read something about it but I haven't had time to explore it in detail. If I understand it correctly, the self-heal daemon works as a client process but can be executed on server nodes. I suppose that multiple self-heal daemons can be running on different nodes. Then, each daemon detects invalid files (not sure exactly how) and replicates the changes from one good node to the bad nodes. The problem is that in the translator I'm working on, the information is dispersed among multiple nodes, so there isn't a single server node that contains the whole data. To repair a node, data must be read from at least two other nodes (it depends on configuration). From what I've read from AFR and the self-healing daemon, it's not straightforward to adapt them to this mechanism because they would need to know a subset of nodes with consistent data, not only one. Each daemon would have to contact all other nodes, read data from each one, determine which ones are valid, rebuild the data and send it to the bad nodes. This means that the daemon will have to be as complex as the clients. My impression (but I may be wrong) is that AFR and the self-healing daemon are closely bound to the replication schema, so it is very hard to try to use them for other purposes. The healing translator I'm writing tries to offer generic server side helpers for the healing process, but it is the client side who really manages the healing operation (though heavily simplified) and could use it to replicate data, to disperse data, or some other schema. Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Tue May 22 09:08:48 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 22 May 2012 09:08:48 +0000 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> Message-ID: <20120522090848.GC3976@homeworld.netbsd.org> On Mon, May 21, 2012 at 10:44:30PM -0700, Anand Avati wrote: > Is there a way you can extend the trace code above to show the UIDs getting > returned? Maybe it was the parent directory (subdir) that got a wrong UID > returned? Further investigation shows you are right. 
I traced the struct fuse_entry_out returned by glusterfs on LOOKUP; "/subdir", uid = 500, gid = 500, mode = 040755, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 bugc1.txt is looked up many times as I loop creating/deleting it subdir is not looked up often since it is cached for 1 second. New subdir lookups will return correct uid/gid/mode. After some time, though, it will return incorrect information: "/subdir/bugc1.txt", uid = 500, gid = 500, mode = 0100644, attr_valid = 1 "/subdir", uid = 0, gid = 0, mode = 040700, attr_valid = 1 -- Emmanuel Dreyfus manu at netbsd.org From aavati at redhat.com Tue May 22 17:47:49 2012 From: aavati at redhat.com (Anand Avati) Date: Tue, 22 May 2012 10:47:49 -0700 Subject: [Gluster-devel] RFC on fix to bug #802414 In-Reply-To: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> References: <96991134-54b7-4e4b-a325-b0cdafec8abb@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <4FBBD145.3030303@redhat.com> On 05/22/2012 01:44 AM, Raghavendra Gowdappa wrote: > > > ----- Original Message ----- >> From: "Anand Avati" >> To: "Raghavendra Gowdappa" >> Cc: "Pranith Kumar Karampuri", "Vijay Bellur", "Amar Tumballi" >> , "Krishnan Parthasarathi", gluster-devel at nongnu.org >> Sent: Tuesday, May 22, 2012 12:41:36 PM >> Subject: Re: RFC on fix to bug #802414 >> >> >> >> The PARENT_DOWN_HANDLED approach will take us backwards from the >> current >> state where we are resiliant to frame losses and other class of bugs >> (i.e, if a frame loss happens on either server or client, it only >> results in prevented graph cleanup but the graph switch still >> happens). >> >> The root "cause" here is that we are giving up on a very important >> and >> fundamental principle of immutability on the fd object. The real >> solution here is to never modify fd->inode. Instead we must bring >> about >> a more native fd "migration" than just re-opening an existing fd on >> the >> new graph. >> >> Think of the inode migration analogy. The handle coming from FUSE >> (the >> address of the object) is a "hint". Usually the hint is right, if the >> object in the address belongs to the latest graph. If not, using the >> GFID we resolve a new inode on the latest graph and use it. >> >> In case of FD we can do something similar, except there are not GFIDs >> (which should not be a problem). We need to make the handle coming >> from >> FUSE (the address of fd_t) just a hint. If the >> fd->inode->table->xl->graph is the latest, then the hint was a HIT. >> If >> the graph was not the latest, we look for a previous migration >> attempt+result in the "base" (original) fd's context. If that does >> not >> exist or is not fresh (on the latest graph) then we do a new fd >> creation, open on new graph, fd_unref the old cached result in the fd >> context of the "base fd" and keep ref to this new result. All this >> must >> happen from fuse_resolve_fd(). The setting of the latest fd and >> updation >> of the latest fd pointer happens under the scope of the >> base_fd->lock() >> which gives it a very clear and unambiguous scope which was missing >> with >> the old scheme. > > I remember discussing this solution during initial design. But, not sure why we dropped it. So, Can I go ahead with the implementation? Is this fix required post 3.3? 
> The solution you are probably referring to was dropped because there we were talking about chaining FDs to the one on the "next graph" as graphs keep getting changed. The one described above is different because here there will one base fd (the original one on which open() by fuse was performed) and new graphs result in creation of an internal new fd directly referred by the base fd (and naturally unref the previous "new fd") thereby keeping things quite trim. Avati From anand.avati at gmail.com Tue May 22 20:09:52 2012 From: anand.avati at gmail.com (Anand Avati) Date: Tue, 22 May 2012 13:09:52 -0700 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: <20120522090848.GC3976@homeworld.netbsd.org> References: <1kkhgr1.ap0abr3ec5ziM%manu@netbsd.org> <20120522090848.GC3976@homeworld.netbsd.org> Message-ID: On Tue, May 22, 2012 at 2:08 AM, Emmanuel Dreyfus wrote: > > Further investigation shows you are right. I traced the > struct fuse_entry_out returned by glusterfs on LOOKUP; > > "/subdir", uid = 500, gid = 500, mode = 040755, attr_valid = 1 > ... > "/subdir", uid = 0, gid = 0, mode = 040700, attr_valid = 1 > Note that even mode has changed, not just the uid/gid. It will probably help if you can put a breakpoint in this case and inspect the stack about where these attribute fields are fetched from (some cache? from posix?) Avati -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Wed May 23 02:04:25 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 23 May 2012 04:04:25 +0200 Subject: [Gluster-devel] metadata race confition (was: ename(2) race condition) In-Reply-To: Message-ID: <1kkj4ca.1knxmw01kr7wlgM%manu@netbsd.org> Anand Avati wrote: > Note that even mode has changed, not just the uid/gid. It will probably > help if you can put a breakpoint in this case and inspect the stack about > where these attribute fields are fetched from (some cache? from posix?) My tests shows that the garbage is introduced by mdc_inode_iatt_get() in mdc_lookup(). -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vijay at build.gluster.com Wed May 23 13:57:15 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Wed, 23 May 2012 06:57:15 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa43 released Message-ID: <20120523135718.0E6111008C@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa43/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz This release is made off v3.3.0qa43 From manu at netbsd.org Wed May 23 16:58:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Wed, 23 May 2012 16:58:02 +0000 Subject: [Gluster-devel] preparent and postparent? Message-ID: <20120523165802.GC17268@homeworld.netbsd.org> Hi in the protocol/server xlator, there are many occurences where callbacks have a struct iatt for preparent and postparent. What are these for? Is it a normal behavior to have different things in preparent and postparent? -- Emmanuel Dreyfus manu at netbsd.org From jdarcy at redhat.com Wed May 23 17:03:41 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Wed, 23 May 2012 13:03:41 -0400 Subject: [Gluster-devel] preparent and postparent? 
In-Reply-To: <20120523165802.GC17268@homeworld.netbsd.org> References: <20120523165802.GC17268@homeworld.netbsd.org> Message-ID: <20120523130341.1ee693a3@jdarcy-dt.usersys.redhat.com> On Wed, 23 May 2012 16:58:02 +0000 Emmanuel Dreyfus wrote: > in the protocol/server xlator, there are many occurences where > callbacks have a struct iatt for preparent and postparent. What are > these for? NFS needs them to support its style of caching. From manu at netbsd.org Thu May 24 01:31:18 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 24 May 2012 03:31:18 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <20120523130341.1ee693a3@jdarcy-dt.usersys.redhat.com> Message-ID: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> Jeff Darcy wrote: > > in the protocol/server xlator, there are many occurences where > > callbacks have a struct iatt for preparent and postparent. What are > > these for? > > NFS needs them to support its style of caching. Let me rephrase: what information is stored in preparent and postparent? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Thu May 24 04:29:39 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Thu, 24 May 2012 06:29:39 +0200 Subject: [Gluster-devel] gerrit Message-ID: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> Hi In gerrit, if I sign it and look at the Download field in a patchset, I see this: git fetch ssh://manu@/glusterfs refs/changes/13/3413/2 && git format-patch -1 --stdout FETCH_HEAD It would be nice if I had ssh://manu at git.gluster.com/glusterfs.git so that the line can be copy/pasted without the need to edit each time. Is it something I need to configure (where?), or is it a global setting beyond my reach (in that case, please someone fix it!) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Thu May 24 06:30:20 2012 From: anand.avati at gmail.com (Anand Avati) Date: Wed, 23 May 2012 23:30:20 -0700 Subject: [Gluster-devel] gerrit In-Reply-To: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> References: <1kkl5w4.dyowb9lel6oM%manu@netbsd.org> Message-ID: fixed! On Wed, May 23, 2012 at 9:29 PM, Emmanuel Dreyfus wrote: > Hi > > In gerrit, if I sign it and look at the Download field in a patchset, I > see this: > > git fetch ssh://manu@/glusterfs refs/changes/13/3413/2 && git > format-patch -1 --stdout FETCH_HEAD > > It would be nice if I had ssh://manu at git.gluster.com/glusterfs.git > so that the line can be copy/pasted without the need to edit each time. > Is it something I need to configure (where?), or is it a global setting > beyond my reach (in that case, please someone fix it!) > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at datalab.es Thu May 24 07:10:59 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Thu, 24 May 2012 09:10:59 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> Message-ID: <4FBDDF03.8080203@datalab.es> On 05/24/2012 03:31 AM, Emmanuel Dreyfus wrote: > Jeff Darcy wrote: > >>> in the protocol/server xlator, there are many occurences where >>> callbacks have a struct iatt for preparent and postparent. 
What are >>> these for? >> NFS needs them to support its style of caching. > Let me rephrase: what information is stored in preparent and postparent? preparent and postparent have the attributes (modification time, size, permissions, ...) of the parent directory of the file being modified before and after the modification is done. Xavi From jdarcy at redhat.com Thu May 24 13:05:08 2012 From: jdarcy at redhat.com (Jeff Darcy) Date: Thu, 24 May 2012 09:05:08 -0400 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <4FBDDF03.8080203@datalab.es> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> <4FBDDF03.8080203@datalab.es> Message-ID: <4FBE3204.7050005@redhat.com> On 05/24/2012 03:10 AM, Xavier Hernandez wrote: > preparent and postparent have the attributes (modification time, size, > permissions, ...) of the parent directory of the file being modified > before and after the modification is done. Thank you, Xavi. :) If you really want to have some fun, you can take a look at the rename callback, which has pre- and post-attributes for both the old and new parent. From johnmark at redhat.com Thu May 24 19:21:22 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 24 May 2012 15:21:22 -0400 (EDT) Subject: [Gluster-devel] glusterfs-3.3.0qa43 released In-Reply-To: <20120523135718.0E6111008C@build.gluster.com> Message-ID: <7c8ea685-d794-451e-820a-25f784e7873d@zmail01.collab.prod.int.phx2.redhat.com> A reminder: As we come down to the final days, it is vitally important that we test these last few qa releases. This one, in particular, contains fixes added to the 3.3 branch after beta 4 was release last week: http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz Please consider using the testing page when evaluating: http://www.gluster.org/community/documentation/index.php/3.3.0_Beta_4_Tests Also, if someone would like to test the object storage as well as the HDFS piece, please report here, or create another test page on the wiki. Finally, you can track all commits to the master and 3.3 branches on Twitter (@glusterdev) ...and via Atom/Rss - https://github.com/gluster/glusterfs/commits/release-3.3.atom https://github.com/gluster/glusterfs/commits/master.atom -JM ----- Original Message ----- > > http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa43/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa43.tar.gz > > This release is made off v3.3.0qa43 > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > From xhernandez at datalab.es Fri May 25 07:28:43 2012 From: xhernandez at datalab.es (Xavier Hernandez) Date: Fri, 25 May 2012 09:28:43 +0200 Subject: [Gluster-devel] preparent and postparent? In-Reply-To: <4FBE3204.7050005@redhat.com> References: <1kkkxdd.899gmz10i9s06M%manu@netbsd.org> <4FBDDF03.8080203@datalab.es> <4FBE3204.7050005@redhat.com> Message-ID: <4FBF34AB.6070606@datalab.es> On 05/24/2012 03:05 PM, Jeff Darcy wrote: > On 05/24/2012 03:10 AM, Xavier Hernandez wrote: >> preparent and postparent have the attributes (modification time, size, >> permissions, ...) of the parent directory of the file being modified >> before and after the modification is done. > Thank you, Xavi. :) If you really want to have some fun, you can take a look > at the rename callback, which has pre- and post-attributes for both the old and > new parent. Yes, I've had some "fun" with them. 
Without them almost all callbacks would seem too short to me now... hehehe From fernando.frediani at qubenet.net Fri May 25 09:44:10 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 09:44:10 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> Hi, I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 and the new type of volume striped + replicated. My go is to use it to run Virtual Machines (.vmdk files). Volume is created fine and the ESXi server mountw the Datastore using Gluster built-in NFS, however when trying to use the Datastore or even read, it hangs. Looking at the Gluster NFS logs I see: "[socket.c:195:__socket_rwv] 0-socket.nfs-server: readv failed (Connection reset by peer)" In order to get the rpm files installed I had first to install these two because of the some libraries: "compat-readline5-5.2-17.1.el6.x86_64".rpm and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has anything to do with that. Has anyone ever used Gluster as a backend storage for ESXi ? Does it actually work ? Regards, Fernando Frediani Lead Systems Engineer Qube Managed Services Limited 260-266 Goswell Road, London, EC1V 7EB, United Kingdom sales: +44 (0) 20 7150 3800 ddi: +44 (0) 20 7150 3803 fax: +44 (0) 20 7336 8420 web: http://www.qubenet.net/ P Please consider the environment before printing this email -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.frediani at qubenet.net Fri May 25 11:36:55 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 11:36:55 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Actually, even on another Linux machine mounting NFS has the same behaviour. I am able to mount it with "mount -t nfs ..." but when I try "ls" it hangs as well. One particular thing of the Gluster servers is that they have two networks, one for management with default gateway and another only for storage. I am only able to mount on the storage network. The hosts file has all nodes' names with the ips on the storage network. I tried to use this but didn't work either. 
gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* Watching the nfs logs when I try a "ls" from the remote client it shows: pending frames: patchset: git://git.gluster.com/glusterfs.git signal received: 11 time of crash: 2012-05-25 11:38:09 configuration details: argp 1 backtrace 1 dlfcn 1 fdatasync 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 3.3.0beta4 /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup+0xa5)[0x7f1c8e7a6ac5] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdirp_cbk+0x536)[0x7f1c8e543346] /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdirp_cbk+0x1ca)[0x7f1c8e76269a] /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_readdirp_cbk+0x170)[0x7f1c8e9dbbe0] /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x7f1c8f818c8f] /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_handler+0x188)[0x7f1c8f818e38] /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] /usr/sbin/glusterfs(main+0x502)[0x406612] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] /usr/sbin/glusterfs[0x404399] Thanks Fernando From: Fernando Frediani (Qube) Sent: 25 May 2012 10:44 To: 'gluster-devel at nongnu.org' Subject: Can't use NFS with VMware ESXi Hi, I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 and the new type of volume striped + replicated. My go is to use it to run Virtual Machines (.vmdk files). Volume is created fine and the ESXi server mountw the Datastore using Gluster built-in NFS, however when trying to use the Datastore or even read, it hangs. Looking at the Gluster NFS logs I see: "[socket.c:195:__socket_rwv] 0-socket.nfs-server: readv failed (Connection reset by peer)" In order to get the rpm files installed I had first to install these two because of the some libraries: "compat-readline5-5.2-17.1.el6.x86_64".rpm and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has anything to do with that. Has anyone ever used Gluster as a backend storage for ESXi ? Does it actually work ? Regards, Fernando Frediani Lead Systems Engineer Qube Managed Services Limited 260-266 Goswell Road, London, EC1V 7EB, United Kingdom sales: +44 (0) 20 7150 3800 ddi: +44 (0) 20 7150 3803 fax: +44 (0) 20 7336 8420 web: http://www.qubenet.net/ P Please consider the environment before printing this email -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Fri May 25 13:35:19 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Fri, 25 May 2012 13:35:19 +0000 Subject: [Gluster-devel] mismatching ino/dev between file Message-ID: <20120525133519.GC19383@homeworld.netbsd.org> Hi Here is a bug with release-3.3. It happens on a 2 way replicated. 
Here is what I have in one brick: [2012-05-25 15:03:09.463446] W [posix-handle.c:487:posix_handle_hard] 0-pfs-posix: mismatching ino/dev between file /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (57943061/16) and handle /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce (57943060/16) [2012-05-25 15:03:09.463552] E [posix.c:1277:posix_symlink] 0-pfs-posix: setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 failed On the other one: [2012-05-25 15:03:09.447682] W [posix-handle.c:487:posix_handle_hard] 0-pfs-posix: mismatching ino/dev between file /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (50557989/24) and handle /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce (50557988/24) [2012-05-25 15:03:09.447774] E [posix.c:1277:posix_symlink] 0-pfs-posix: setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 failed Someone can give me a hint of what happens, and how to track it down? -- Emmanuel Dreyfus manu at netbsd.org From abperiasamy at gmail.com Fri May 25 17:09:09 2012 From: abperiasamy at gmail.com (Anand Babu Periasamy) Date: Fri, 25 May 2012 10:09:09 -0700 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with ?mount ?t nfs ?? but when I try ?ls? it hangs as > well. > > One particular thing of the Gluster servers is that they have two networks, > one for management with default gateway and another only for storage. I am > only able to mount on the storage network. > > The hosts file has all nodes? names with the ips on the storage network. > > > > I tried to use this but didn?t work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a ?ls? 
from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup+0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdirp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdirp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_readdirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_handler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I?ve setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 beta4 > and the new type of volume striped + replicated. My go is to use it to run > Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or even > read, it hangs. > > > > Looking at the Gluster NFS logs I see: ????[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)? > > > > In order to get the rpm files installed I had first to install these two > because of the some libraries: ?compat-readline5-5.2-17.1.el6.x86_64?.rpm > and ?openssl098e-0.9.8e-17.el6.centos.x86_64.rpm?.Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
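The difference between the two layouts being recommended here can be sketched in a few lines. The subvolume count, block size and hash below are invented for the example; they are not the real stripe or dht parameters or code.

#include <stdint.h>
#include <stdio.h>

#define SUBVOLS    4
#define BLOCK_SIZE (128 * 1024)   /* illustrative stripe block size */

/* stripe: which subvolume holds the block containing this offset? */
static int
stripe_subvol_for_offset(uint64_t offset)
{
    return (int)((offset / BLOCK_SIZE) % SUBVOLS);
}

/* distribute: which subvolume holds the whole file?
 * (toy hash, not the real dht hash) */
static int
dht_subvol_for_name(const char *name)
{
    uint32_t h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return (int)(h % SUBVOLS);
}

int
main(void)
{
    uint64_t off;

    /* A large .vmdk written sequentially touches every subvolume... */
    for (off = 0; off < 8ULL * BLOCK_SIZE; off += BLOCK_SIZE)
        printf("offset %8llu -> stripe subvol %d\n",
               (unsigned long long)off, stripe_subvol_for_offset(off));

    /* ...whereas distribute keeps each file on exactly one subvolume. */
    printf("vm1.vmdk -> dht subvol %d\n", dht_subvol_for_name("vm1.vmdk"));
    printf("vm2.vmdk -> dht subvol %d\n", dht_subvol_for_name("vm2.vmdk"));
    return 0;
}

With striping, a single large .vmdk spreads its blocks, and therefore its I/O, over every subvolume, which is the access pattern stripe was written for; with distribute+replicate each image stays whole on one replica set, which is the layout being recommended above for VM images.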
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From pmatthaei at debian.org Fri May 25 18:56:37 2012 From: pmatthaei at debian.org (=?ISO-8859-1?Q?Patrick_Matth=E4i?=) Date: Fri, 25 May 2012 20:56:37 +0200 Subject: [Gluster-devel] glusterfs-3.2.7qa1 released In-Reply-To: <20120412172933.6A2A8102E6@build.gluster.com> References: <20120412172933.6A2A8102E6@build.gluster.com> Message-ID: <4FBFD5E5.1060901@debian.org> Am 12.04.2012 19:29, schrieb Vijay Bellur: > > http://bits.gluster.com/pub/gluster/glusterfs/3.2.7qa1/ > > http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.2.7qa1.tar.gz > > This release is made off v3.2.7qa1 Hey, I have tested this qa release and could not find any regression/problem. It would be realy nice to have a 3.2.7 release in the next days (max 2 weeks from now on) so that we could ship glusterfs 3.2.7 instead of 3.2.6 with our next release Debian Wheezy! -- /* Mit freundlichem Gru? / With kind regards, Patrick Matth?i GNU/Linux Debian Developer E-Mail: pmatthaei at debian.org patrick at linux-dev.org */ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: OpenPGP digital signature URL: From fernando.frediani at qubenet.net Fri May 25 19:33:37 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 19:33:37 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. 
> > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From fernando.frediani at qubenet.net Fri May 25 20:32:25 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Fri, 25 May 2012 20:32:25 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. 
I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. 
-- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From manu at netbsd.org Sat May 26 05:37:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 07:37:51 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate Message-ID: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> here is a bug in release-3.3:

./xinstall -c -p -r -m 555 xinstall /pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/i386--netbsdelf-instal
xinstall: /pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/inst.00033a: chmod: Permission denied

Kernel trace, client side:

33 1 xinstall CALL open(0xbfbfd8e0,0xa02,0x180)
33 1 xinstall NAMI "/pfs/manu/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/inst.00033a"
33 1 xinstall RET open 3
33 1 xinstall CALL open(0x (...)
33 1 xinstall CALL fchmod(3,0x16d)
33 1 xinstall RET fchmod -1 errno 13 Permission denied

I tracked this down to posix_acl_truncate() on the server, where loc->inode and loc->path are NULL. This code goes red and raises EACCES:

if (acl_permits (frame, loc->inode, POSIX_ACL_WRITE))
        goto green;
else
        goto red;

Here is the relevant backtrace:

#9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898
#10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at posix.c:204
#11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at defaults.c:47
#12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, loc=0xba60091c, xdata=0x0) at posix.c:231

In frame 12, loc->inode is not NULL, and loc->path makes sense: "/netbsd/usr/src/tooldir.NetBSD-6.99.4-i386/bin/inst.01911a" In frame 10, loc->path and loc->inode are NULL. I note that xlators/features/locks/src/posix.c:pl_ftruncate() sets truncate_stat_cbk() as the callback, and not ftruncate_stat_cbk(). That latter function does not even exist. f-style functions not calling f-style callbacks have been the root of various bugs so far, is it one more of them? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vbellur at redhat.com Sat May 26 07:44:52 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Sat, 26 May 2012 13:14:52 +0530 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> References: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> Message-ID: <4FC089F4.3070004@redhat.com> On 05/26/2012 11:07 AM, Emmanuel Dreyfus wrote: > here is a bug in release-3.3: > > > I tracked this down to posix_acl_truncate() on the server, where loc->inode > and loc->path are NULL.
This code goes red and raise EACCESS: > > if (acl_permits (frame, loc->inode, POSIX_ACL_WRITE)) > goto green; > else > goto red; > > Here is the relevant baccktrace: > > #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, > loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 > #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, > this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) > at posix.c:204 > #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, > this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) > at defaults.c:47 > #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, > loc=0xba60091c, xdata=0x0) at posix.c:231 > > In frame 12, loc->inode is not NULL, and loc->path makes sense: > "/netbsd/usr/src/tooldir.NetBSD-6.9 > 9.4-i386/bin/inst.01911a" > > In frame 10, loc->path and loc->inode are NULL. > > In note that xlators/features/locks/src/posix.c:pl_ftruncate() sets > truncate_stat_cbk() as the callback, and not ftruncate_stat_cbk(). That later > function does not even exist. f-style functions not calling f-style callbacks > have been the root of various bugs so far, is it one more of them? I don't think it is a f-style problem. I do not get a EPERM with the testcase that you posted for qa39. Can you please provide a bigger bt? Thanks, Vijay > > From manu at netbsd.org Sat May 26 09:00:22 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 11:00:22 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkp7w9.1a5c4mz1tiqw8rM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. Can you please provide a bigger bt? 
#3 0xb99414c4 in server_truncate_cbk (frame=0xba901714, cookie=0xbb77f010, this=0xb9d27000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at server3_1-fops.c:1218 #4 0xb9968bd6 in io_stats_truncate_cbk (frame=0xbb77f010, cookie=0xbb77f080, this=0xb9d26000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at io-stats.c:1600 #5 0xb998036e in marker_truncate_cbk (frame=0xbb77f080, cookie=0xbb77f0f0, this=0xb9d25000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at marker.c:1535 #6 0xbbb87a85 in default_truncate_cbk (frame=0xbb77f0f0, cookie=0xbb77f160, this=0xb9d24000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at defaults.c:58 #7 0xb99a8fa2 in iot_truncate_cbk (frame=0xbb77f160, cookie=0xbb77f400, this=0xb9d23000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at io-threads.c:1270 #8 0xb99b9fe0 in pl_truncate_cbk (frame=0xbb77f400, cookie=0xbb77f780, this=0xb9d22000, op_ret=-1, op_errno=13, prebuf=0x0, postbuf=0x0, xdata=0x0) at posix.c:119 #9 0xb99d1ca6 in posix_acl_truncate (frame=0xbb77f780, this=0xb9d20000, loc=0xb9d41020, off=48933, xdata=0x0) at posix-acl.c:898 #10 0xb99ba4f8 in truncate_stat_cbk (frame=0xbb77f400, cookie=0xbb77f6a0, this=0xb9d22000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at posix.c:204 #11 0xbbb87933 in default_stat_cbk (frame=0xbb77f6a0, cookie=0xbb77f710, this=0xb9d20000, op_ret=0, op_errno=0, buf=0xb89ffac4, xdata=0x0) at defaults.c:47 #12 0xb99e1751 in posix_stat (frame=0xbb77f710, this=0xb9d1f000, loc=0xba60091c, xdata=0x0) at posix.c:231 #13 0xbbb94d76 in default_stat (frame=0xbb77f6a0, this=0xb9d20000, loc=0xba60091c, xdata=0x0) at defaults.c:1231 #14 0xb99babb0 in pl_truncate (frame=0xbb77f400, this=0xb9d22000, loc=0xba60091c, offset=48933, xdata=0x0) at posix.c:249 #15 0xb99a91ac in iot_truncate_wrapper (frame=0xbb77f160, this=0xb9d23000, loc=0xba60091c, offset=48933, xdata=0x0) at io-threads.c:1280 #16 0xbbba76d8 in call_resume_wind (stub=0xba6008fc) at call-stub.c:2474 #17 0xbbbae729 in call_resume (stub=0xba6008fc) at call-stub.c:4151 #18 0xb99a22a3 in iot_worker (data=0xb9d12110) at io-threads.c:131 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 11:51:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 13:51:46 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkpd53.bn09pz1v8qmwtM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. I wonder if the bug can occur because some mess in the .glusterfs directory cause by an earlier problem. Is it possible? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 12:55:08 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 14:55:08 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkpd53.bn09pz1v8qmwtM%manu@netbsd.org> Message-ID: <1kkpirc.geu5yvq0165fM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I wonder if the bug can occur because some mess in the .glusterfs > directory cause by an earlier problem. Is it possible? That is not the problem: I nuked .glusterfs on all bricks and the problem remain. 
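Since the EACCES in the backtrace above comes from the permission check in posix_acl_truncate() being handed loc->inode == NULL while the operation actually originated on an open fd, here is a hypothetical helper that shows the kind of fallback being discussed. It is written against simplified stand-in types rather than the real glusterfs structures, and it is a sketch of the idea, not the actual fix:

struct inode;                          /* stand-in for inode_t */
struct loc { struct inode *inode; };   /* stand-in for loc_t   */
struct fd  { struct inode *inode; };   /* stand-in for fd_t    */

/* prefer the loc's inode, fall back to the fd's inode, so that an
 * fd-based path (fsetattr/ftruncate) does not feed NULL to the
 * permission check */
static struct inode *
resolve_inode (struct loc *loc, struct fd *fd)
{
        if (loc && loc->inode)
                return loc->inode;
        if (fd)
                return fd->inode;
        return NULL;                   /* caller still has to handle this */
}

Fed through something like acl_permits (frame, resolve_inode (loc, fd), POSIX_ACL_WRITE), the check would at least see the inode the client operated on; whether that is where the real bug sits is exactly what the rest of this thread narrows down.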
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sat May 26 14:20:10 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sat, 26 May 2012 16:20:10 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC089F4.3070004@redhat.com> Message-ID: <1kkpmmr.rrgubdjz6w9fM%manu@netbsd.org> Vijay Bellur wrote: > I don't think it is a f-style problem. I do not get a EPERM with the > testcase that you posted for qa39. Can you please provide a bigger bt? Here is a minimal test case that reproduces the problem on my setup. Run it as an unprivileged user in a directory on which you have write access:

$ pwd
/pfs/manu/xinstall
$ ls -ld .
drwxr-xr-x 4 manu manu 512 May 26 16:17 .
$ id
uid=500(manu) gid=500(manu) groups=500(manu),0(wheel)
$ ./test
test: fchmod failed: Permission denied

#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <sysexits.h>
#include <unistd.h>

#define TESTFILE "testfile"

int
main(void)
{
        int fd;
        char buf[16384];

        if ((unlink(TESTFILE) == -1) && (errno != ENOENT))
                err(EX_OSERR, "unlink failed");

        if ((fd = open(TESTFILE, O_CREAT|O_EXCL|O_RDWR, 0600)) == -1)
                err(EX_OSERR, "open failed");

        /* grow the file so the fchmod below hits a freshly written fd */
        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                err(EX_OSERR, "write failed");

        /* this is the call that fails with Permission denied through glusterfs */
        if (fchmod(fd, 0555) == -1)
                err(EX_OSERR, "fchmod failed");

        if (close(fd) == -1)
                err(EX_OSERR, "close failed");

        return EX_OK;
}

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Sun May 27 05:17:51 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 27 May 2012 07:17:51 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkoxfb.xo4yxvos90qeM%manu@netbsd.org> Message-ID: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org> Emmanuel Dreyfus wrote: > In frame 10, loc->path and loc->inode are NULL. Here is the investigation so far: xlators/features/locks/src/posix.c:truncate_stat_cbk() has a NULL loc->inode, and this leads to the acl check that fails. As I understand it, this is a FUSE implementation problem. fchmod() produces a FUSE SETATTR. If the file is being written, NetBSD FUSE will set mode, size, atime, mtime, and fh in this operation. I suspect Linux FUSE only sets mode and fh and this is why the bug does not appear on Linux: the truncate code path is probably not involved. Can someone confirm? If this is the case, it suggests the code path may have never been tested. I suspect there are bugs there, for instance, in pl_truncate_cbk, local is erased after being retrieved, which does not look right:

local = frame->local;
local = mem_get0 (this->local_pool);
if (local->op == TRUNCATE)
        loc_wipe (&local->loc);

I tried fixing that one without much improvement. There may be other problems. About fchmod() setting size: is it a reasonable behavior? FUSE does not specify what must happen, so if glusterfs relies on the Linux kernel not doing it, it may be begging for future bugs if that behavior changes. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From vbellur at redhat.com Sun May 27 06:54:43 2012 From: vbellur at redhat.com (Vijay Bellur) Date: Sun, 27 May 2012 12:24:43 +0530 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org> References: <1kkqh7z.uvmz7na7peuaM%manu@netbsd.org> Message-ID: <4FC1CFB3.7050808@redhat.com> On 05/27/2012 10:47 AM, Emmanuel Dreyfus wrote: > Emmanuel Dreyfus wrote: > >> In frame 10, loc->path and loc->inode are NULL. > > > As I understand it, this is a FUSE implementation problem. fchmod() produces a > FUSE SETATTR.
If the file is being written, NetBSD FUSE will set mode, > size, atime, mtime, and fh in this operation. I suspect Linux FUSE only > sets mode and fh and this is why the bug does not appear on Linux: the > truncate code path is probably not involved. For the testcase that you sent out, I see fsi->valid being set to 1 which indicates only mode on Linux. The truncate path does not get involved. I modified the testcase to send ftruncate/truncate and it completed successfully. > > > Can someone confirm? If this is the case, it suggests the code path may > have never been tested. I suspect there are bugs there, for instance, in > pl_truncate_cbk, local is erased after being retreived, which does not > look right: > > local = frame->local; > > local = mem_get0 (this->local_pool); I don't see this in pl_truncate_cbk(). mem_get0 is done only in pl_truncate(). A code inspection in pl_(f)truncate did not raise any suspicions to me. > > > About fchmod() setting size: is it a reasonable behavior? FUSE does not > specify what must happens, so if glusterfs rely on the Linux kernel not > doing it may be begging for future bugs if that behavior change. > I am not sure why fchmod() should set size. Csaba, any thoughts on this? Vijay From manu at netbsd.org Sun May 27 07:34:02 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Sun, 27 May 2012 09:34:02 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <4FC1CFB3.7050808@redhat.com> Message-ID: <1kkqxw0.1smapd1jsih9iM%manu@netbsd.org> Vijay Bellur wrote: > For the testcase that you sent out, I see fsi->valid being set to 1 > which indicates only mode on Linux. The truncate path does not get > involved. I modified the testcase to send ftruncate/truncate and it > completed successfully. I modified by FUSE implementation to send FATTR_SIZE|FATTR_FH in one request, and FATTR_MODE|FATTR_FH|FATTR_MTIME|FATTR_ATIME in a separate one, and the test passes fine. On your test not raising the bug: Is it possible that Linux already sent a FATTR_SIZE|FATTR_FH when fchmod() is invoked, and that glusterfs discards a FATTR_SIZE that does not really resize? Did you try with supplying a bigger size? > > local = mem_get0 (this->local_pool); > I don't see this in pl_truncate_cbk(). mem_get0 is done only in > pl_truncate(). A code inspection in pl_(f)truncate did not raise any > suspicions to me. Right, this was an unfortunate copy/paste. However reverting to correct code does not fix the bug when FUSE sends FATTR_SIZE is set with FATTR_MODE at the same time. > I am not sure why fchmod() should set size. Csaba, any thoughts on this? This is an optimization. You have an open file, you just grew it and you change mode. The NetBSD kernel and its FUSE implementation do the two operations in a single FUSE request, because they are smart :-) I will commit the fix in NetBSD FUSE. But one day the Linux kernel could decide to use the same shortcut too. It may be wise to fix glusterfs so that it does not assume FATTR_SIZE is not sent with other metadata changes. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From anand.avati at gmail.com Sun May 27 21:40:35 2012 From: anand.avati at gmail.com (Anand Avati) Date: Sun, 27 May 2012 14:40:35 -0700 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: <20120525133519.GC19383@homeworld.netbsd.org> References: <20120525133519.GC19383@homeworld.netbsd.org> Message-ID: Can you give some more steps how you reproduced this? This has never happened in any of our testing. 
This might probably related to the dirname() differences in BSD? Have you noticed this after the GNU dirname usage? Avati On Fri, May 25, 2012 at 6:35 AM, Emmanuel Dreyfus wrote: > Hi > > Here is a bug with release-3.3. It happens on a 2 way replicated. Here is > what I have in one brick: > > [2012-05-25 15:03:09.463446] W [posix-handle.c:487:posix_handle_hard] > 0-pfs-posix: mismatching ino/dev between file > /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (57943061/16) > and handle > /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce > (57943060/16) > [2012-05-25 15:03:09.463552] E [posix.c:1277:posix_symlink] 0-pfs-posix: > setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 > failed > > > On the other one: > > [2012-05-25 15:03:09.447682] W [posix-handle.c:487:posix_handle_hard] > 0-pfs-posix: mismatching ino/dev between file > /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 (50557989/24) > and handle > /export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce > (50557988/24) > [2012-05-25 15:03:09.447774] E [posix.c:1277:posix_symlink] 0-pfs-posix: > setting gfid on /export/wd3a/manu/netbsd/usr/src/tools/host-mkdep/conf29276 > failed > > Someone can give me a hint of what happens, and how to track it down? > -- > Emmanuel Dreyfus > manu at netbsd.org > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From manu at netbsd.org Mon May 28 01:52:41 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 03:52:41 +0200 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: Message-ID: <1kkscxr.1k0ou1xcxcd7rM%manu@netbsd.org> Anand Avati wrote: > Can you give some more steps how you reproduced this? This has never > happened in any of our testing. This might probably related to the > dirname() differences in BSD? Have you noticed this after the GNU dirname > usage? I will investigate further. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 02:08:19 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 04:08:19 +0200 Subject: [Gluster-devel] NULL loc in posix_acl_truncate In-Reply-To: <1kkqxw0.1smapd1jsih9iM%manu@netbsd.org> Message-ID: <1kkscze.1y0ip7wj3y9uoM%manu@netbsd.org> Emmanuel Dreyfus wrote: > I modified by FUSE implementation to send FATTR_SIZE|FATTR_FH in one > request, and FATTR_MODE|FATTR_FH|FATTR_MTIME|FATTR_ATIME in a separate > one, and the test passes fine. Um, I spoke too fast. Please disreagard the previous post. The problem was not setting size, and mode in the same request. That works fine. The bug appear when setting size, atime and mtime. It also appear when setting mode, atime and mtime. So here is the summary so far: ATTR_SIZE|FATTR_FH -> ok ATTR_SIZE|FATTR_FH|FATTR_ATIME|FATTR_MTIME -> breaks (*) ATTR_MODE|FATTR_FH -> ok ATTR_MODE|FATTR_FH|FATTR_ATIME|FATTR_MTIME -> breaks ATTR_MODE|FATTR_SIZE|FATTR_FH -> ok (I was wrong here) (*) I noticed that one long time ago, and NetBSD FUSE already strips atime and mtime if ATTR_SIZE is set without ATTR_MODE|ATTR_UID|ATTR_GID. 
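The combinations in the summary above are just bits in the SETATTR valid mask, so a compact way to reason about them is to classify one request into the underlying operations it implies. The fragment below is a sketch only: the FATTR_* values restate the FUSE kernel protocol bits (as in fuse_kernel.h) so the example stands alone, and classify_setattr() is a hypothetical helper, not glusterfs or NetBSD FUSE code:

#include <stdint.h>

#define FATTR_MODE   (1 << 0)
#define FATTR_UID    (1 << 1)
#define FATTR_GID    (1 << 2)
#define FATTR_SIZE   (1 << 3)
#define FATTR_ATIME  (1 << 4)
#define FATTR_MTIME  (1 << 5)
#define FATTR_FH     (1 << 6)

/* one SETATTR request may imply a truncate, a chmod and a utimes */
static void
classify_setattr (uint32_t valid, int *want_truncate, int *want_chmod,
                  int *want_utimes)
{
        *want_truncate = !!(valid & FATTR_SIZE);
        *want_chmod    = !!(valid & FATTR_MODE);
        *want_utimes   = !!(valid & (FATTR_ATIME | FATTR_MTIME));
}

Read against the table above, it is the time bits combined with size or mode that trip the server side, which is why stripping atime and mtime from a size-only SETATTR, as NetBSD FUSE already does, works as a client-side workaround.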
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 05:07:46 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 07:07:46 +0200 Subject: [Gluster-devel] Testing server down in replicated volume Message-ID: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> Hi everybody After the last fix in NetBSD FUSE (cf NULL loc in posix_acl_truncate), glusterfs release-3.3 now behaves quite nicely on NetBSD. I have been able to build stuff in a replicated glusterfs volume for a few hours, and it seems much faster than 3.2.6. However things turn badly when I tried to kill glusterfsd on a server. Since the volume is replicated, I would have expected the build to carry on unaffected. but this is now what happens: a ENOTCONN is raised up to the processes using the glusterfs volume: In file included from /pfs/manu/netbsd/usr/src/sys/sys/signal.h:114, from /pfs/manu/netbsd/usr/src/sys/sys/param.h:150, from /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/net/__cmsg_align bytes.c:40: /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string /machine/signal.h: Socket is not connected Is it the intended behavior? Here is the client log: [2012-05-28 05:48:27.440017] W [socket.c:195:__socket_rwv] 0-pfs-client-1: writev failed (Broken pipe) [2012-05-28 05:48:27.440989] W [socket.c:195:__socket_rwv] 0-pfs-client-1: readv failed (Connection reset by peer) [2012-05-28 05:48:27.441496] W [socket.c:1512:__socket_proto_state_machine] 0-pfs-client-1: reading from socket failed. Error (Connection reset by peer), peer (193.54.82.98:24011) [2012-05-28 05:48:27.441825] E [rpc-clnt.c:373:saved_frames_unwind] 0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(INODELK(29)) called at 2012-05-28 05:48:27.439249 (xid=0x1715867x) [2012-05-28 05:48:27.442222] W [client3_1-fops.c:1495:client3_1_inodelk_cbk] 0-pfs-client-1: remote operation failed: Socket is not connected [2012-05-28 05:48:27.442528] E [rpc-clnt.c:373:saved_frames_unwind] 0-pfs-client-1: forced unwinding frame type(GlusterFS 3.1) op(SETATTR(38)) called at 2012-05-28 05:48:27.440397 (xid=0x1715868x) [2012-05-28 05:48:27.442971] W [client3_1-fops.c:1954:client3_1_setattr_cbk] 0-pfs-client-1: remote operation failed: Socket is not connected (and so on with other saved_frames_unwind) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Mon May 28 05:08:36 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Mon, 28 May 2012 07:08:36 +0200 Subject: [Gluster-devel] mismatching ino/dev between file In-Reply-To: <1kkscxr.1k0ou1xcxcd7rM%manu@netbsd.org> Message-ID: <1kksmhc.zfnn6i6bllp8M%manu@netbsd.org> Emmanuel Dreyfus wrote: > > Can you give some more steps how you reproduced this? This has never > > happened in any of our testing. This might probably related to the > > dirname() differences in BSD? Have you noticed this after the GNU dirname > > usage? > I will investigate further. It does not happen anymore. I think it was a consequence of the other bug I fixed. 
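For readers wondering what the "mismatching ino/dev" warning quoted earlier in this thread actually compares: it fires when a brick file and its handle under .glusterfs do not resolve to the same st_dev/st_ino pair, i.e. when they are not links to the same on-disk object as the posix handle code expects. A standalone sketch of that comparison follows; the two paths are purely illustrative:

#include <stdio.h>
#include <sys/stat.h>

/* returns 1 if the two paths name the same on-disk object,
 * 0 if they do not, -1 on error */
static int
same_object (const char *file, const char *handle)
{
        struct stat f, h;

        if (lstat (file, &f) == -1 || lstat (handle, &h) == -1)
                return -1;
        return (f.st_dev == h.st_dev && f.st_ino == h.st_ino);
}

int
main (void)
{
        /* hypothetical brick path and handle path, for illustration only */
        const char *file   = "/export/wd3a/somedir/somefile";
        const char *handle =
            "/export/wd3a/.glusterfs/0c/f3/0cf38737-4639-4112-8170-8720ae45d6ce";

        printf ("same object: %d\n", same_object (file, handle));
        return 0;
}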
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu at netbsd.org From manu at netbsd.org Tue May 29 07:55:09 2012 From: manu at netbsd.org (Emmanuel Dreyfus) Date: Tue, 29 May 2012 07:55:09 +0000 Subject: [Gluster-devel] Testing server down in replicated volume In-Reply-To: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> References: <1kkslvx.fbj1ua1gom7oyM%manu@netbsd.org> Message-ID: <20120529075509.GE19383@homeworld.netbsd.org> On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org From pkarampu at redhat.com Tue May 29 09:09:04 2012 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Tue, 29 May 2012 05:09:04 -0400 (EDT) Subject: [Gluster-devel] Testing server down in replicated volume In-Reply-To: <20120529075509.GE19383@homeworld.netbsd.org> Message-ID: <97e7abfe-e431-47b8-bb26-cf70adbef253@zmail01.collab.prod.int.phx2.redhat.com> I am looking into this. Will reply soon. Pranith ----- Original Message ----- From: "Emmanuel Dreyfus" To: "Emmanuel Dreyfus" Cc: gluster-devel at nongnu.org Sent: Tuesday, May 29, 2012 1:25:09 PM Subject: Re: [Gluster-devel] Testing server down in replicated volume On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From vijay at build.gluster.com Tue May 29 13:44:11 2012 From: vijay at build.gluster.com (Vijay Bellur) Date: Tue, 29 May 2012 06:44:11 -0700 (PDT) Subject: [Gluster-devel] glusterfs-3.3.0qa44 released Message-ID: <20120529134412.E8A3C100CB@build.gluster.com> http://bits.gluster.com/pub/gluster/glusterfs/3.3.0qa44/ http://bits.gluster.com/pub/gluster/glusterfs/src/glusterfs-3.3.0qa44.tar.gz This release is made off v3.3.0qa44 From pkarampu at redhat.com Tue May 29 17:28:32 2012 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Tue, 29 May 2012 13:28:32 -0400 (EDT) Subject: [Gluster-devel] Testing server down in replicated volume In-Reply-To: <20120529075509.GE19383@homeworld.netbsd.org> Message-ID: <4fb4ce32-9683-44cd-a7bd-aa935c79db29@zmail01.collab.prod.int.phx2.redhat.com> hi Emmanuel, I tried this for half an hour, everytime it failed because of readdir. It did not fail in any other fop. I saw that FINODELKs which relate to transactions in afr failed, but the fop succeeded on the other brick. I am not sure why a setattr (metadata transaction) is failing in your setup when a node is down. I will instrument the code to simulate the inodelk failure in setattr. Will update you tomorrow. Fop failing in readdir is also an issue that needs to be addressed. Pranith. 
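A toy model of the accounting Pranith describes, not AFR source: the question is whether a replicated fop should be reported as failed when one brick returns ENOTCONN but the other succeeds, and, with the quorum enforcement added in 3.3, how many successes are enough. The child_errno array, the function name and the quorum parameter are all assumptions made for the illustration:

/* returns 0 if the fop should be reported as a success to the client,
 * -1 if it should fail */
static int
replicated_fop_result (const int *child_errno, int child_count, int quorum)
{
        int ok = 0, i;

        for (i = 0; i < child_count; i++)
                if (child_errno[i] == 0)
                        ok++;

        if (ok == 0)
                return -1;    /* failed everywhere, e.g. all bricks down */
        if (quorum > 0 && ok < quorum)
                return -1;    /* quorum enforced and not met */
        return 0;             /* enough replicas succeeded */
}

By this model the setattr in Emmanuel's trace should have been reported as a success, since it did succeed on the surviving brick; the ENOTCONN leaking up to the application is what the instrumentation mentioned above is meant to explain.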
----- Original Message ----- From: "Emmanuel Dreyfus" To: "Emmanuel Dreyfus" Cc: gluster-devel at nongnu.org Sent: Tuesday, May 29, 2012 1:25:09 PM Subject: Re: [Gluster-devel] Testing server down in replicated volume On Mon, May 28, 2012 at 07:07:46AM +0200, Emmanuel Dreyfus wrote: [One server down in a replicated volume] > /pfs/manu/netbsd/usr/src/sys/sys/siginfo.h:35:54: error: > /pfs/manu/netbsd/usr/src/lib/libc/../../common/lib/libc/arch/i386/string > /machine/signal.h: Socket is not connected > > Is it the intended behavior? No reply? I would like to know if I have a NetBSD-specific bug to fix or if it is standard glusterfs behavior. -- Emmanuel Dreyfus manu at netbsd.org _______________________________________________ Gluster-devel mailing list Gluster-devel at nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel From bfoster at redhat.com Wed May 30 15:16:16 2012 From: bfoster at redhat.com (Brian Foster) Date: Wed, 30 May 2012 11:16:16 -0400 Subject: [Gluster-devel] glusterfs client and page cache Message-ID: <4FC639C0.6020503@redhat.com> Hi all, I've been playing with a little hack recently to add a gluster mount option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts on whether there's value to find an intelligent way to support this functionality. To provide some context: Our current behavior with regard to fuse is that page cache is utilized by fuse, from what I can tell, just about in the same manner as a typical local fs. The primary difference is that by default, the address space mapping for an inode is completely invalidated on open. So for example, if process A opens and reads a file in a loop, subsequent reads are served from cache (bypassing fuse and gluster). If process B steps in and opens the same file, the cache is flushed and the next reads from either process are passed down through fuse. The FOPEN_KEEP_CACHE option simply disables this cache flash on open behavior. The following are some notes on my experimentation thus far: - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size changes. This is a problem in that I can rewrite some or all of a file from another client and the cached client wouldn't notice. I've sent a patch to fuse-devel to also invalidate on mtime changes (similar to nfsv3 or cifs), so we'll see how well that is received. fuse also supports a range based invalidation notification that we could take advantage of if necessary. - I reproduce a measurable performance benefit in the local/cached read situation. For example, running a kernel compile against a source tree in a gluster volume (no other xlators and build output to local storage) improves to 6 minutes from just under 8 minutes with the default graph (9.5 minutes with only the client xlator and 1:09 locally). - Some of the specific differences from current io-cache caching: - io-cache supports time based invalidation and tunables such as cache size and priority. The page cache has no such controls. - io-cache invalidates more frequently on various fops. It also looks like we invalidate on writes and don't take advantage of the write data most recently sent, whereas page cache writes are cached (errors notwithstanding). - Page cache obviously has tighter integration with the system (i.e., drop_caches controls, more specific reporting, ability to drop cache when memory is needed). All in all, I'm curious what people think about enabling the cache behavior in gluster. 
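For context on where the FOPEN_KEEP_CACHE decision lives, here is a minimal fragment against the plain libfuse high-level API, not the glusterfs fuse-bridge, which speaks the kernel protocol directly; the filesystem name and the unconditional flag are assumptions for the example, whereas the proposal above would set it based on a mount option:

#define FUSE_USE_VERSION 26
#include <fuse.h>

/* open handler of a hypothetical FUSE filesystem */
static int
myfs_open (const char *path, struct fuse_file_info *fi)
{
        (void) path;

        /* FOPEN_KEEP_CACHE: tell the kernel to keep whatever page cache
         * it already holds for this inode instead of invalidating it on
         * every open */
        fi->keep_cache = 1;

        return 0;
}

static struct fuse_operations myfs_ops = {
        .open = myfs_open,
};

The complementary piece for the consistency concerns in the notes that follow is the reverse-notification path, fuse_lowlevel_notify_inval_inode() in libfuse 2.8 and later, which lets the filesystem daemon ask the kernel to purge a cached range when it learns the file changed elsewhere.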
We could support anything from the basic mount option I'm currently using (i.e., similar to attribute/dentry caching) to something integrated with io-cache (doing invalidations when necessary), or maybe even something eventually along the lines of the nfs weak cache consistency model where it validates the cache after every fop based on file attributes. In general, are there other big issues/questions that would need to be explored before this is useful (i.e., the size invalidation issue)? Are there other performance tests that should be explored? Thoughts appreciated. Thanks. Brian From fernando.frediani at qubenet.net Wed May 30 16:19:36 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Wed, 30 May 2012 16:19:36 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> Does anyone have an idea of this problem of not being able to power up the virtual machines on that NFS mount ? Also what do those logs mean that Anand say that there is a problem with the Repstr model. Is it something isn't finished yet ? Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 21:32 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. 
Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. 
> > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From anand.avati at gmail.com Wed May 30 19:32:50 2012 From: anand.avati at gmail.com (Anand Avati) Date: Wed, 30 May 2012 12:32:50 -0700 Subject: [Gluster-devel] glusterfs client and page cache In-Reply-To: <4FC639C0.6020503@redhat.com> References: <4FC639C0.6020503@redhat.com> Message-ID: Brian, You are right, today we hardly leverage the page cache in the kernel. When Gluster started and performance translators were implemented, the fuse invalidation support did not exist, and since that support was brought in upstream fuse we haven't leveraged that effectively. We can actually do a lot more smart things using the invalidation changes. For the consistency concerns where an open fd continues to refer to local page cache - if that is a problem, today you need to mount with --enable-direct-io-mode to bypass the page cache altogether (this is very different from O_DIRECT open() support). On the other hand, to utilize the fuse invalidation APIs and promote using the page cache and still be consistent, we need to gear up glusterfs framework by first implementing server originated messaging support, then build some kind of opportunistic locking or leases to notify glusterfs clients about modifications from a second client, and third implement hooks in the client side listener to do things like sending fuse invalidations or purge pages in io-cache or flush pending writes in write-behind etc. This needs to happen, but we're short on resources to prioritize this sooner :-) Avati On Wed, May 30, 2012 at 8:16 AM, Brian Foster wrote: > Hi all, > > I've been playing with a little hack recently to add a gluster mount > option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts > on whether there's value to find an intelligent way to support this > functionality. To provide some context: > > Our current behavior with regard to fuse is that page cache is utilized > by fuse, from what I can tell, just about in the same manner as a > typical local fs. The primary difference is that by default, the address > space mapping for an inode is completely invalidated on open. So for > example, if process A opens and reads a file in a loop, subsequent reads > are served from cache (bypassing fuse and gluster). If process B steps > in and opens the same file, the cache is flushed and the next reads from > either process are passed down through fuse. The FOPEN_KEEP_CACHE option > simply disables this cache flash on open behavior. 
> > The following are some notes on my experimentation thus far: > > - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size > changes. This is a problem in that I can rewrite some or all of a file > from another client and the cached client wouldn't notice. I've sent a > patch to fuse-devel to also invalidate on mtime changes (similar to > nfsv3 or cifs), so we'll see how well that is received. fuse also > supports a range based invalidation notification that we could take > advantage of if necessary. > > - I reproduce a measurable performance benefit in the local/cached read > situation. For example, running a kernel compile against a source tree > in a gluster volume (no other xlators and build output to local storage) > improves to 6 minutes from just under 8 minutes with the default graph > (9.5 minutes with only the client xlator and 1:09 locally). > > - Some of the specific differences from current io-cache caching: > - io-cache supports time based invalidation and tunables such as > cache > size and priority. The page cache has no such controls. > - io-cache invalidates more frequently on various fops. It also > looks > like we invalidate on writes and don't take advantage of the write data > most recently sent, whereas page cache writes are cached (errors > notwithstanding). > - Page cache obviously has tighter integration with the system > (i.e., > drop_caches controls, more specific reporting, ability to drop cache > when memory is needed). > > All in all, I'm curious what people think about enabling the cache > behavior in gluster. We could support anything from the basic mount > option I'm currently using (i.e., similar to attribute/dentry caching) > to something integrated with io-cache (doing invalidations when > necessary), or maybe even something eventually along the lines of the > nfs weak cache consistency model where it validates the cache after > every fop based on file attributes. > > In general, are there other big issues/questions that would need to be > explored before this is useful (i.e., the size invalidation issue)? Are > there other performance tests that should be explored? Thoughts > appreciated. Thanks. > > Brian > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bfoster at redhat.com Wed May 30 23:10:58 2012 From: bfoster at redhat.com (Brian Foster) Date: Wed, 30 May 2012 19:10:58 -0400 Subject: [Gluster-devel] glusterfs client and page cache In-Reply-To: References: <4FC639C0.6020503@redhat.com> Message-ID: <4FC6A902.9010406@redhat.com> On 05/30/2012 03:32 PM, Anand Avati wrote: > Brian, > You are right, today we hardly leverage the page cache in the kernel. > When Gluster started and performance translators were implemented, the > fuse invalidation support did not exist, and since that support was > brought in upstream fuse we haven't leveraged that effectively. We can > actually do a lot more smart things using the invalidation changes. > > For the consistency concerns where an open fd continues to refer to > local page cache - if that is a problem, today you need to mount with > --enable-direct-io-mode to bypass the page cache altogether (this is > very different from O_DIRECT open() support). 
On the other hand, to > utilize the fuse invalidation APIs and promote using the page cache and > still be consistent, we need to gear up glusterfs framework by first > implementing server originated messaging support, then build some kind > of opportunistic locking or leases to notify glusterfs clients about > modifications from a second client, and third implement hooks in the > client side listener to do things like sending fuse invalidations or > purge pages in io-cache or flush pending writes in write-behind etc. > This needs to happen, but we're short on resources to prioritize this > sooner :-) > Thanks for the context Avati. The fuse patch I sent lead to a similar thought process with regard to finer grained invalidation. So far it seems well received, and as I understand it, we can also utilize that mechanism to do full invalidations from gluster on older fuse modules that wouldn't have that fix. I'll look into incorporating that into what I have so far and making it available for review. Brian > Avati > > On Wed, May 30, 2012 at 8:16 AM, Brian Foster > wrote: > > Hi all, > > I've been playing with a little hack recently to add a gluster mount > option to support FOPEN_KEEP_CACHE and I wanted to solicit some thoughts > on whether there's value to find an intelligent way to support this > functionality. To provide some context: > > Our current behavior with regard to fuse is that page cache is utilized > by fuse, from what I can tell, just about in the same manner as a > typical local fs. The primary difference is that by default, the address > space mapping for an inode is completely invalidated on open. So for > example, if process A opens and reads a file in a loop, subsequent reads > are served from cache (bypassing fuse and gluster). If process B steps > in and opens the same file, the cache is flushed and the next reads from > either process are passed down through fuse. The FOPEN_KEEP_CACHE option > simply disables this cache flash on open behavior. > > The following are some notes on my experimentation thus far: > > - With FOPEN_KEEP_CACHE, fuse currently only invalidates on file size > changes. This is a problem in that I can rewrite some or all of a file > from another client and the cached client wouldn't notice. I've sent a > patch to fuse-devel to also invalidate on mtime changes (similar to > nfsv3 or cifs), so we'll see how well that is received. fuse also > supports a range based invalidation notification that we could take > advantage of if necessary. > > - I reproduce a measurable performance benefit in the local/cached read > situation. For example, running a kernel compile against a source tree > in a gluster volume (no other xlators and build output to local storage) > improves to 6 minutes from just under 8 minutes with the default graph > (9.5 minutes with only the client xlator and 1:09 locally). > > - Some of the specific differences from current io-cache caching: > - io-cache supports time based invalidation and tunables such > as cache > size and priority. The page cache has no such controls. > - io-cache invalidates more frequently on various fops. It > also looks > like we invalidate on writes and don't take advantage of the write data > most recently sent, whereas page cache writes are cached (errors > notwithstanding). > - Page cache obviously has tighter integration with the > system (i.e., > drop_caches controls, more specific reporting, ability to drop cache > when memory is needed). 
> > All in all, I'm curious what people think about enabling the cache > behavior in gluster. We could support anything from the basic mount > option I'm currently using (i.e., similar to attribute/dentry caching) > to something integrated with io-cache (doing invalidations when > necessary), or maybe even something eventually along the lines of the > nfs weak cache consistency model where it validates the cache after > every fop based on file attributes. > > In general, are there other big issues/questions that would need to be > explored before this is useful (i.e., the size invalidation issue)? Are > there other performance tests that should be explored? Thoughts > appreciated. Thanks. > > Brian > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at nongnu.org > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > From johnmark at redhat.com Thu May 31 16:33:20 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 31 May 2012 12:33:20 -0400 (EDT) Subject: [Gluster-devel] A very special announcement from Gluster.org In-Reply-To: <344ab6e5-d6de-48d9-bfe8-e2727af7b45e@zmail01.collab.prod.int.phx2.redhat.com> Message-ID: <660ccad1-e191-405c-8645-1cb2fb02f80c@zmail01.collab.prod.int.phx2.redhat.com> Today, we?re announcing the next generation of GlusterFS , version 3.3. The release has been a year in the making and marks several firsts: the first post-acquisition release under Red Hat, our first major act as an openly-governed project and our first foray beyond NAS. We?ve also taken our first steps towards merging big data and unstructured data storage, giving users and developers new ways of managing their data scalability challenges. GlusterFS is an open source, fully distributed storage solution for the world?s ever-increasing volume of unstructured data. It is a software-only, highly available, scale-out, centrally managed storage pool that can be backed by POSIX filesystems that support extended attributes, such as Ext3/4, XFS, BTRFS and many more. This release provides many of the most commonly requested features including proactive self-healing, quorum enforcement, and granular locking for self-healing, as well as many additional bug fixes and enhancements. Some of the more noteworthy features include: ? Unified File and Object storage ? Blending OpenStack?s Object Storage API with GlusterFS provides simultaneous read and write access to data as files or as objects. ? HDFS compatibility ? Gives Hadoop administrators the ability to run MapReduce jobs on unstructured data on GlusterFS and access the data with well-known tools and shell scripts. ? Proactive self-healing ? GlusterFS volumes will now automatically restore file integrity after a replica recovers from failure. ? Granular locking ? Allows large files to be accessed even during self-healing, a feature that is particularly important for VM images. ? Replication improvements ? With quorum enforcement you can be confident that your data has been written in at least the configured number of places before the file operation returns, allowing a user-configurable adjustment to fault tolerance vs performance. Visit http://www.gluster.org to download. Packages are available for most distributions, including Fedora, Debian, RHEL, Ubuntu and CentOS. Get involved! Join us on #gluster on freenode, join our mailing list , ?like? our Facebook page , follow us on Twitter , or check out our LinkedIn group . 
GlusterFS is an open source project sponsored by Red Hat ?, who uses it in its line of Red Hat Storage products. (this post published at http://www.gluster.org/2012/05/introducing-glusterfs-3-3/ ) -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.frediani at qubenet.net Thu May 31 16:36:36 2012 From: fernando.frediani at qubenet.net (Fernando Frediani (Qube)) Date: Thu, 31 May 2012 16:36:36 +0000 Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> References: <6EC7489C49252F4F823EAE91E3A9393931F743EF@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F744FA@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F75854@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F758CE@QUBE-TR2-EXC01.qube.qubenet.net> <6EC7489C49252F4F823EAE91E3A9393931F8BF93@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <6EC7489C49252F4F823EAE91E3A9393931F8DC08@QUBE-TR2-EXC01.qube.qubenet.net> What is happening with this ? Non one actually care to take ownership about this ? If this is a bug why nobody is interested to get it fixed ? If not someone speak up please. Two things are not working as they supposed, I am reporting back and nobody seems to give a dam about it. -----Original Message----- From: Fernando Frediani (Qube) Sent: 30 May 2012 17:20 To: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Does anyone have an idea of this problem of not being able to power up the virtual machines on that NFS mount ? Also what do those logs mean that Anand say that there is a problem with the Repstr model. Is it something isn't finished yet ? Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 21:32 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Anand, Further to that I managed to mount the Datastore and deploy machines there, but when trying to power them On I get an error as if it couldn't find a file. Has anyone seen these kind of error before ? I would say that it could be a lock problem, but it doesn't seem to. Permissions maybe ? Or the way the NFS is exported ? (root_squash, no_root_squash, etc) Here is the log: An unexpected error was received from the ESX host while powering on VM vm-21112. Failed to power on VM. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Unable to retrieve the current working directory: 0 (No such file or directory). Check if the directory has been deleted or unmounted. Regards, Fernando -----Original Message----- From: Fernando Frediani (Qube) Sent: 25 May 2012 20:34 To: 'Anand Babu Periasamy' Cc: 'gluster-devel at nongnu.org' Subject: RE: [Gluster-devel] Can't use NFS with VMware ESXi Hi Anand, Thanks for that . It actually worked using Distributed+Replicated. 
However the 2 main reasons I am testing version 3.3 is first and mainly because of the Granular Locking therefore suited to run VMs and also I found that using Repstr(Replicated + Striped (+ distributed)) for VMDK files as they are normally large it was going to distribute it in many chunks across several bricks increasing both read and write performance when accessing it as that would spread the IOPS too all bricks and disks containing the chunks of the file. Also if I understand correctly, if a VM that has a massive VMDK file (2TB for example) using this new volume type it wouldn't be stored into a single brick preventing it to get unbalanced on the amount of free space compared to the others. Am I right on my assumptions ? Also with regards the problem I've reported below what do you think it could be and how to get that working ? I wanted afterwards to make a performance comparison between both volume types. Thanks Regards, Fernando -----Original Message----- From: Anand Babu Periasamy [mailto:abperiasamy at gmail.com] Sent: 25 May 2012 18:09 To: Fernando Frediani (Qube) Cc: gluster-devel at nongnu.org Subject: Re: [Gluster-devel] Can't use NFS with VMware ESXi On Fri, May 25, 2012 at 4:36 AM, Fernando Frediani (Qube) wrote: > Actually, even on another Linux machine mounting NFS has the same behaviour. > I am able to mount it with "mount -t nfs ." but when I try "ls" it > hangs as well. > > One particular thing of the Gluster servers is that they have two > networks, one for management with default gateway and another only for > storage. I am only able to mount on the storage network. > > The hosts file has all nodes' names with the ips on the storage network. > > > > I tried to use this but didn't work either. > > gluster volume set VOLUME nfs.rpc-auth-allow 10.10.100.* > > > > Watching the nfs logs when I try a "ls" from the remote client it shows: > > > > pending frames: > > > > patchset: git://git.gluster.com/glusterfs.git > > signal received: 11 > > time of crash: 2012-05-25 11:38:09 > > configuration details: > > argp 1 > > backtrace 1 > > dlfcn 1 > > fdatasync 1 > > libpthread 1 > > llistxattr 1 > > setfsid 1 > > spinlock 1 > > epoll.h 1 > > xattr.h 1 > > st_atim.tv_nsec 1 > > package-string: glusterfs 3.3.0beta4 > > /lib64/libc.so.6(+0x32900)[0x7f1c92d92900] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_lookup > +0xa5)[0x7f1c8e7a6ac5] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/stripe.so(stripe_readdi > rp_cbk+0x536)[0x7f1c8e543346] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/cluster/replicate.so(afr_readdi > rp_cbk+0x1ca)[0x7f1c8e76269a] > > /usr/lib64/glusterfs/3.3.0beta4/xlator/protocol/client.so(client3_1_re > addirp_cbk+0x170)[0x7f1c8e9dbbe0] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x7f1c9388b302] > > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb6)[0x7f1c9388b516] > > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7f1c93886e17] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_p > oll_in+0x3f)[0x7f1c8f818c8f] > > /usr/lib64/glusterfs/3.3.0beta4/rpc-transport/socket.so(socket_event_h > andler+0x188)[0x7f1c8f818e38] > > /usr/lib64/libglusterfs.so.0(+0x3eb51)[0x7f1c93ad0b51] > > /usr/sbin/glusterfs(main+0x502)[0x406612] > > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1c92d7ecdd] > > /usr/sbin/glusterfs[0x404399] > > > > Thanks > > > Fernando > > > > From: Fernando Frediani (Qube) > Sent: 25 May 2012 10:44 > To: 'gluster-devel at nongnu.org' > Subject: Can't use NFS with VMware ESXi > > > > Hi, > > > > 
I've setup a Gluster environment using CentOS 6.2 and GlusterFS 3.3 > beta4 and the new type of volume striped + replicated. My go is to use > it to run Virtual Machines (.vmdk files). > > > > Volume is created fine and the ESXi server mountw the Datastore using > Gluster ?built-in NFS, however ?when trying to use the Datastore or > even read, it hangs. > > > > Looking at the Gluster NFS logs I see: ???"[socket.c:195:__socket_rwv] > 0-socket.nfs-server: readv failed (Connection reset by peer)" > > > > In order to get the rpm files installed I had first to install these > two because of the some libraries: > "compat-readline5-5.2-17.1.el6.x86_64".rpm > and "openssl098e-0.9.8e-17.el6.centos.x86_64.rpm".Not sure if it has > anything to do with that. > > > > Has anyone ever used Gluster as a backend storage for ESXi ? Does it > actually work ? > > > > Regards, > > > > Fernando Frediani > Lead Systems Engineer > > Qube Managed Services Limited > 260-266 Goswell Road, London, EC1V 7EB, United Kingdom Hi Fernando, can you please try distributed+replicated. I won't recommend replicated-stripe for VM environment. Stripe was largely developed for HPC pre and post processing jobs (large number of clients reading / writing same file). In any case, this looks like a bug in replicated-stripe. -- Anand Babu Periasamy Blog [http://www.unlocksmith.org] Imagination is more important than knowledge --Albert Einstein From johnmark at redhat.com Thu May 31 16:48:45 2012 From: johnmark at redhat.com (John Mark Walker) Date: Thu, 31 May 2012 12:48:45 -0400 (EDT) Subject: [Gluster-devel] Can't use NFS with VMware ESXi In-Reply-To: <6EC7489C49252F4F823EAE91E3A9393931F8DC08@QUBE-TR2-EXC01.qube.qubenet.net> Message-ID: <59507de0-4264-4e27-ac94-c9b34890a5f4@zmail01.collab.prod.int.phx2.redhat.com> ----- Original Message ----- > What is happening with this ? > Non one actually care to take ownership about this ? > If this is a bug why nobody is interested to get it fixed ? If not > someone speak up please. > Two things are not working as they supposed, I am reporting back and > nobody seems to give a dam about it. Hi Fernando, If nobody is replying, it's because they don't have experience with your particular setup, or they've never seen this problem before. If you feel it's a bug, then please file a bug at http://bugzilla.redhat.com/ You can also ask questions on the IRC channel: #gluster Or on http://community.gluster.org/ I know it can be frustrating, but please understand that you will get a response only if someone out there has experience with your problem. Thanks, John Mark Community guy