<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On 15 September 2016 at 17:21, Raghavendra Gowdappa <span dir="ltr"><<a href="mailto:rgowdapp@redhat.com" target="_blank">rgowdapp@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>
<br>
----- Original Message -----<br>
> From: "Xavier Hernandez" <<a href="mailto:xhernandez@datalab.es">xhernandez@datalab.es</a>><br>
> To: "Raghavendra G" <<a href="mailto:raghavendra@gluster.com">raghavendra@gluster.com</a>>, "Nithya Balachandran" <<a href="mailto:nbalacha@redhat.com">nbalacha@redhat.com</a>><br>
> Cc: "Gluster Devel" <<a href="mailto:gluster-devel@gluster.org">gluster-devel@gluster.org</a>>, "Mohit Agrawal" <<a href="mailto:moagrawa@redhat.com">moagrawa@redhat.com</a>><br>
> Sent: Thursday, September 15, 2016 4:54:25 PM<br>
> Subject: Re: [Gluster-devel] Query regards to heal xattr heal in dht<br>
><br>
><br>
><br>
> On 15/09/16 11:31, Raghavendra G wrote:<br>
> ><br>
> ><br>
> > On Thu, Sep 15, 2016 at 12:02 PM, Nithya Balachandran<br>
> > <<a href="mailto:nbalacha@redhat.com">nbalacha@redhat.com</a> <mailto:<a href="mailto:nbalacha@redhat.com">nbalacha@redhat.com</a>>> wrote:<br>
> ><br>
> ><br>
> ><br>
> > On 8 September 2016 at 12:02, Mohit Agrawal <<a href="mailto:moagrawa@redhat.com">moagrawa@redhat.com</a><br>
> > <mailto:<a href="mailto:moagrawa@redhat.com">moagrawa@redhat.com</a>>> wrote:<br>
> ><br>
> > Hi All,<br>
> ><br>
> >                I have another solution to heal user xattrs, but before<br>
> >     implementing it I would like to discuss it with you.<br>
> ><br>
> >     Can I call dht_dir_xattr_heal (which internally calls<br>
> >     syncop_setxattr) to heal the xattrs at the end of dht_getxattr_cbk,<br>
> >     after making sure we have a valid xattr?<br>
> >     Inside dht_dir_xattr_heal I can either blindly copy all user<br>
> >     xattrs to every subvolume, or compare each subvol's xattrs with the<br>
> >     valid xattrs and call syncop_setxattr only when there is a<br>
> >     mismatch, skipping the call otherwise.<br>
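<br>
A minimal Python sketch of the compare-before-heal idea above (the subvol objects and their getxattr()/setxattr() methods are hypothetical stand-ins, not the real dht/syncop API):<br>
<pre>
def heal_user_xattrs(valid_xattrs, subvols):
    # Keep only the user.* keys from the known-good set of xattrs.
    source = {k: v for k, v in valid_xattrs.items() if k.startswith("user.")}
    for subvol in subvols:
        current = subvol.getxattr()
        mine = {k: v for k, v in current.items() if k.startswith("user.")}
        if mine != source:
            # Wind setxattr only when this subvol actually disagrees.
            subvol.setxattr(source)
</pre>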
> ><br>
> ><br>
> ><br>
> > This can be problematic if a particular xattr is being removed - it<br>
> > might still exist on some subvols. IIUC, the heal would go and reset<br>
> > it again?<br>
> ><br>
> > One option is to use the hash subvol for the dir as the source - so<br>
> > perform xattr op on hashed subvol first and on the others only if it<br>
> > succeeds on the hashed. This does have the problem of being unable<br>
> > to set xattrs if the hashed subvol is unavailable. This might not be<br>
> > such a big deal in case of distributed replicate or distribute<br>
> > disperse volumes but will affect pure distribute. However, this way<br>
> > we can at least be reasonably certain of the correctness (leaving<br>
> > rebalance out of the picture).<br>
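<br>
A rough Python sketch of the hashed-subvol-first ordering described above (is_up() and setxattr() are hypothetical stand-ins for the real subvol calls):<br>
<pre>
def setxattr_hashed_first(hashed, others, key, value):
    # The hashed subvol is the source of truth: if the op cannot be applied
    # there, fail the whole operation instead of creating divergence.
    if not hashed.is_up():
        raise OSError("hashed subvol is down, refusing setxattr")
    hashed.setxattr(key, value)
    for subvol in others:
        try:
            subvol.setxattr(key, value)   # best effort; heal can fix stragglers
        except OSError:
            pass
</pre>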
> ><br>
> ><br>
> > * What is the behavior of getxattr when the hashed subvol is down? Should we<br>
> > succeed with values from non-hashed subvols or should we fail getxattr?<br>
> > With the hashed-subvol as the source of truth, it's difficult to determine the<br>
> > correctness of xattrs and their values while it is down.<br>
> ><br>
> > * setxattr is an inode operation (as opposed to an entry operation). So we<br>
> > cannot calculate the hashed-subvol, since in (get)(set)xattr the parent layout<br>
> > and "basename" are not available. This forces us to store the hashed subvol in<br>
> > the inode-ctx. Now, when the hashed-subvol changes, we need to update these<br>
> > inode-ctxs too.<br>
> ><br>
> > What do you think about a quorum-based solution to this problem?<br>
> ><br>
> > 1. setxattr succeeds only if it is successful on at least (n/2 + 1)<br>
> > subvols.<br>
> > 2. getxattr succeeds only if it is successful, and the values match, on at<br>
> > least (n/2 + 1) subvols.<br>
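<br>
A quick Python sketch of what the quorum check for getxattr could look like (the subvol objects are hypothetical):<br>
<pre>
from collections import Counter

def getxattr_with_quorum(subvols, key):
    quorum = len(subvols) // 2 + 1
    values = []
    for subvol in subvols:
        try:
            values.append(subvol.getxattr(key))
        except OSError:
            continue                  # subvol down or key missing
    if values:
        value, votes = Counter(values).most_common(1)[0]
        if votes >= quorum:
            return value
    raise OSError("no quorum for xattr %s" % key)
</pre>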
> ><br>
> > The flip-side of this solution is that we increase the probability of<br>
> > failure of (get)(set)xattr operations compared to the hashed-subvol-as-<br>
> > source-of-truth solution. Or do we - how do we compare the probability of<br>
> > the hashed-subvol going down with the probability of (n/2 + 1) subvols going<br>
> > down simultaneously? Is it 1/n vs (1/n * 1/n * ..., n/2 + 1 times)? And is 1/n<br>
> > the correct probability for _a specific subvol (the hashed-subvol)_ going down<br>
> > (as opposed to _any one subvol_ going down)?<br>
><br>
> If we suppose p to be the probability of failure of a subvolume in a<br>
> period of time (a year for example), all subvolumes have the same<br>
> probability, and we have N subvolumes, then:<br>
><br>
> Probability of failure of hashed-subvol: p<br>
> Probability of failure of N/2 + 1 or more subvols: [attached as an image]<br>
<br>
</div></div>Thanks Xavi. That was quick :).<br>
<span class=""><br>
><br>
> Note that this probability says how probable it is that N/2 + 1<br>
> subvols or more fail in the specified period of time, but not<br>
> necessarily simultaneously. If we assume that subvolumes are recovered<br>
> as fast as possible, the real probability of simultaneous failure will<br>
> be much smaller.<br>
><br>
> In the worst case (failed subvolumes not being recovered within the given period<br>
> of time), if p < 0.5 or N = 2 (and p != 1), then it's always better to<br>
> check N/2 + 1 subvolumes. Otherwise, it's better to check the hashed-subvol.<br>
><br>
> I think p should always be much smaller than 0.5 for short periods<br>
> of time in which subvolume recovery could not be completed before other<br>
> failures, so checking half plus one subvols should always be the better<br>
> option in terms of probability. Performance can suffer, though, if some<br>
> kind of synchronization is needed.<br>
<br>
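To make the comparison concrete, a small Python snippet computing the two probabilities under the stated assumptions (independent subvol failures, each with probability p); the value printed next to p is the binomial tail for N/2 + 1 or more of the N subvols failing, which stays well below p as long as p is below 0.5:<br>
<pre>
from math import comb

def p_majority_fails(n, p):
    # Probability that at least n // 2 + 1 of n independent subvols fail,
    # each with failure probability p, within the same period.
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

for n in (2, 4, 8):
    for p in (0.01, 0.1, 0.4):
        # The hashed-subvol approach fails with probability p; print both.
        print(n, p, p_majority_fails(n, p))
</pre>
<br>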
</span>For this problem, no synchronization is needed, though we do need to wind the (get)(set)xattr call to all subvols. What I haven't thought through is rollback/rollforward during setxattr if the op fails on more than a quorum of subvols. One problem with the rollback approach is that we may never get a chance to roll back at all, and it is not clear how we handle racing setxattrs on the same key from different clients/apps (coupled with rollback, etc.). I need to think more about this.<br>
<div class="HOEnZb"><div class="h5"><br></div></div></blockquote><div><br></div><div>A quorum will make it difficult to figure out the correct value in case of in flight modifications/deletions How do you decide the correct value ? This can potentially cause in progress modifications /deletions to be overwritten and to fail silently. The single point of truth (hashed subvol) helps here. Else we will need to bring in some synchronization here.</div><div><br></div><div>I have not looked at the patch yet but how does it currently handle xattr deletes that failed on a subvol?</div><div><br></div><div>A hashed subvol being unavailable should hopefully not be very common esp if there is AFR or EC loaded. A pure distribute volume is expected to have some data unavailability if bricks are unavailable.</div><div><br></div><div>The one case where this could cause major issues as Shyam pointed out is in the case of ACLs /SE linux where the information is stored in xattrs. Something to think about.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">
><br>
> Xavi<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > Let me know if this approach is suitable.<br>
> ><br>
> ><br>
> ><br>
> > Regards<br>
> > Mohit Agrawal<br>
> ><br>
> > On Wed, Sep 7, 2016 at 10:27 PM, Pranith Kumar Karampuri<br>
> > <<a href="mailto:pkarampu@redhat.com">pkarampu@redhat.com</a> <mailto:<a href="mailto:pkarampu@redhat.com">pkarampu@redhat.com</a>>> wrote:<br>
> ><br>
> ><br>
> ><br>
> > On Wed, Sep 7, 2016 at 9:46 PM, Mohit Agrawal<br>
> > <<a href="mailto:moagrawa@redhat.com">moagrawa@redhat.com</a> <mailto:<a href="mailto:moagrawa@redhat.com">moagrawa@redhat.com</a>>> wrote:<br>
> ><br>
> > Hi Pranith,<br>
> ><br>
> ><br>
> >                         In the current approach I get the list of xattrs from<br>
> >                         the first up subvolume and update the user attributes from<br>
> >                         that list on<br>
> >                         all the other subvolumes.<br>
> ><br>
> >                         I have assumed the first up subvol is the source and the rest of<br>
> >                         them are sinks, as we do the same in dht_dir_attr_heal.<br>
> ><br>
> ><br>
> >                     I think the first up subvol may be different for different mounts,<br>
> >                     as per my understanding; I could be wrong.<br>
> ><br>
> ><br>
> ><br>
> > Regards<br>
> > Mohit Agrawal<br>
> ><br>
> > On Wed, Sep 7, 2016 at 9:34 PM, Pranith Kumar Karampuri<br>
> > <<a href="mailto:pkarampu@redhat.com">pkarampu@redhat.com</a> <mailto:<a href="mailto:pkarampu@redhat.com">pkarampu@redhat.com</a>>> wrote:<br>
> ><br>
> >                         Hi Mohit,<br>
> >                                How does dht find which subvolume has the<br>
> >                         correct list of xattrs? I.e., how does it determine<br>
> >                         which subvolume is the source and which is the sink?<br>
> ><br>
> > On Wed, Sep 7, 2016 at 2:35 PM, Mohit Agrawal<br>
> > <<a href="mailto:moagrawa@redhat.com">moagrawa@redhat.com</a> <mailto:<a href="mailto:moagrawa@redhat.com">moagrawa@redhat.com</a>>><br>
> > wrote:<br>
> ><br>
> > Hi,<br>
> ><br>
> >                         I am trying to find a solution to a<br>
> >                         problem in dht specific to user xattr healing.<br>
> >                         I tried to correct it in the same way as we<br>
> >                         do for healing dir attributes, but I feel it is<br>
> >                         not the best solution.<br>
> ><br>
> >                         To find the right way to heal xattrs I want to<br>
> >                         discuss it with you, in case anyone has a better<br>
> >                         solution.<br>
> ><br>
> >                         Problem:<br>
> >                         In a distributed volume, a custom<br>
> >                         extended attribute set on a directory does<br>
> >                         not show the correct value after a stop/start of a<br>
> >                         brick. If an extended attribute value is set<br>
> >                         on a directory while a brick is stopped, the<br>
> >                         attribute value is not updated on that brick after<br>
> >                         it is started again.<br>
> ><br>
> >                         Current approach:<br>
> >                         1) A function set_user_xattr stores the user<br>
> >                         extended attributes in a dictionary.<br>
> >                         2) The function dht_dir_xattr_heal calls<br>
> >                         syncop_setxattr to update the attributes on all<br>
> >                         subvolumes.<br>
> >                         3) The function (dht_dir_xattr_heal) is called<br>
> >                         for every directory lookup in<br>
> >                         dht_lookup_revalidate_cbk.<br>
> ><br>
> >                         Pseudocode for the function dht_dir_xattr_heal is<br>
> >                         as below:<br>
> ><br>
> >                         1) First it fetches the attributes from the first<br>
> >                         up subvolume and stores them into xattr.<br>
> >                         2) Loop over all subvolumes and fetch the<br>
> >                         existing attributes from each one.<br>
> >                         3) Replace the user attributes in the current<br>
> >                         attributes with the user attributes from xattr.<br>
> >                         4) Set the resulting extended attributes (current +<br>
> >                         source user attributes) on the subvol.<br>
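<br>
The same four steps written as a small Python sketch (hypothetical subvol objects, not the actual dht code):<br>
<pre>
def dht_dir_xattr_heal(subvols):
    source = next(s for s in subvols if s.is_up())   # step 1: first up subvol
    user_xattrs = {k: v for k, v in source.getxattr().items()
                   if k.startswith("user.")}
    for subvol in subvols:
        current = subvol.getxattr()                  # step 2: existing xattrs
        merged = dict(current)
        merged.update(user_xattrs)                   # step 3: overlay user.* keys
        subvol.setxattr(merged)                      # step 4: unconditional setxattr
</pre>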
> ><br>
> ><br>
> >                         The problems with this current approach are:<br>
> ><br>
> >                         1) It calls the heal<br>
> >                         function (dht_dir_xattr_heal) for every directory<br>
> >                         lookup, without comparing xattrs.<br>
> >                         2) The function internally calls syncop_setxattr<br>
> >                         for every subvolume, which is an expensive<br>
> >                         operation.<br>
> ><br>
> >                         I have another way to correct it, as below, but<br>
> >                         this one has a dependency on time (I am not sure<br>
> >                         whether time is in sync on all bricks or<br>
> >                         not):<br>
> ><br>
> >                         1) At the time of setting an extended<br>
> >                         attribute (setxattr), update the change time in the<br>
> >                         metadata on the server side.<br>
> >                         2) Compare the change time before calling the healing<br>
> >                         function in dht_revalidate_cbk.<br>
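<br>
A very rough Python sketch of the change-time comparison (stat_dir() and the cached ctime are hypothetical):<br>
<pre>
def needs_xattr_heal(cached_ctime, subvols):
    # Heal only when some subvol reports a directory change time that differs
    # from the one cached when the xattrs were last known to be in sync.
    return any(subvol.stat_dir().ctime != cached_ctime for subvol in subvols)
</pre>
This would keep the heal off the common lookup path, but, as noted above, it relies on the bricks' clocks being in sync.<br>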
> ><br>
> >                         Please share your input on this; it would be much appreciated.<br>
> ><br>
> > Regards<br>
> > Mohit Agrawal<br>
> ><br>
> >             _______________________________________________<br>
> >             Gluster-devel mailing list<br>
> >             <a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>
> >             <mailto:<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a>><br>
> >             <a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a><br>
> >             <<a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a>><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > --<br>
> > Pranith<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > --<br>
> > Pranith<br>
> ><br>
> ><br>
> ><br>
> > _______________________________________________<br>
> > Gluster-devel mailing list<br>
> > <a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a> <mailto:<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a>><br>
> > <a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a><br>
> > <<a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a>><br>
> ><br>
> ><br>
> ><br>
> > _______________________________________________<br>
> > Gluster-devel mailing list<br>
> > <a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a> <mailto:<a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a>><br>
> > <a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a><br>
> > <<a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a>><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > --<br>
> > Raghavendra G<br>
> ><br>
> ><br>
> > _______________________________________________<br>
> > Gluster-devel mailing list<br>
> > <a href="mailto:Gluster-devel@gluster.org">Gluster-devel@gluster.org</a><br>
> > <a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a><br>
> ><br>
><br>
</div></div></blockquote></div><br></div></div>