<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 8, 2016 at 4:31 PM, Soumya Koduri <span dir="ltr">&lt;<a href="mailto:skoduri@redhat.com" target="_blank">skoduri@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>

<br>

On 02/08/2016 09:13 AM, Shyam wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

----- Original Message -----<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

From: &quot;Raghavendra Gowdappa&quot; &lt;<a href="mailto:rgowdapp@redhat.com" target="_blank">rgowdapp@redhat.com</a>&gt;<br>

To: &quot;Sakshi Bansal&quot; &lt;<a href="mailto:sabansal@redhat.com" target="_blank">sabansal@redhat.com</a>&gt;, &quot;Susant Palai&quot;<br>

&lt;<a href="mailto:spalai@redhat.com" target="_blank">spalai@redhat.com</a>&gt;<br>

Cc: &quot;Gluster Devel&quot; &lt;<a href="mailto:gluster-devel@gluster.org" target="_blank">gluster-devel@gluster.org</a>&gt;, &quot;Nithya<br>

Balachandran&quot; &lt;<a href="mailto:nbalacha@redhat.com" target="_blank">nbalacha@redhat.com</a>&gt;, &quot;Shyamsundar<br>

Ranganathan&quot; &lt;<a href="mailto:srangana@redhat.com" target="_blank">srangana@redhat.com</a>&gt;<br>

Sent: Friday, February 5, 2016 4:32:40 PM<br>

Subject: Re: Rebalance data migration and corruption<br>

<br>

+gluster-devel<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Hi Sakshi/Susant,<br>

<br>

- There is a data corruption issue in migration code. Rebalance<br>

process,<br>

   1. Reads data from src<br>

   2. Writes (say w1) it to dst<br>

<br>

   However, 1 and 2 are not atomic, so another write (say w2) to the<br>

   same region can happen between 1 and 2. These two writes can then<br>

   reach dst in the order (w2, w1), so the stale w1 overwrites w2 on<br>

   dst. This issue is not fixed yet and can cause subtle data<br>

   corruption. The fix is simple: the rebalance process acquires a<br>

   mandatory lock to make 1 and 2 atomic.<br>
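
<br>

A minimal sketch of the (w2, w1) ordering described above. The dicts src and dst are hypothetical in-memory stand-ins keyed by offset, not anything from the rebalance code, and the model assumes the client writes to src first and then to dst:<br>

<pre>
```python
# Hypothetical in-memory model: src and dst map an offset to its data.
src = {0: "old"}
dst = {}

# Step 1: rebalance reads the region from src.
migrated = src[0]

# An application write (w2) lands between steps 1 and 2.
# The client writes to src first and then to dst.
src[0] = "new"
dst[0] = "new"       # w2 reaches dst first

# Step 2: the rebalance write (w1) reaches dst after w2, with stale data.
dst[0] = migrated    # w1

print("src:", src[0], "dst:", dst[0])
```
</pre>

After the last step src holds the application's data while dst holds the older copy, which is the mismatch the lock is meant to prevent.<br>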

</blockquote>

<br>

We can make use of compound fop framework to make sure we don&#39;t suffer a<br>

significant performance hit. Following will be the sequence of<br>

operations<br>

done by rebalance process:<br>

<br>

1. issues a compound (mandatory lock, read) operation on src.<br>

2. writes this data to dst.<br>

3. issues unlock of lock acquired in 1.<br>
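
<br>

The three steps above could be sketched as follows. This is only an illustration under stated assumptions: threading.Lock is a stand-in for the mandatory lock, the compound (lock, read) is modeled as two separate calls, and migrate_region and application_write are hypothetical names:<br>

<pre>
```python
import threading

# Hypothetical stand-in for the mandatory lock on the src region.
region_lock = threading.Lock()
src = {0: "old"}
dst = {}

def migrate_region(offset):
    # 1. compound (mandatory lock, read) on src, modeled as two calls
    region_lock.acquire()
    try:
        data = src[offset]
        # 2. write this data to dst while the lock is still held
        dst[offset] = data
    finally:
        # 3. unlock the lock acquired in 1
        region_lock.release()

def application_write(offset, data):
    # A real mandatory lock blocks this write without client cooperation;
    # the model takes the same lock only to imitate that blocking.
    with region_lock:
        src[offset] = data
        dst[offset] = data

threads = [
    threading.Thread(target=migrate_region, args=(0,)),
    threading.Thread(target=application_write, args=(0, "new")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(src[0] == dst[0])
```
</pre>

Whichever thread wins the lock, the read and the write to dst happen as one unit, so src and dst agree once both threads finish.<br>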

<br>

Please co-ordinate with Anuradha for implementation of this compound<br>

fop.<br>

<br>

Following are the issues I see with this approach:<br>

1. features/locks provides mandatory lock functionality only for<br>

posix-locks<br>

(flock and fcntl based locks). So, mandatory locks will be<br>

posix-locks which<br>

will conflict with locks held by application. So, if an application<br>

has held<br>

an fcntl/flock, migration cannot proceed.<br>

</blockquote></blockquote></blockquote>

<br></div></div>

What if the file is opened with O_NONBLOCK? Can&#39;t the rebalance process skip the file and continue if mandatory lock acquisition fails?</blockquote><div><br></div><div style="">Similar functionality can be achieved by acquiring a non-blocking inodelk, like SETLK (as opposed to SETLKW). However, whether the rebalance process should block depends on the use case. In some use cases (like remove-brick) the rebalance process _has_ to migrate all the files. Even in other scenarios, skipping too many files is not a good idea, as it defeats the purpose of running rebalance. So one of the design goals is to migrate as many files as possible without making the design too complex.</div><div style=""><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

We can implement a &quot;special&quot; domain for mandatory internal locks.<br>

These locks will behave similar to posix mandatory locks in that<br>

conflicting fops (like write, read) are blocked/failed if they are<br>

done while a lock is held.<br>
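
<br>

A rough sketch of how such a domain might behave, assuming byte ranges are (start, end) pairs with an exclusive end. All names here are hypothetical, lock-owner checks are left out, and the posix list is assumed to hold only mandatory locks taken by other owners:<br>

<pre>
```python
# Hypothetical model: each lock is a (start, end) range, end exclusive.
posix_mandatory_locks = []   # application locks (owner checks omitted)
mandatory_inodelks = []      # internal locks in the "special" domain

def overlaps(a, b):
    # A non-empty intersection means the two byte ranges overlap.
    return bool(range(max(a[0], b[0]), min(a[1], b[1])))

def can_take_inodelk(region):
    # Internal locks conflict only inside their own domain,
    # never with locks held by the application.
    return not any(overlaps(region, held) for held in mandatory_inodelks)

def write_allowed(region):
    # A write has to pass both the posix list and the inodelk list.
    held = posix_mandatory_locks + mandatory_inodelks
    return not any(overlaps(region, lock) for lock in held)

posix_mandatory_locks.append((500, 600))   # application lock
granted = can_take_inodelk((500, 550))     # not blocked by the posix lock
mandatory_inodelks.append((0, 50))         # rebalance locks a region

print(granted)
print(write_allowed((10, 20)))
print(write_allowed((200, 300)))
```
</pre>

In this model the internal lock is granted even where an application lock exists, while a write overlapping the locked region is refused and a write elsewhere goes through.<br>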

</blockquote></blockquote>

<br></span>

So is the only difference between mandatory internal locks and posix mandatory locks that internal locks do not conflict with other application locks (advisory/mandatory)?</blockquote><div><br></div><div style="">Yes. Mandatory internal locks (aka mandatory inodelk for this discussion) will conflict only within their domain. They also conflict with any fops that might change the file (primarily write here, but other fops can be added as required). So in a fop like writev we need to check two lists: the external lock (posix lock) list _and_ the mandatory inodelk list.</div><div style=""><br></div><div style="">The reason (if not clear) for the rebalance process using mandatory locks is that clients need not be bothered with acquiring a lock (which would unnecessarily degrade I/O performance when there is no rebalance going on). Thanks to Raghavendra Talur for suggesting this idea (in a different context, lock migration, but the use cases are similar).</div><div style=""><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

2. data migration will be less efficient because of an extra unlock<br>

(with<br>

compound lock + read) or extra lock and unlock (for non-compound fop<br>

based<br>

implementation) for every read it does from src.<br>

</blockquote>

<br>

Can we use delegations here? Rebalance process can acquire a<br>

mandatory-write-delegation (an exclusive lock with a functionality<br>

that the delegation is recalled when a write operation happens). In that<br>

case, the rebalance process can do something like:<br>

<br>

1. Acquire a read delegation for entire file.<br>

2. Migrate the entire file.<br>

3. Remove/unlock/give-back the delegation it has acquired.<br>

<br>

If a recall is issued from the brick (when a write happens from a mount),<br>

the rebalance process completes the current write to dst (or throws away<br>

the read from src) to maintain atomicity. Before doing the next set of<br>

(read, src) and (write, dst), it tries to reacquire the lock.<br>
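
<br>

The recall handling described above might look roughly like this. It is a single-process sketch with hypothetical names, not the delegation API. It assumes the brick holds application writes back until the delegation is returned, and that the client writes to src first and then to dst:<br>

<pre>
```python
src = {0: "a", 1: "b", 2: "c"}
dst = {}
pending = []     # application writes held back while the delegation is out
held = False     # whether rebalance currently holds the delegation

def acquire():
    global held
    held = True

def give_back():
    # Returning the delegation lets the blocked writes through.
    global held
    held = False
    while pending:
        offset, data = pending.pop(0)
        src[offset] = data
        dst[offset] = data

def application_write(offset, data):
    if held:
        pending.append((offset, data))   # a recall would be sent here
    else:
        src[offset] = data
        dst[offset] = data

for offset in sorted(src):
    if not held:
        acquire()                    # reacquire before the next transaction
    dst[offset] = src[offset]        # one (read, src) and (write, dst)
    if offset == 0:
        application_write(1, "B")    # a write arrives during migration
    if pending:
        give_back()                  # answer the recall between transactions
give_back()

print(src == dst)
```
</pre>

The delegation is only handed back between transactions, so each (read, src) and (write, dst) pair completes as a unit, and the application write is applied before migration continues.<br>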

</blockquote>

<br>

Delegations simplify the normal path, where a file is handled<br>

exclusively by rebalance. They also improve the case where a client<br>

and rebalance conflict on a file, since either party can then degrade<br>

to mandatory locks.<br>

<br>

I would prefer we take the delegation route for such needs in the future.<br>

<br>

</blockquote></span>

Right. But if there is simultaneous access to the same file from any other client and the rebalance process, delegations will not be granted, or will be revoked if already granted, even if the two are operating at different offsets. So if you rely only on delegations, migration may not proceed if an application holds a lock or is doing any I/O.<br></blockquote><div><br></div><div style="">Does the brick process wait for the response of the delegation holder (the rebalance process here) before it wipes out the delegation/locks? If that&#39;s the case, the rebalance process can complete one transaction of (read, src) and (write, dst) before responding to a delegation recall. That way neither the applications nor the rebalance process is starved (though this makes both of them slower, but that cannot be helped, I think).</div><div style=""><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Also, ideally the rebalance process has to take a write delegation, as it would end up writing the data on the destination brick, which will affect READ I/Os (though of course we can have special checks/hacks for internally generated fops).<br></blockquote><div><br></div><div style="">No, read delegations (on src) are sufficient for our use case. All we need is that if there is a write on src while the rebalance process holds a delegation, that write is blocked until the rebalance process returns the delegation. Write delegations are unnecessarily more restrictive, as they conflict with application reads too, which we don&#39;t need. For the sake of clarity, the client always writes to src first and then to dst. Also, writes to src and dst are serialized. So it&#39;s sufficient that we synchronize on src.</div><div style=""><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

That said, having delegations shall definitely ensure correctness with respect to exclusive file access.<br>

<br>

Thanks,<br>

Soumya<div class="HOEnZb"><div class="h5"><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

@Soumyak, can something like this be done with delegations?<br>

<br>

@Pranith,<br>

Afr does transactions for writing to its subvols. Can you suggest any<br>

optimizations here so that rebalance process can have a transaction<br>

for (read, src) and (write, dst) with minimal performance overhead?<br>

<br>

regards,<br>

Raghavendra.<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Comments?<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

regards,<br>

Raghavendra.<br>

</blockquote>

<br>

</blockquote></blockquote></blockquote>

_______________________________________________<br>

Gluster-devel mailing list<br>

<a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a><br>

<a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman/listinfo/gluster-devel</a><br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Raghavendra G<br></div>

</div></div>