<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 21, 2016 at 10:58 PM, Raghavendra Talur <span dir="ltr">&lt;<a href="mailto:rtalur@redhat.com" target="_blank">rtalur@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="gmail-">On Wed, Sep 21, 2016 at 6:32 PM, Ric Wheeler <span dir="ltr">&lt;<a href="mailto:ricwheeler@gmail.com" target="_blank">ricwheeler@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><span>On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote:<br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">


Hi all,<br>


<br>


This mail is to figure out the behavior of writes to the same file from two different fds. As Ryan says in one of the comments,<br>


<br>


&lt;comment&gt;<br>


<br>


I think it’s not safe. Consider this case:<br>


1. P1 writes to F1 using FD1<br>


2. After P1’s write finishes, P2 writes to the same place using FD2<br>


Since they do not conflict with each other now, the order in which the two writes are sent to the underlying fs is not determined, so the final data may be P1’s or P2’s.<br>


These semantics are not the same as those of Linux buffered IO. Linux buffered IO makes the second write overwrite the first one; that is to say, the final data is P2’s.<br>


You can see this in Linux NFS (which, like us, is a network filesystem), in fs/nfs/file.c:nfs_write_begin(): NFS flushes an ‘incompatible’ request first, before another write begins. Two requests are determined to be ‘incompatible’ if they come from two different open fds.<br>


I think write-behind’s behaviour should stay the same as that of the Linux page cache.<br>


<br>


&lt;/comment&gt;<br>


</blockquote>


<br></span>
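<br>
As a rough illustration of the scenario in the comment above (not taken from the thread: the file is a temporary one, and P1 and P2 are simulated by two descriptors inside a single process), the sequence can be tried against a local filesystem with a short Python sketch:<br>
<pre>
```python
import os
import tempfile


def write_via_two_fds():
    """Write the same region through two descriptors, one after the other."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    fd1 = os.open(path, os.O_WRONLY)  # stands in for P1's FD1
    fd2 = os.open(path, os.O_WRONLY)  # stands in for P2's FD2
    try:
        os.pwrite(fd1, b"P1P1", 0)    # P1's write returns first
        os.pwrite(fd2, b"P2P2", 0)    # P2 then writes the same region
    finally:
        os.close(fd1)
        os.close(fd2)
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)
    return data


if __name__ == "__main__":
    print(write_via_two_fds())
```
</pre>
On a local Linux filesystem the second write lands on top of the first, so the function returns P2&#39;s bytes. This says nothing about what write-behind does; it only shows the local baseline that the comment compares against.<br>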


I think the way this would actually work is that both writes go to the same page in the page cache (when using buffered IO), so as long as they do not happen at the same time, you would get P2&#39;s copy of the data (the second write) each time.<br></blockquote><div><br></div></span><div>I apologize if my understanding is wrong, but IMO this is exactly what we do in write-behind too. The cache is inode-based and ensures that writes are ordered irrespective of the FD used for the write.</div><div><br></div><div><br></div><div>Here is the commit message that introduced the change:</div><div>------------------------------<wbr>------------------------------<wbr>-------------------------</div><div><div>write-behind: implement causal ordering and other cleanup</div><div> </div><div>Rules of causal ordering implemented:</div><div> </div><div> - If request A arrives after the acknowledgement (to the app,</div><div>   i.e., STACK_UNWIND) of another request B, then request B is</div><div>   said to have &#39;caused&#39; request A.</div></div></div></div></div></blockquote><div><br></div><div>With the above principle, two write requests (p1 and p2 in the example above) issued by _two different threads/processes_ need _not always_ have a &#39;causal&#39; relationship (whether one exists depends purely on the &quot;chance&quot; of whether write-behind has already acked one or both of them, and on the timing of their 
arrival). So, the current write-behind is agnostic to the ordering of p1 and p2 (when they are issued by two threads).</div><div><br></div><div>However, if p1 and p2 are issued by the same thread, there is _always_ a causal relationship (p2 being caused by p1). </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div> </div><div> - (corollary) Two requests which, at any point of time, are</div><div>   unacknowledged simultaneously in the system can never &#39;cause&#39;</div><div>   each other (wb_inode-&gt;gen is based on this)</div><div> </div><div> - If request A is caused by request B, AND request A&#39;s region</div><div>   has an overlap with request B&#39;s region, then the fulfillment</div><div>   of request A is guaranteed to happen after the fulfillment of B.</div><div> </div><div> - FD of origin is not considered for the determination of causal</div><div>   ordering.</div><div> </div><div> - Append operation&#39;s region is considered the whole file.</div><div> </div><div> Other cleanup:</div><div> </div><div> - wb_file_t not 
required any more.</div><div> </div><div> - wb_local_t not required any more.</div><div> </div><div> - O_RDONLY fd&#39;s operations now go through the queue to make sure</div><div>   writes in the requested region get fulfilled be</div></div><div>------------------------------<wbr>------------------------------<wbr>------------------------------<wbr>-----</div><div><br></div><div>Thanks,</div><div>Raghavendra Talur</div><div><div class="gmail-h5"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">


<br>


Same story for using O_DIRECT - that write bypasses the page cache and will update the data directly.<br>


<br>


What might happen in practice, though, is that your applications use higher-level IO routines that buffer data internally. If that happens, there is no predictable ordering.<br>
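<br>
The effect of application-level buffering can be sketched in Python, whose file objects keep their own user-space buffers. This is an illustration only; the buffer size, payloads and function name are arbitrary assumptions:<br>
<pre>
```python
import os
import tempfile


def buffered_writers():
    """Two buffered handles on one file; flush order decides the result."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    first = open(path, "r+b", buffering=4096)
    second = open(path, "r+b", buffering=4096)
    first.write(b"AAAA")    # issued first, but held in first's own buffer
    second.write(b"BBBB")   # issued second
    second.flush()          # reaches the kernel first
    first.flush()           # reaches the kernel last and overwrites
    first.close()
    second.close()
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)
    return data


if __name__ == "__main__":
    print(buffered_writers())
```
</pre>
Here the write that the application issued first reaches the kernel last, so its data is what remains in the file, regardless of any ordering the filesystem enforces.<br>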


<br>


Regards,<br>


<br>


Ric<div><div><br>


<br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">


<br>


However, my understanding is that filesystems need not maintain the relative order of writes (as received from the vfs/kernel) on two different fds. Also, maintaining the order might come at the cost of increased latency, because &quot;newer&quot; writes would have to wait on &quot;older&quot; ones. This wait can fill up the write-behind buffer and eventually result in a full write-behind cache that is unable to &quot;write-back&quot; newer writes.<br>


<br>


* What does POSIX say about it?<br>


* How do other filesystems behave in this scenario?<br>


<br>


<br>


Also, the current write-behind implementation has the concept of &quot;generation numbers&quot;. To quote from the comment in the source:<br>


<br>


&lt;write-behind src&gt;<br>


<br>


         uint64_t     gen;    /* Liability generation number. Represents<br>


                                 the current &#39;state&#39; of liability. Every<br>


                                 new addition to the liability list bumps<br>


                                 the generation number.<br>


                                 a newly arrived request is only required<br>


                                 to perform causal checks against the entries<br>


                                 in the liability list which were present<br>


                                 at the time of its addition. the generation<br>


                                 number at the time of its addition is stored<br>


                                 in the request and used during checks.<br>


                                 the liability list can grow while the request<br>


                                 waits in the todo list waiting for its<br>


                                 dependent operations to complete. however<br>


                                 it is not of the request&#39;s concern to depend<br>


                                 itself on those new entries which arrived<br>


                                 after it arrived (i.e, those that have a<br>


                                 liability generation higher than itself)<br>


                              */<br>


&lt;/src&gt;<br>
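<br>
The rule in the source comment above can be sketched as a toy model. This is not Gluster code: ToyInode, Request, overlaps and demo are hypothetical names, and the model assumes a request joins the liability list at the moment it is acknowledged to the application:<br>
<pre>
```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    offset: int
    size: int
    gen: int = 0  # liability generation seen when the request arrived


def overlaps(a, b):
    """True when the byte ranges of a and b share at least one byte."""
    lo = max(a.offset, b.offset)
    hi = min(a.offset + a.size, b.offset + b.size)
    return len(range(lo, hi)) != 0


class ToyInode:
    """Hypothetical stand-in for per-inode write-behind state."""

    def __init__(self):
        self.gen = 0
        self.liability = []  # (generation, request): acked, not yet fulfilled

    def arrive(self, req):
        req.gen = self.gen   # remember the generation at arrival time
        return req

    def acknowledge(self, req):
        self.gen += 1        # every addition to the list bumps the generation
        self.liability.append((self.gen, req))

    def must_wait_for(self, req):
        """Names of overlapping liabilities that existed when req arrived."""
        deps = []
        for gen, other in self.liability:
            present_on_arrival = max(gen, req.gen) == req.gen
            if present_on_arrival and overlaps(req, other):
                deps.append(other.name)
        return deps


def demo():
    # Same thread: p2 is issued only after p1 has been acknowledged.
    serial = ToyInode()
    p1 = serial.arrive(Request("p1", 0, 4096))
    serial.acknowledge(p1)
    p2 = serial.arrive(Request("p2", 0, 4096))
    # Two threads: q2 arrives while q1 is still unacknowledged.
    racing = ToyInode()
    q1 = racing.arrive(Request("q1", 0, 4096))
    q2 = racing.arrive(Request("q2", 0, 4096))
    racing.acknowledge(q1)
    return serial.must_wait_for(p2), racing.must_wait_for(q2)


if __name__ == "__main__":
    print(demo())
```
</pre>
In this model p2 must wait for p1, while q2 has no dependency on q1 even though their regions overlap, which matches the corollary that two simultaneously unacknowledged requests never &#39;cause&#39; each other.<br>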


<br>


So, if a single thread is doing writes on two different fds, generation numbers are sufficient to enforce the relative ordering. If writes are from two different threads/processes, I think write-behind is not obligated to maintain their order. Comments?<br>


<br>


[1] <a href="http://review.gluster.org/#/c/15380/" rel="noreferrer" target="_blank">http://review.gluster.org/#/c/<wbr>15380/</a><br>


<br>


regards,<br>


Raghavendra<br>


______________________________<wbr>_________________<br>


Gluster-devel mailing list<br>


<a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a><br>


<a href="http://www.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://www.gluster.org/mailman<wbr>/listinfo/gluster-devel</a><br>


</blockquote>


<br>


<br>


</div></div></blockquote></div></div></div><br></div></div>


<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Raghavendra G<br></div>


</div></div>