<div dir="ltr">Ok, that made a lot of sense. I guess what I was expecting was that the writes were (close to) immediately consistent, but Gluster is rather designed to be eventually consistent. <div><div><br></div><div>Thanks for explaining all that.</div><div><br></div><div>Eric<br><div><br></div></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Apr 9, 2015 at 5:45 PM, Jeff Darcy <span dir="ltr"><<a href="mailto:jdarcy@redhat.com" target="_blank">jdarcy@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> Jeff: I don't really understand how a write-behind translator could keep data<br>
> in memory before flushing to the replication module if the replication is<br>
> synchronous. Or put another way, from whose perspective is the replication<br>
> synchronous? The gluster daemon or the creating client?<br>
<br>
</span>That's actually a more complicated question than many would think. When we<br>
say "synchronous replication" we're talking about *durability* (i.e. does<br>
the disk see it) from the perspective of the replication module. It does<br>
none of its own caching or buffering. When it is asked to do a write, it<br>
does not report that write as complete until all copies have been updated.<br>
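As a rough sketch (hypothetical Python, not Gluster's actual AFR code), "synchronous replication" here just means the replication layer's write call doesn't return success until every copy has been written:

```python
# Hypothetical sketch of synchronous-replication semantics: a write is
# not reported complete until *all* replicas have applied it.

class Replica:
    def __init__(self):
        self.data = {}

    def write(self, path, data):
        self.data[path] = data
        return True  # ack from this copy

def replicated_write(replicas, path, data):
    """Return success only once every copy has been updated."""
    acks = [r.write(path, data) for r in replicas]
    return all(acks)

replicas = [Replica(), Replica(), Replica()]
ok = replicated_write(replicas, "/vol/file", b"hello")
assert ok
assert all(r.data["/vol/file"] == b"hello" for r in replicas)
```

Note there is no buffer in that layer: the data is handed straight to every copy before the caller hears back.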
<br>
However, durability is not the same as consistency (i.e. do *other clients*<br>
see it) and the replication component does not exist in a vacuum. There<br>
are other components both before and after that can affect durability and<br>
consistency. We've already touched on the "after" part. There might be<br>
caches at many levels that become stale as the result of a file being<br>
created and written. Of particular interest here are "negative directory<br>
entries", which indicate that a file is *not* present. Until those expire,<br>
it is possible to see a file as "not there" even though it does actually<br>
exist on disk. We can control some of this caching, but not all.<br>
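To make the negative-entry effect concrete, here is a toy cache (hypothetical, TTL-based, not Gluster's implementation) where a remembered "does not exist" answer hides a file that has since been created, until the entry expires:

```python
import time

class NegativeEntryCache:
    """Toy negative-entry cache: remembers that a lookup failed, so a
    file can appear 'not there' even after it exists on disk, until
    the cached entry expires."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.negative = {}  # path -> expiry timestamp

    def lookup(self, path, backend):
        now = time.monotonic()
        expiry = self.negative.get(path)
        if expiry is not None and now < expiry:
            return False                  # stale "does not exist" answer
        self.negative.pop(path, None)
        exists = path in backend
        if not exists:
            self.negative[path] = now + self.ttl  # cache the miss
        return exists

backend = set()
cache = NegativeEntryCache(ttl=0.05)
assert cache.lookup("/vol/new", backend) is False  # caches the miss
backend.add("/vol/new")                            # file created on disk
assert cache.lookup("/vol/new", backend) is False  # still invisible
time.sleep(0.06)
assert cache.lookup("/vol/new", backend) is True   # visible after expiry
```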
<br>
The other side is *before* the replication module, and that's where<br>
write-behind comes in. POSIX does not require that a write be immediately<br>
durable in the absence of O_SYNC/fsync and so on. We do honor those<br>
requirements where applicable. However, the most common user expectation<br>
is that we will defer/batch/coalesce writes, because making every write<br>
individually immediate and synchronous has a very large performance impact.<br>
Therefore we implement write-behind, as a layer above replication. Absent<br>
any specific request to perform a write immediately, data might sit there<br>
for an indeterminate (but usually short) time before the replication code<br>
even gets to see it.<br>
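A minimal sketch of that layering (hypothetical Python, not the real write-behind translator): writes return immediately and sit in a buffer above the replication layer, which only sees the coalesced data on fsync:

```python
class WriteBehind:
    """Toy write-behind layer: buffers and coalesces writes, forwarding
    them to the (synchronous) layer below only on fsync."""

    def __init__(self, lower):
        self.lower = lower      # e.g. the replication module
        self.pending = {}       # path -> list of buffered writes

    def write(self, path, data):
        # Returns immediately; the lower layer has not seen the data yet.
        self.pending.setdefault(path, []).append(data)

    def fsync(self, path):
        # O_SYNC/fsync semantics: coalesce and push everything down now.
        chunks = self.pending.pop(path, [])
        if chunks:
            self.lower.write(path, b"".join(chunks))

class Lower:
    def __init__(self):
        self.seen = {}
    def write(self, path, data):
        self.seen[path] = data

lower = Lower()
wb = WriteBehind(lower)
wb.write("/vol/f", b"hello ")
wb.write("/vol/f", b"world")
assert "/vol/f" not in lower.seen          # data sits above replication
wb.fsync("/vol/f")
assert lower.seen["/vol/f"] == b"hello world"
```

The two buffered writes reach the lower layer as one coalesced write, which is exactly the batching that makes the performance difference.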
<br>
I don't think write-behind is likely to be the issue here, because it<br>
only applies to data within a file. It will pass create(2) calls through<br>
immediately, so all servers should become aware of the file's existence<br>
right away. On the other hand, various forms of caching on the *client*<br>
side (even if they're the same physical machines) could still prevent a<br>
new file from being seen immediately.<br>
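The create-versus-data distinction can be sketched the same way (again hypothetical, not Gluster's code): metadata operations like create(2) are forwarded immediately, while file data may still be buffered:

```python
class PassThroughCreate:
    """Toy sketch: defers file *data* but forwards create(2) straight
    through, so a new file's existence is visible on the server right
    away even while its contents still sit in the buffer."""

    def __init__(self, lower):
        self.lower = lower
        self.pending = {}

    def create(self, path):
        self.lower.create(path)                          # never deferred

    def write(self, path, data):
        self.pending.setdefault(path, []).append(data)   # deferred

class Server:
    def __init__(self):
        self.files = {}
    def create(self, path):
        self.files[path] = b""
    def write(self, path, data):
        self.files[path] += data

srv = Server()
fs = PassThroughCreate(srv)
fs.create("/vol/new")
fs.write("/vol/new", b"payload")
assert "/vol/new" in srv.files         # existence visible immediately
assert srv.files["/vol/new"] == b""    # data still buffered above
```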
</blockquote></div><br></div>