<div dir="ltr">Ok, that made a lot of sense. I guess what I was expecting was that the writes were (close to) immediately consistent, but Gluster is rather designed to be eventually consistent. <div><div><br></div><div>Thanks for explaining all that.</div><div><br></div><div>Eric<br><div><br></div></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Apr 9, 2015 at 5:45 PM, Jeff Darcy <span dir="ltr"><<a href="mailto:jdarcy@redhat.com" target="_blank">jdarcy@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> Jeff: I don't really understand how a write-behind translator could keep data<br>
> in memory before flushing to the replication module if the replication is<br>
> synchronous. Or put another way, from whose perspective is the replication<br>
> synchronous? The gluster daemon or the creating client?<br>
<br>
</span>That's actually a more complicated question than many would think. When we<br>
say "synchronous replication" we're talking about *durability* (i.e. does<br>
the disk see it) from the perspective of the replication module. It does<br>
none of its own caching or buffering. When it is asked to do a write, it<br>
does not report that write as complete until all copies have been updated.<br>
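As a rough sketch (hypothetical Python, not Gluster's actual AFR code), "synchronous replication" here just means the replication layer's write call doesn't return success until every copy has been written:

```python
# Hypothetical sketch of synchronous-replication semantics: a write is
# not reported complete until *all* replicas have applied it.

class Replica:
    def __init__(self):
        self.data = {}

    def write(self, path, data):
        self.data[path] = data
        return True  # ack from this copy

def replicated_write(replicas, path, data):
    """Return success only once every copy has been updated."""
    acks = [r.write(path, data) for r in replicas]
    return all(acks)

replicas = [Replica(), Replica(), Replica()]
ok = replicated_write(replicas, "/vol/file", b"hello")
assert ok
assert all(r.data["/vol/file"] == b"hello" for r in replicas)
```

Note there is no buffer in that layer: the data is handed straight to every copy before the caller hears back.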
<br>
However, durability is not the same as consistency (i.e. do *other clients*<br>
see it) and the replication component does not exist in a vacuum. There<br>
are other components both before and after that can affect durability and<br>
consistency. We've already touched on the "after" part. There might be<br>
caches at many levels that become stale as the result of a file being<br>
created and written. Of particular interest here are "negative directory<br>
entries", which indicate that a file is *not* present. Until those expire,<br>
it is possible to see a file as "not there" even though it does actually<br>
exist on disk. We can control some of this caching, but not all.<br>
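To make the negative-entry effect concrete, here is a toy cache (hypothetical, TTL-based, not Gluster's implementation) where a remembered "does not exist" answer hides a file that has since been created, until the entry expires:

```python
import time

class NegativeEntryCache:
    """Toy negative-entry cache: remembers that a lookup failed, so a
    file can appear 'not there' even after it exists on disk, until
    the cached entry expires."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.negative = {}  # path -> expiry timestamp

    def lookup(self, path, backend):
        now = time.monotonic()
        expiry = self.negative.get(path)
        if expiry is not None and now < expiry:
            return False                  # stale "does not exist" answer
        self.negative.pop(path, None)
        exists = path in backend
        if not exists:
            self.negative[path] = now + self.ttl  # cache the miss
        return exists

backend = set()
cache = NegativeEntryCache(ttl=0.05)
assert cache.lookup("/vol/new", backend) is False  # caches the miss
backend.add("/vol/new")                            # file created on disk
assert cache.lookup("/vol/new", backend) is False  # still invisible
time.sleep(0.06)
assert cache.lookup("/vol/new", backend) is True   # visible after expiry
```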
<br>
The other side is *before* the replication module, and that's where<br>
write-behind comes in. POSIX does not require that a write be immediately<br>
durable in the absence of O_SYNC/fsync and so on. We do honor those<br>
requirements where applicable. However, the most common user expectation<br>
is that we will defer/batch/coalesce writes, because making every write<br>
individually immediate and synchronous has a very large performance impact.<br>
Therefore we implement write-behind, as a layer above replication. Absent<br>
any specific request to perform a write immediately, data might sit there<br>
for an indeterminate (but usually short) time before the replication code<br>
even gets to see it.<br>
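A minimal sketch of that layering (hypothetical Python, not the real write-behind translator): writes return immediately and sit in a buffer above the replication layer, which only sees the coalesced data on fsync:

```python
class WriteBehind:
    """Toy write-behind layer: buffers and coalesces writes, forwarding
    them to the (synchronous) layer below only on fsync."""

    def __init__(self, lower):
        self.lower = lower      # e.g. the replication module
        self.pending = {}       # path -> list of buffered writes

    def write(self, path, data):
        # Returns immediately; the lower layer has not seen the data yet.
        self.pending.setdefault(path, []).append(data)

    def fsync(self, path):
        # O_SYNC/fsync semantics: coalesce and push everything down now.
        chunks = self.pending.pop(path, [])
        if chunks:
            self.lower.write(path, b"".join(chunks))

class Lower:
    def __init__(self):
        self.seen = {}
    def write(self, path, data):
        self.seen[path] = data

lower = Lower()
wb = WriteBehind(lower)
wb.write("/vol/f", b"hello ")
wb.write("/vol/f", b"world")
assert "/vol/f" not in lower.seen          # data sits above replication
wb.fsync("/vol/f")
assert lower.seen["/vol/f"] == b"hello world"
```

The two buffered writes reach the lower layer as one coalesced write, which is exactly the batching that makes the performance difference.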
<br>
I don't think write-behind is likely to be the issue here, because it<br>
only applies to data within a file. It will pass create(2) calls through<br>
immediately, so all servers should become aware of the file's existence<br>
right away. On the other hand, various forms of caching on the *client*<br>
side (even if they're the same physical machines) could still prevent a<br>
new file from being seen immediately.<br>
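The create-versus-data distinction can be sketched the same way (again hypothetical, not Gluster's code): metadata operations like create(2) are forwarded immediately, while file data may still be buffered:

```python
class PassThroughCreate:
    """Toy sketch: defers file *data* but forwards create(2) straight
    through, so a new file's existence is visible on the server right
    away even while its contents still sit in the buffer."""

    def __init__(self, lower):
        self.lower = lower
        self.pending = {}

    def create(self, path):
        self.lower.create(path)                          # never deferred

    def write(self, path, data):
        self.pending.setdefault(path, []).append(data)   # deferred

class Server:
    def __init__(self):
        self.files = {}
    def create(self, path):
        self.files[path] = b""
    def write(self, path, data):
        self.files[path] += data

srv = Server()
fs = PassThroughCreate(srv)
fs.create("/vol/new")
fs.write("/vol/new", b"payload")
assert "/vol/new" in srv.files         # existence visible immediately
assert srv.files["/vol/new"] == b""    # data still buffered above
```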
</blockquote></div><br></div>