<br><br><div class="gmail_quote">On Sun Feb 08 2015 at 10:16:27 PM Ben England &lt;<a href="mailto:bengland@redhat.com">bengland@redhat.com</a>&gt; wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Avati, I&#39;m all for your zero-copy RDMA API proposal, but I have a concern about your proposed zero-copy fop below...<br>

<br>

----- Original Message -----<br>

&gt; From: &quot;Anand Avati&quot; &lt;<a href="mailto:avati@gluster.org" target="_blank">avati@gluster.org</a>&gt;<br>

&gt; To: &quot;Mohammed Rafi K C&quot; &lt;<a href="mailto:rkavunga@redhat.com" target="_blank">rkavunga@redhat.com</a>&gt;, &quot;Gluster Devel&quot; &lt;<a href="mailto:gluster-devel@gluster.org" target="_blank">gluster-devel@gluster.org</a>&gt;<br>

&gt; Cc: &quot;Raghavendra Gowdappa&quot; &lt;<a href="mailto:rgowdapp@redhat.com" target="_blank">rgowdapp@redhat.com</a>&gt;, &quot;Ben Turner&quot; &lt;<a href="mailto:bturner@redhat.com" target="_blank">bturner@redhat.com</a>&gt;, &quot;Ben England&quot;<br>

&gt; &lt;<a href="mailto:bengland@redhat.com" target="_blank">bengland@redhat.com</a>&gt;, &quot;Suman Debnath&quot; &lt;<a href="mailto:sdebnath@redhat.com" target="_blank">sdebnath@redhat.com</a>&gt;<br>

&gt; Sent: Saturday, January 24, 2015 1:15:52 AM<br>

&gt; Subject: Re: RDMA: Patch to make use of pre registered memory<br>

&gt;<br>

&gt; Couple of comments -<br>

&gt;<br>

&gt; ...<br>

&gt; 4. Next step for zero-copy would be introduction of a new fop readto()<br>

&gt; where the destination pointer is passed from the caller (gfapi being the<br>

&gt; primary use case). In this situation RDMA ought to register that memory if<br>

&gt; necessary and request server to RDMA_WRITE into the pointer provided by<br>

&gt; gfapi caller.<br>

<br>

The readto() API emulates the Linux/Unix read() system call, where the caller passes in the address of the read buffer.  This API was created half a century ago in a non-distributed world.  IMHO the caller should not specify where the read data should arrive; instead it should let the read API specify where the data arrived.  There should be a pre-registered pool of buffers that both the sender and receiver *already* know about and that can be used for RDMA reads, and one of these would be passed to the caller as part of the read &quot;event&quot; or completion.  This seems related to performance results that Rafi KC had posted earlier this month.<br>

<br>

Why does it matter?  With RDMA, the read transfer cannot begin until the OTHER END of the RDMA connection knows where the data will land, and it cannot know this soon enough if we wait until the read API call to specify what address to target.  An API where the caller specifies the buffer address *blocks* the sender, introduces latency (transmitting RDMA-able address to sender) and prevents pipelined, overlapping activity by sender and receiver.<br></blockquote><div><br></div><div>If I understand your question right, you are expressing concern that read-ahead cannot be done with readto() semantics. That is true in a sense, but generally not a concern. The typical use case is qemu, where we ideally want the gluster server to RDMA_WRITE the read() rpc reply straight into the page cache of the guest. QEMU always gives a pointer to its block layer (just like the half-century-old Unix read()) to fill data in, so this is a given constraint under which we need to work. The reason this is not as grave a problem as it appears is that all of this sits behind the read-ahead of the guest VM. The guest VM&#39;s read-ahead (if it is Linux, probably other OSes too) is typically asynchronous, and as long as gluster can handle multiple RDMA/RPC requests in parallel or pipelined (which it can), the &quot;blocks the sender&quot; problem does not really exist.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

So a read FOP for RDMA should be more like read_completion_event(buffer ** read_data_delivered).   It is possible to change libgfapi to support this since it does not have to conform rigidly to POSIX.  Could this work in Gluster translator API?   RPC interface?<br>

<br>

So then how would the remote sender find out when it was ok to re-use this buffer to service another RDMA read request?   Is there an interface, something like read_buffer_consumed(buffer * available_buf), on read API side that indicates to RDMA that the caller has consumed the buffer and it is ready for re-use, without the added expense of unregistering and re-registering?<br>

<br>

If so, then you then have a pipeline of buffers in one of 4 states:<br>

<br>

- in transmission by sender to reader<br>

- being consumed by reader<br>

- being returned to sender for re-use<br>

- available to sender<br>

(an available buffer then goes back to the first state)<br>

<br>

By increasing the number of buffers sufficiently, we can avoid a situation where round-trip latency prevents you from filling the gigantic 40-Gbps (56-Gbps for FDR IB) RDMA pipeline.<br>

<br>

I&#39;m also interested in how writes work - how do we avoid copies on the write path and also avoid having to re-register buffers with each write?<br>

<br>

BTW None of these concerns, or the concerns discussed by Rafi KC, are addressed in the Gluster 3.6 RDMA feature page.<br><br></blockquote><div><br></div><div>We had a very similar API in the previous incarnation of libgfapi (which was called libglusterfsclient) where read would just return the iobuf, which could possibly have been read-ahead into or io-cache&#39;ed in the past. It had the equivalent of <span style="font-size:13.1999998092651px">read_buffer_consumed() etc. as well. This is a fine approach in terms of efficiency, but in practice you would need to create an app from scratch designed for this style of API. The reality is that applications like to keep control of memory and what lands where, and do not like memory to be dictated and managed by underlying layers. What I mean is, even if we provide an API like what you suggest, the caller would most likely just memcpy() the data in the iobuf into its own managed buffer anyway. Also, the application now has to be very careful to call read_buffer_consumed() instead of free() depending on the specific buffer. This can be very tricky, as free() could be done in so many places and deeply nested layers in the application, and all those places would now need to be aware that a buffer could either be malloced or have originated from gluster. This model is a big failure; we saw it in libglusterfsclient.</span></div><div><span style="font-size:13.1999998092651px"><br></span></div><div><span style="font-size:13.1999998092651px">So, it is always best to let read-ahead happen in a layer as close to the client as possible. In the case of qemu/gfapi it is best left to the guest VM to do read-ahead, while all layers below do their best to avoid memcpy. 
In other gfapi use cases the read-ahead xlator makes sense (when the app needs to be simple), and in such cases memcpy() is unavoidable (the cost one pays for having a simple app and expecting performance).</span></div><div><span style="font-size:13.1999998092651px"><br></span></div><div>Thanks</div><div>Avati</div></div>
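<div><br></div><div>A minimal C sketch of the two API styles discussed above, for illustration only. Every name in it (demo_readto, demo_read_completion, demo_read_buffer_consumed, struct pool_buf) is hypothetical and is not a gfapi or libglusterfsclient function; the memcpy()/strcpy() calls merely stand in for the server&#39;s RDMA_WRITE, and the static array stands in for a pool of pre-registered buffers.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_SIZE 4
#define BUF_LEN   16

/* Hypothetical stand-in for one pre-registered RDMA buffer. */
struct pool_buf {
    char data[BUF_LEN];
    int  in_use;              /* 1 while the application holds the buffer */
};

/* Hypothetical stand-in for the pre-registered pool (zero-initialized). */
static struct pool_buf pool[POOL_SIZE];

/* Style 1: the caller supplies the destination, as with read() or the
 * proposed readto(). The application keeps ownership of the memory. */
static size_t demo_readto(char *dst, size_t len)
{
    const char *payload = "hello";
    size_t n = strlen(payload) + 1;   /* include the terminating NUL */
    if (n > len)
        n = len;
    memcpy(dst, payload, n);          /* stands in for RDMA_WRITE into dst */
    return n;
}

/* Style 2: the library picks the buffer and hands it to the caller as part
 * of the read completion. Returns NULL when every buffer is still held. */
static struct pool_buf *demo_read_completion(void)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            strcpy(pool[i].data, "hello");  /* data already landed here */
            return &pool[i];
        }
    }
    return NULL;
}

/* Style 2, continued: the caller must return the buffer explicitly.
 * Calling free() on it instead would be a bug. */
static void demo_read_buffer_consumed(struct pool_buf *b)
{
    assert(b != NULL && b->in_use);
    b->in_use = 0;
}

/* Helper: number of buffers currently available to the sender. */
static int demo_pool_free_count(void)
{
    int n = 0;
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].in_use)
            n++;
    return n;
}
```

<div>With the first style the application owns the destination memory and releases it however it likes. With the second, every code path that would normally free() the data has to call the release function instead, and a reader that forgets to do so eventually drains the pool, at which point demo_read_completion() returns NULL.</div>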