Please review <a href="http://review.gluster.org/9332/">http://review.gluster.org/9332/</a>, as it undoes the introduction of itransform on d_off in AFR. This does not solve DHT-over-DHT or other future use cases, but at least fixes the regression in 3.6.x.<br><div><br></div><div>Thanks</div><br><div class="gmail_quote">On Tue Dec 23 2014 at 10:34:41 AM Anand Avati <<a href="mailto:avati@gluster.org">avati@gluster.org</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Using GFID does not work for d_off. The GFID represents an inode, and a d_off represents a directory entry. Therefore using GFID as an alternative to d_off breaks down when you have hardlinks for the same inode in a single directory.<br><br><div class="gmail_quote">On Tue Dec 23 2014 at 2:20:34 AM Xavier Hernandez <<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 12/22/2014 06:41 PM, Jeff Darcy wrote:<br>
>> An alternative would be to convert directories into regular files from<br>
>> the brick point of view.<br>
>><br>
>> The benefits of this would be:<br>
>><br>
>> * d_off would be controlled by gluster, so all bricks would have the<br>
>> same d_off and order. No need to use any d_off mapping or transformation.<br>
><br>
> I don't think a full-out change from real directories to virtual ones is<br>
> in the cards, but a variant of this idea might be worth exploring further.<br>
> If we had a *server side* component to map between on-disk d_off values<br>
> and those we present to clients, then it might be able to do a better job<br>
> than the local FS of ensuring uniqueness within the bits (e.g. 48 of them)<br>
> that are left over after we subtract some for a brick ID. This could be<br>
> enough to make the bit-stealing approach (on the client) viable. There<br>
> are probably some issues with failing over between replicas, which should<br>
> have the same files but might not have assigned the same internal d_off<br>
> values, but those issues might be avoidable if the d_off values are<br>
> deterministic with respect to GFIDs.<br>
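As a rough illustration of the "bit-stealing" idea above, a global d_off could pack a brick ID into the high bits and the backend filesystem's own d_off into the remaining ones. This is a hypothetical sketch, not actual GlusterFS code; the 16-bit brick field is an arbitrary assumption:<br>

```python
# Hypothetical sketch of a bit-stealing d_off transform: the top bits
# of the 64-bit d_off carry the brick index, the low bits carry the
# backend filesystem's local d_off. (Illustrative only.)

BRICK_BITS = 16                  # bits stolen for the brick ID (assumption)
DOFF_BITS = 64 - BRICK_BITS      # 48 bits left for the local d_off

def encode_doff(brick_id: int, local_doff: int) -> int:
    """Combine a brick index and a local d_off into a global d_off."""
    assert brick_id < (1 << BRICK_BITS)
    assert local_doff < (1 << DOFF_BITS), "local d_off overflows 48 bits"
    return (brick_id << DOFF_BITS) | local_doff

def decode_doff(global_doff: int) -> tuple[int, int]:
    """Split a global d_off back into (brick index, local d_off)."""
    return global_doff >> DOFF_BITS, global_doff & ((1 << DOFF_BITS) - 1)
```

The uniqueness problem discussed in this thread is exactly the assertion above: the local filesystem's d_off must actually fit in the remaining 48 bits.<br>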
<br>
Having a server-side xlator seems a better approach; however, I see<br>
some problems that need to be solved:<br>
<br>
The mapper should work on the fly (i.e. it should map the local<br>
d_off to the client d_off without having full knowledge of the<br>
directory contents). This is a good approach for really big directories<br>
because it doesn't require wasting large amounts of memory, but it will<br>
be hard to find a way to avoid duplicates, especially if we are limited<br>
to ~48 bits.<br>
<br>
Basing it on the GFID would be a good way to have a common d_off<br>
between bricks; however, maintaining order will be harder. It will also<br>
be hard to guarantee uniqueness if the mapping is deterministic and the<br>
directory is very big. Otherwise it would need to read the full<br>
directory contents before returning mapped d_off's.<br>
<br>
To minimize the collision problem, we need to solve the ordering<br>
problem. If we can guarantee that all bricks return directory entries in<br>
the same order and with the same d_off, we don't need to reserve any<br>
bits in d_off.<br>
<br>
I think the virtual directories solution should be the one to consider<br>
for 4.0. For earlier versions we can try to find an intermediate solution.<br>
<br>
Following your idea of a server-side component, could this be useful?<br>
<br>
* Keep each directory and its entries in a doubly linked list stored in<br>
xattrs of each inode.<br>
<br>
* Use this linked list to build the readdir answer.<br>
<br>
* Use the first 64 (or 63) bits of gfid as the d_off.<br>
<br>
* There will be two special offsets: 0 for '.' and 1 for '..'<br>
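A minimal sketch of this gfid-to-d_off mapping (hypothetical helper, assuming standard 128-bit GFIDs):<br>

```python
import uuid

# Hypothetical sketch of the proposal above: use the first 64 bits of
# the 128-bit GFID as the entry's d_off, reserving 0 for '.' and 1
# for '..'.

DOT, DOTDOT = 0, 1  # special reserved offsets

def gfid_to_doff(gfid: uuid.UUID) -> int:
    """Take the first 8 bytes of the GFID as a 64-bit d_off."""
    doff = int.from_bytes(gfid.bytes[:8], "big")
    # 0 and 1 are reserved; a real implementation would need a rule
    # for a GFID that maps onto them (astronomically unlikely).
    assert doff not in (DOT, DOTDOT)
    return doff
```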
<br>
Example (using shorter gfid's for simplicity):<br>
<br>
Directory root with gfid 0001<br>
Directory 'test1' inside root with gfid 1111<br>
Directory 'test2' inside root with gfid 2222<br>
Entry 'entry1' inside 'test1' with gfid 3333<br>
Entry 'entry2' inside 'test1' with gfid 4444<br>
Entry 'entry3' inside 'test2' with gfid 4444<br>
Entry 'entry4' inside 'test2' with gfid 5555<br>
Entry 'entry5' inside 'test2' with gfid 6666<br>
<br>
/ (0001)<br>
test1/ (1111)<br>
entry1 (3333)<br>
entry2 (4444)<br>
test2/ (2222)<br>
entry3 (4444)<br>
entry4 (5555)<br>
entry5 (6666)<br>
<br>
Note that entry2 and entry3 are hardlinks.<br>
<br>
xattrs of root (0001):<br>
trusted.dirmap.0001.next = 1111<br>
trusted.dirmap.0001.prev = 2222<br>
<br>
xattrs of 'test1' (1111):<br>
trusted.dirmap.0001.next = 2222<br>
trusted.dirmap.0001.prev = 0001<br>
trusted.dirmap.1111.next = 3333<br>
trusted.dirmap.1111.prev = 4444<br>
<br>
xattrs of 'test2' (2222):<br>
trusted.dirmap.0001.next = 0001<br>
trusted.dirmap.0001.prev = 1111<br>
trusted.dirmap.2222.next = 4444<br>
trusted.dirmap.2222.prev = 6666<br>
<br>
xattrs of 'entry1' (3333):<br>
trusted.dirmap.1111.next = 4444<br>
trusted.dirmap.1111.prev = 1111<br>
<br>
xattrs of 'entry2'/'entry3' (4444):<br>
trusted.dirmap.1111.next = 1111<br>
trusted.dirmap.1111.prev = 3333<br>
trusted.dirmap.2222.next = 5555<br>
trusted.dirmap.2222.prev = 2222<br>
<br>
xattrs of 'entry4' (5555):<br>
trusted.dirmap.2222.next = 6666<br>
trusted.dirmap.2222.prev = 4444<br>
<br>
xattrs of 'entry5' (6666):<br>
trusted.dirmap.2222.next = 2222<br>
trusted.dirmap.2222.prev = 5555<br>
<br>
It's easy to enumerate all entries from the beginning of a directory.<br>
Also, since we already return extra information for each inode in a<br>
directory, accessing these new xattrs doesn't add much overhead.<br>
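For illustration, the enumeration could work like this sketch, which models the xattrs of the example above as a plain dict (only the 'next' pointers, using the shortened GFIDs) and walks the circular per-directory chain. Hypothetical code, not a real xlator:<br>

```python
# The xattrs from the example, modeled as a dict:
# xattrs[gfid]["trusted.dirmap.<dir_gfid>.next"] -> next gfid in <dir_gfid>.
xattrs = {
    "0001": {"trusted.dirmap.0001.next": "1111"},
    "1111": {"trusted.dirmap.0001.next": "2222",
             "trusted.dirmap.1111.next": "3333"},
    "2222": {"trusted.dirmap.0001.next": "0001",
             "trusted.dirmap.2222.next": "4444"},
    "3333": {"trusted.dirmap.1111.next": "4444"},
    "4444": {"trusted.dirmap.1111.next": "1111",   # hardlink: entry2
             "trusted.dirmap.2222.next": "5555"},  # hardlink: entry3
    "5555": {"trusted.dirmap.2222.next": "6666"},
    "6666": {"trusted.dirmap.2222.next": "2222"},
}

def readdir(dir_gfid: str) -> list[str]:
    """Walk the per-directory 'next' chain until it wraps back."""
    key = f"trusted.dirmap.{dir_gfid}.next"
    entries = []
    cur = xattrs[dir_gfid][key]
    while cur != dir_gfid:          # the list is circular
        entries.append(cur)
        cur = xattrs[cur][key]
    return entries
```

A real brick would read these values with getxattr during readdirp rather than keeping them in memory; note how the hardlinked inode 4444 carries one chain entry per directory it appears in.<br>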
<br>
Given a random d_off, it's relatively easy to find a gfid that starts<br>
with that d_off and belongs to the directory (using<br>
.glusterfs/xx/yy/...). It should also be easy to find the entry nearest<br>
to a given d_off.<br>
<br>
This mechanism guarantees the same order and d_off on all bricks<br>
(assuming that directory modification operations are done with some<br>
locks held), and all management is done locally on each brick,<br>
transparently to the clients. It also has a very low probability of<br>
collisions (we can use 63 or 64 bits for d_off instead of 48) and does<br>
not require any transformation in dht/afr/ec.<br>
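To put a rough number on that, a standard birthday-bound approximation (an illustrative sketch, not from measurements) compares the 48-bit and 63-bit d_off spaces:<br>

```python
import math

# Rough birthday-bound estimate of the d_off collision probability
# inside a single directory with n entries and a b-bit d_off space:
# P(collision) ~ 1 - exp(-n*(n-1) / 2^(b+1)). Illustrative only.

def collision_prob(entries: int, bits: int) -> float:
    return 1.0 - math.exp(-entries * (entries - 1) / 2 / (1 << bits))

# For a directory with one million entries (illustrative numbers):
# collision_prob(10**6, 48) is roughly 0.18%
# collision_prob(10**6, 63) is roughly 5e-8
```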
<br>
However, a special operation would need to be implemented so that<br>
healing procedures can insert a directory entry at the correct position,<br>
maintaining order between bricks when they have diverged.<br>
<br>
It adds some complexity to ensure integrity, but since this job is done<br>
locally on each brick, it seems easier to maintain.<br>
<br>
Xavi<br>
</blockquote></div></blockquote></div>