Please review <a href="http://review.gluster.org/9332/">http://review.gluster.org/9332/</a>, as it undoes the introduction of itransform on d_off in AFR. This does not solve DHT-over-DHT or other future use cases, but at least fixes the regression in 3.6.x.<br><div><br></div><div>Thanks</div><br><div class="gmail_quote">On Tue Dec 23 2014 at 10:34:41 AM Anand Avati <<a href="mailto:avati@gluster.org">avati@gluster.org</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Using GFID does not work for d_off. The GFID represents an inode, and a d_off represents a directory entry. Therefore using GFID as an alternative to d_off breaks down when you have hardlinks for the same inode in a single directory.<br><br><div class="gmail_quote">On Tue Dec 23 2014 at 2:20:34 AM Xavier Hernandez <<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 12/22/2014 06:41 PM, Jeff Darcy wrote:<br>
>> An alternative would be to convert directories into regular files from<br>
>> the brick point of view.<br>
>><br>
>> The benefits of this would be:<br>
>><br>
>> * d_off would be controlled by gluster, so all bricks would have the<br>
>> same d_off and order. No need to use any d_off mapping or transformation.<br>
><br>
> I don't think a full-out change from real directories to virtual ones is<br>
> in the cards, but a variant of this idea might be worth exploring further.<br>
> If we had a *server side* component to map between on-disk d_off values<br>
> and those we present to clients, then it might be able to do a better job<br>
> than the local FS of ensuring uniqueness within the bits (e.g. 48 of them)<br>
> that are left over after we subtract some for a brick ID. This could be<br>
> enough to make the bit-stealing approach (on the client) viable. There<br>
> are probably some issues with failing over between replicas, which should<br>
> have the same files but might not have assigned the same internal d_off<br>
> values, but those issues might be avoidable if the d_off values are<br>
> deterministic with respect to GFIDs.<br>
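As a rough illustration of the "bit-stealing" idea above, a global d_off could pack a brick ID into the high bits and the backend filesystem's own d_off into the remaining ones. This is a hypothetical sketch, not actual GlusterFS code; the 16-bit brick field is an arbitrary assumption:<br>

```python
# Hypothetical sketch of a bit-stealing d_off transform: the top bits
# of the 64-bit d_off carry the brick index, the low bits carry the
# backend filesystem's local d_off. (Illustrative only.)

BRICK_BITS = 16                  # bits stolen for the brick ID (assumption)
DOFF_BITS = 64 - BRICK_BITS      # 48 bits left for the local d_off

def encode_doff(brick_id: int, local_doff: int) -> int:
    """Combine a brick index and a local d_off into a global d_off."""
    assert brick_id < (1 << BRICK_BITS)
    assert local_doff < (1 << DOFF_BITS), "local d_off overflows 48 bits"
    return (brick_id << DOFF_BITS) | local_doff

def decode_doff(global_doff: int) -> tuple[int, int]:
    """Split a global d_off back into (brick index, local d_off)."""
    return global_doff >> DOFF_BITS, global_doff & ((1 << DOFF_BITS) - 1)
```

The uniqueness problem discussed in this thread is exactly the assertion above: the local filesystem's d_off must actually fit in the remaining 48 bits.<br>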
<br>
Having a server-side xlator seems a better approach; however, I see<br>
some problems that need to be solved:<br>
<br>
The mapper should work on the fly (i.e. it should map the local<br>
d_off to the client d_off without having full knowledge of the<br>
directory contents). This is a good approach for really big directories<br>
because it doesn't require wasting large amounts of memory, but it will<br>
be hard to find a way to avoid duplicates, especially if we are limited<br>
to ~48 bits.<br>
<br>
Basing it on the GFID would be a good way to have a common d_off<br>
between bricks; however, maintaining order will be harder. It will also<br>
be hard to guarantee uniqueness if the mapping is deterministic and the<br>
directory is very big. Otherwise it would need to read the full<br>
directory contents before returning mapped d_off's.<br>
<br>
To minimize the collision problem, we need to solve the ordering<br>
problem. If we can guarantee that all bricks return directory entries in<br>
the same order and with the same d_off, we don't need to reserve any<br>
bits in d_off.<br>
<br>
I think the virtual directories solution should be the one to consider<br>
for 4.0. For earlier versions we can try to find an intermediate solution.<br>
<br>
Following your idea of a server-side component, could this be useful?<br>
<br>
* Keep each directory and its entries in a doubly linked list stored in<br>
xattrs of each inode.<br>
<br>
* Use this linked list to build the readdir answer.<br>
<br>
* Use the first 64 (or 63) bits of gfid as the d_off.<br>
<br>
* There will be two special offsets: 0 for '.' and 1 for '..'<br>
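A minimal sketch of this gfid-to-d_off mapping (hypothetical helper, assuming standard 128-bit GFIDs):<br>

```python
import uuid

# Hypothetical sketch of the proposal above: use the first 64 bits of
# the 128-bit GFID as the entry's d_off, reserving 0 for '.' and 1
# for '..'.

DOT, DOTDOT = 0, 1  # special reserved offsets

def gfid_to_doff(gfid: uuid.UUID) -> int:
    """Take the first 8 bytes of the GFID as a 64-bit d_off."""
    doff = int.from_bytes(gfid.bytes[:8], "big")
    # 0 and 1 are reserved; a real implementation would need a rule
    # for a GFID that maps onto them (astronomically unlikely).
    assert doff not in (DOT, DOTDOT)
    return doff
```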
<br>
Example (using shorter gfid's for simplicity):<br>
<br>
Directory root with gfid 0001<br>
Directory 'test1' inside root with gfid 1111<br>
Directory 'test2' inside root with gfid 2222<br>
Entry 'entry1' inside 'test1' with gfid 3333<br>
Entry 'entry2' inside 'test1' with gfid 4444<br>
Entry 'entry3' inside 'test2' with gfid 4444<br>
Entry 'entry4' inside 'test2' with gfid 5555<br>
Entry 'entry5' inside 'test2' with gfid 6666<br>
<br>
/ (0001)<br>
test1/ (1111)<br>
entry1 (3333)<br>
entry2 (4444)<br>
test2/ (2222)<br>
entry3 (4444)<br>
entry4 (5555)<br>
entry5 (6666)<br>
<br>
Note that entry2 and entry3 are hardlinks.<br>
<br>
xattrs of root (0001):<br>
trusted.dirmap.0001.next = 1111<br>
trusted.dirmap.0001.prev = 2222<br>
<br>
xattrs of 'test1' (1111):<br>
trusted.dirmap.0001.next = 2222<br>
trusted.dirmap.0001.prev = 0001<br>
trusted.dirmap.1111.next = 3333<br>
trusted.dirmap.1111.prev = 4444<br>
<br>
xattrs of 'test2' (2222):<br>
trusted.dirmap.0001.next = 0001<br>
trusted.dirmap.0001.prev = 1111<br>
trusted.dirmap.2222.next = 4444<br>
trusted.dirmap.2222.prev = 6666<br>
<br>
xattrs of 'entry1' (3333):<br>
trusted.dirmap.1111.next = 4444<br>
trusted.dirmap.1111.prev = 1111<br>
<br>
xattrs of 'entry2'/'entry3' (4444):<br>
trusted.dirmap.1111.next = 1111<br>
trusted.dirmap.1111.prev = 3333<br>
trusted.dirmap.2222.next = 5555<br>
trusted.dirmap.2222.prev = 2222<br>
<br>
xattrs of 'entry4' (5555):<br>
trusted.dirmap.2222.next = 6666<br>
trusted.dirmap.2222.prev = 4444<br>
<br>
xattrs of 'entry5' (6666):<br>
trusted.dirmap.2222.next = 2222<br>
trusted.dirmap.2222.prev = 5555<br>
<br>
It's easy to enumerate all entries from the beginning of a directory.<br>
Also, since we already return extra information for each inode in a<br>
directory, accessing these new xattrs doesn't add much overhead.<br>
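For illustration, the enumeration could work like this sketch, which models the xattrs of the example above as a plain dict (only the 'next' pointers, using the shortened GFIDs) and walks the circular per-directory chain. Hypothetical code, not a real xlator:<br>

```python
# The xattrs from the example, modeled as a dict:
# xattrs[gfid]["trusted.dirmap.<dir_gfid>.next"] -> next gfid in <dir_gfid>.
xattrs = {
    "0001": {"trusted.dirmap.0001.next": "1111"},
    "1111": {"trusted.dirmap.0001.next": "2222",
             "trusted.dirmap.1111.next": "3333"},
    "2222": {"trusted.dirmap.0001.next": "0001",
             "trusted.dirmap.2222.next": "4444"},
    "3333": {"trusted.dirmap.1111.next": "4444"},
    "4444": {"trusted.dirmap.1111.next": "1111",   # hardlink: entry2
             "trusted.dirmap.2222.next": "5555"},  # hardlink: entry3
    "5555": {"trusted.dirmap.2222.next": "6666"},
    "6666": {"trusted.dirmap.2222.next": "2222"},
}

def readdir(dir_gfid: str) -> list[str]:
    """Walk the per-directory 'next' chain until it wraps back."""
    key = f"trusted.dirmap.{dir_gfid}.next"
    entries = []
    cur = xattrs[dir_gfid][key]
    while cur != dir_gfid:          # the list is circular
        entries.append(cur)
        cur = xattrs[cur][key]
    return entries
```

A real brick would read these values with getxattr during readdirp rather than keeping them in memory; note how the hardlinked inode 4444 carries one chain entry per directory it appears in.<br>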
<br>
Given a random d_off, it's relatively easy to find a gfid that starts<br>
with that d_off and belongs to the directory (using<br>
.glusterfs/xx/yy/...). It should also be easy to find the entry nearest<br>
to a given d_off.<br>
<br>
This mechanism guarantees the same order and d_off on all bricks<br>
(assuming that directory modification operations are done with some<br>
locks held), and all management is done locally on each brick,<br>
transparently to the clients. It also has a very low probability of<br>
collisions (we can use 63 or 64 bits for d_off instead of 48) and does<br>
not require any transformation in dht/afr/ec.<br>
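To put a rough number on that, a standard birthday-bound approximation (an illustrative sketch, not from measurements) compares the 48-bit and 63-bit d_off spaces:<br>

```python
import math

# Rough birthday-bound estimate of the d_off collision probability
# inside a single directory with n entries and a b-bit d_off space:
# P(collision) ~ 1 - exp(-n*(n-1) / 2^(b+1)). Illustrative only.

def collision_prob(entries: int, bits: int) -> float:
    return 1.0 - math.exp(-entries * (entries - 1) / 2 / (1 << bits))

# For a directory with one million entries (illustrative numbers):
# collision_prob(10**6, 48) is roughly 0.18%
# collision_prob(10**6, 63) is roughly 5e-8
```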
<br>
However, a special operation would need to be implemented so that<br>
healing procedures can insert a directory entry at the correct position,<br>
maintaining order between bricks when they have diverged.<br>
<br>
It adds some complexity to ensure integrity, but since this job is done<br>
locally on each brick, it seems easier to maintain.<br>
<br>
Xavi<br>
</blockquote></div></blockquote></div>