<html><body><div style="font-family: garamond,new york,times,serif; font-size: 12pt; color: #000000"><div><br></div><div><br></div><hr id="zwchr"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;" data-mce-style="border-left: 2px solid #1010FF; margin-left: 5px; padding-left: 5px; color: #000; font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><b>From: </b>"Vijay Bellur" <vbellur@redhat.com><br><b>To: </b>"Krutika Dhananjay" <kdhananj@redhat.com><br><b>Cc: </b>"Gluster Devel" <gluster-devel@gluster.org><br><b>Sent: </b>Tuesday, February 24, 2015 12:26:58 PM<br><b>Subject: </b>Re: [Gluster-devel] Sharding - Inode write fops - recoverability from failures - design<br><div><br></div>On 02/24/2015 12:19 PM, Krutika Dhananjay wrote:<br>><br>><br>> ------------------------------------------------------------------------<br>><br>> *From: *"Vijay Bellur" <vbellur@redhat.com><br>> *To: *"Krutika Dhananjay" <kdhananj@redhat.com><br>> *Cc: *"Gluster Devel" <gluster-devel@gluster.org><br>> *Sent: *Tuesday, February 24, 2015 11:35:28 AM<br>> *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -<br>> recoverability from failures - design<br>><br>> On 02/24/2015 10:36 AM, Krutika Dhananjay wrote:<br>> ><br>> ><br>> ><br>> ------------------------------------------------------------------------<br>> ><br>> > *From: *"Vijay Bellur" <vbellur@redhat.com><br>> > *To: *"Krutika Dhananjay" <kdhananj@redhat.com>, "Gluster Devel"<br>> > <gluster-devel@gluster.org><br>> > *Sent: *Monday, February 23, 2015 5:25:57 PM<br>> > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -<br>> > recoverability from failures - design<br>> ><br>> > On 02/22/2015 06:08 PM, Krutika Dhananjay wrote:<br>> > > Hi,<br>> > ><br>> > > Please find the design doc for one of 
the problems in<br>> sharding which<br>> > > Pranith and I are trying to solve and its solution @<br>> > > http://review.gluster.org/#/c/9723/1.<br>> > > Reviews and feedback are much appreciated.<br>> > ><br>> ><br>> > Can this feature be made optional? I think there are use<br>> cases like<br>> > virtual machine image storage, hdfs etc. where the number of<br>> metadata<br>> > queries might not be very high. It would be an acceptable<br>> tradeoff in<br>> > such cases to not be very efficient for answering metadata<br>> queries but<br>> > be very efficient for data operations.<br>> ><br>> > IOW, can we have two possible modes of operation for the sharding<br>> > translator to answer metadata queries?<br>> ><br>> > 1. One that behaves like a regular filesystem where we expect<br>> a mix of<br>> > data and metadata operations. Your document seems to cover<br>> that part<br>> > well. We can look at optimizing behavior for multi-threaded<br>> single<br>> > writer use cases after an initial implementation is in place.<br>> > Techniques<br>> > like eager locking can be applied here.<br>> ><br>> > 2. Another mode where we do not expect a lot of metadata<br>> queries. In<br>> > this mode, we can visit all nodes where we have shards to<br>> answer these<br>> > queries.<br>> ><br>> > But for sharding translator to be able to visit all shards, it is<br>> > required to know the last shard number.<br>> > Without this, it will never know when to stop looking up the<br>> different<br>> > shards. 
For this to happen, we<br>> > still need to maintain the size attribute for each file.<br>> ><br>><br>> Wouldn't maintaining the total number of shards in the metadata<br>> shard be<br>> sufficient?<br>><br>> Maintaining the correctness of "total number of shards" would again<br>> incur the same cost as maintaining size or any other metadata attribute<br>> if a client/brick crashes in the middle of a write fop before the<br>> attribute is committed to disk.<br>> In other words, we will again need to maintain a "dirty" and "committed"<br>> copy of the shard_count to ensure its correctness.<br>><br><div><br></div>I think the cost of maintaining "total number of shards" is not as <br>expensive as maintaining size or any other metadata attribute. The shard <br>count needs to be updated only when an extending operation results in <br>the creation of a new shard or when a truncate operation results in the <br>removal of a shard. Maintaining other metadata attributes would need a 5 <br>phase transaction for every write operation. Isn't that the case?</blockquote><div>Even size attribute changes only in case of extending writes and truncates. 
In fact, Pranith and I had</div><div>initially chosen to persist the shard count rather than the size in the first design for inode write fops.</div><div>We <span style="font-size: 12pt;">went with size in the end to avoid an extra lookup on the last shard to</span></div><div><span style="font-size: 12pt;">find the total size </span><span style="font-size: 12pt;">of the file. With only a shard count N, file size = (N-1)*shard_block_size + sizeof(last shard), and sizeof(last shard) has to be looked up from the last shard itself.</span></div><div><span style="font-size: 12pt;"><br></span></div><div><span style="font-size: 12pt;">-Krutika</span></div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;" data-mce-style="border-left: 2px solid #1010FF; margin-left: 5px; padding-left: 5px; color: #000; font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><br><div><br></div>-Vijay<br><div><br></div></blockquote><div><br></div></div></body></html>