<html>
  <head>
    <meta content="text/html; charset=windows-1252"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">Good luck!<br>
      <br>
      On 25/04/2016 10:01 PM, Kevin Lemonnier wrote:<br>
    </div>
    <blockquote cite="mid:20160425120108.GZ22525@luwin.ulrar.net"
      type="cite">
      <pre wrap="">Hi,

So I'm trying that now.
I installed 3.7.11 on two nodes and put a few VMs on it, same config
as before but with 64MB shards and the heal algo set to full. As expected,
if I power off one of the nodes, everything is dead, which is fine.

Now I'm adding a third node; a big heal of everything (7000+ shards) was
started after the add-brick, and for now everything seems to be working
fine on the VMs. Last time I tried adding a brick, all those VMs died for
the duration of the heal, so that's already pretty good.
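(For anyone following along, the heal progress can be watched with something like
gluster volume heal gluster info
and
gluster volume heal gluster statistics heal-count
to see the number of entries left going down.)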

I'm gonna let it finish copying everything to the new node, then I'll try
to simulate nodes going down to see if my original problem of freezing and
slow heal times is solved with this config.
For reference, here is the volume info, if someone sees something I should change:

Volume Name: gluster
Type: Replicate
Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ipvr2.client_name:/mnt/storage/gluster
Brick2: ipvr3.client_name:/mnt/storage/gluster
Brick3: ipvr50.client_name:/mnt/storage/gluster
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 64MB
cluster.data-self-heal-algorithm: full
performance.readdir-ahead: on
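(I set those by hand again, same as before; if the distro ships
/var/lib/glusterd/groups/virt, I believe something like
gluster volume set gluster group virt
would apply most of them in one go, but setting them one by one works too.)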


It starts at 2 and jumps to 50 because the first server is doing something else for now,
and I'm using 50 as the temporary third node. If everything goes well, I'll migrate the production
to the cluster, re-install the first server and do a replace-brick, which I hope will work just as
well as the add-brick I'm doing now. The last replace-brick also brought everything down, but I guess that was the
joy of 3.7.6 :).
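
For reference, I expect that replace-brick step to look roughly like this
(ipvr1 is only a placeholder here for whatever the re-installed first
server ends up being called):

gluster volume replace-brick gluster ipvr50.client_name:/mnt/storage/gluster \
    ipvr1.client_name:/mnt/storage/gluster commit force
gluster volume heal gluster info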

Thanks !


On Mon, Apr 18, 2016 at 08:17:05PM +0530, Krutika Dhananjay wrote:
</pre>
      <blockquote type="cite">
        <pre wrap="">On Mon, Apr 18, 2016 at 8:02 PM, Kevin Lemonnier <a class="moz-txt-link-rfc2396E" href="mailto:lemonnierk@ulrar.net">&lt;lemonnierk@ulrar.net&gt;</a>
wrote:

</pre>
        <blockquote type="cite">
          <pre wrap="">I will try migrating to 3.7.10, is it considered stable yet ?

</pre>
        </blockquote>
        <pre wrap="">
Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)


</pre>
        <blockquote type="cite">
          <pre wrap="">Should I change the self heal algorithm even if I move to 3.7.10, or is
that not necessary ?
Not sure what that change might do.

</pre>
        </blockquote>
        <pre wrap="">
So the other algorithm is 'diff', which computes rolling checksums on chunks
of the src(es) and sink(s), compares them and heals upon mismatch. This is
known to consume a lot of CPU. The 'full' algo, on the other hand, simply
copies the src into the sink in chunks. With sharding, it shouldn't be all
that bad to copy a 256MB file (in your case) from src to sink. We've used
double the block size and had no issues reported.

So you could change self heal algo to full even in the upgraded cluster.
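That is the same command as in my previous mail, i.e.:
 #gluster volume set &lt;VOL&gt; cluster.data-self-heal-algorithm full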

-Krutika


</pre>
        <blockquote type="cite">
          <pre wrap="">
Anyway, I'll try to create a 3.7.10 cluster over the weekend and slowly move
the VMs onto it then.
Thanks a lot for your help,

Regards


On Mon, Apr 18, 2016 at 07:58:44PM +0530, Krutika Dhananjay wrote:
</pre>
          <blockquote type="cite">
            <pre wrap="">Hi,

Yeah, so the fuse mount log didn't convey much information.

So one of the reasons the heal may have taken so long (and also consumed
resources) is a bug in self-heal where it would heal from both source bricks
in 3-way replication. With such a bug, a heal would take twice the amount of
time and consume twice the resources.

This issue is fixed at <a class="moz-txt-link-freetext" href="http://review.gluster.org/#/c/14008/">http://review.gluster.org/#/c/14008/</a> and will be
available in 3.7.12.

The other thing you could do is to set cluster.data-self-heal-algorithm to
'full', for better heal performance and more regulated resource consumption.
 #gluster volume set &lt;VOL&gt; cluster.data-self-heal-algorithm full

As far as sharding is concerned, some critical caching issues were fixed in
3.7.7 and 3.7.8, and my guess is that the VM crash/unbootable state could be
because of this issue, which exists in 3.7.6.

3.7.10 saw the introduction of throttled client-side heals, which also moves
such heals to the background; this is all the more helpful for preventing
starvation of VMs during client heals.

Considering these factors, I think it would be better if you upgraded your
machines to 3.7.10.

Do let me know if migrating to 3.7.10 solves your issues.

-Krutika

On Mon, Apr 18, 2016 at 12:40 PM, Kevin Lemonnier <a class="moz-txt-link-rfc2396E" href="mailto:lemonnierk@ulrar.net">&lt;lemonnierk@ulrar.net&gt;</a>
wrote:

</pre>
            <blockquote type="cite">
          <pre wrap="">Yes, but as I was saying, I don't believe KVM is using a mount point, I
think it uses the API (
<a class="moz-txt-link-freetext" href="http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt">http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt</a>
). Might be mistaken of course. Proxmox does have a mount point for
convenience; I'll attach those logs, hoping they contain the information you
need. They do seem to contain a lot of errors for the 15th.
For reference, there was a disconnect of the first brick (10.10.0.1) in the
morning and then a successful heal that caused about 40 minutes of downtime
for the VMs. Right after that heal finished (if my memory is correct it was
about noon or close to it) the second brick (10.10.0.2) rebooted, and that's
the one I disconnected to prevent the heal from causing another downtime.
I reconnected it at the end of the afternoon, hoping the heal would go well,
but everything went down like in the morning, so I disconnected it again and
waited until 11pm (23:00) to reconnect it and let it finish.
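
(To be clear, by "the API" I mean qemu's native gluster driver rather than
a FUSE mount; as far as I understand the disks are accessed through URLs of
the form gluster://&lt;server&gt;/&lt;volume&gt;/&lt;path-to-image&gt;, e.g. something like
gluster://10.10.0.1/&lt;volume&gt;/images/&lt;vmid&gt;/vm-&lt;vmid&gt;-disk-1.qcow2, though I
might be wrong about how exactly Proxmox sets that up.)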

Thanks for your help,


On Mon, Apr 18, 2016 at 12:28:28PM +0530, Krutika Dhananjay wrote:
</pre>
              <blockquote type="cite">
                <pre wrap="">Sorry, I was referring to the glusterfs client logs.

Assuming you are using FUSE mount, your log file will be in
/var/log/glusterfs/&lt;hyphenated-mount-point-path&gt;.log
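(For example, a volume mounted at /mnt/pve/&lt;storage-id&gt;, which I believe is
where Proxmox mounts its storages, would log to
/var/log/glusterfs/mnt-pve-&lt;storage-id&gt;.log.)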

-Krutika

On Sun, Apr 17, 2016 at 9:37 PM, Kevin Lemonnier <a class="moz-txt-link-rfc2396E" href="mailto:lemonnierk@ulrar.net">&lt;lemonnierk@ulrar.net&gt;</a>
wrote:

</pre>
              <blockquote type="cite">
                <pre wrap="">I believe Proxmox is just an interface to KVM that uses the lib, so if I'm
not mistaken there aren't any client logs ?

It's not the first time I've had this issue, it happens on every heal on the
2 clusters I have.

I did let the heal finish that night and the VMs are working now, but it
is pretty scary for future crashes or brick replacements.
Should I maybe lower the shard size ? It won't solve the fact that 2 bricks
out of 3 aren't keeping the filesystem usable, but it might make the healing
quicker, right ?

Thanks

On 17 April 2016 at 17:56:37 GMT+02:00, Krutika Dhananjay &lt;<a class="moz-txt-link-abbreviated" href="mailto:kdhananj@redhat.com">kdhananj@redhat.com</a>&gt; wrote:
</pre>
                  <blockquote type="cite">
                    <pre wrap="">Could you share the client logs and information about the approx
time/day
when you saw this issue?

-Krutika

On Sat, Apr 16, 2016 at 12:57 AM, Kevin Lemonnier
<a class="moz-txt-link-rfc2396E" href="mailto:lemonnierk@ulrar.net">&lt;lemonnierk@ulrar.net&gt;</a>
wrote:

</pre>
                    <blockquote type="cite">
                      <pre wrap="">Hi,

We have a small glusterFS 3.7.6 cluster with 3 nodes running with Proxmox
VMs on it. I did set up the different recommended options like the virt
group, but by hand since it's on Debian. The shards are 256MB, if that
matters.

This morning the second node crashed, and as it came back up it started a
heal, but that basically froze all the VMs running on that volume. Since
we really really can't have 40 minutes of downtime in the middle of the
day, I just removed the node from the network and that stopped the heal,
allowing the VMs to access their disks again. The plan was to re-connect
the node in a couple of hours to let it heal at night.
But a VM crashed now, and it can't boot up again: it seems to freeze
trying to access the disks.

Looking at the heal info for the volume, it has gone way up since this
morning; it looks like the VMs aren't writing to both nodes, just the one
they are on.
It seems pretty bad, we have 2 nodes out of 3 up, I would expect the
volume to work just fine since it has quorum. What am I missing ?

It is still too early to start the heal, is there a way to start the
VM anyway right now ? I mean, it was running a moment ago so the data is
there, it just needs
to let the VM access it.



Volume Name: vm-storage
Type: Replicate
Volume ID: a5b19324-f032-4136-aaac-5e9a4c88aaef
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: first_node:/mnt/vg1-storage
Brick2: second_node:/mnt/vg1-storage
Brick3: third_node:/mnt/vg1-storage
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 256MB
cluster.server-quorum-ratio: 51%


Thanks for your help

--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
</pre>
                    </blockquote>
                  </blockquote>
                  <pre wrap="">-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
</pre>
              </blockquote>
            </blockquote>
              <pre wrap="">
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
</pre>
            </blockquote>
          </blockquote>
          <pre wrap="">
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
</pre>
        </blockquote>
      </blockquote>
      <pre wrap="">
</pre>
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <br>
      <pre wrap="">_______________________________________________
Gluster-users mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a>
<a class="moz-txt-link-freetext" href="http://www.gluster.org/mailman/listinfo/gluster-users">http://www.gluster.org/mailman/listinfo/gluster-users</a></pre>
    </blockquote>
    <br>
    <br>
    <pre class="moz-signature" cols="72">-- 
Lindsay Mathieson</pre>
  </body>
</html>