Thanks Pranith, will do. Sunday night we put some things in place that seem to be mitigating it, and thankfully we haven't seen it again, but if we do I'll send the profile info to the list. I was able to collect some profile info under normal load.<div><br></div><div>We added caching for some files we noticed had become really popular, and when that didn't entirely stop the problem, we also stopped the most recently added gluster volume. It's odd that this volume would have any impact, as it was only used to archive backups and was almost never active. But several times during the month we stopped it, just because it was the most recently added, and the issue went away; when we started it back up, the issue came back. Since then it's been quiet.<br><br>On Thu, Feb 5, 2015 at 5:14 AM, Pranith Kumar Karampuri &lt;pkarampu@redhat.com&gt; wrote:<br>

<blockquote type="cite">


    <br>

    <div class="moz-cite-prefix">On 02/03/2015 11:16 AM, Matt wrote:<br>

    </div>

    <blockquote cite="mid:1422942371.16185.1@smtp.gmail.com" type="cite">

      <div>Hello List,</div>

      <div><br>

      </div>

      So I've been frustrated by intermittent performance problems

      throughout January. The problem occurs on a two node setup running

      3.4.5, 16 gigs of RAM with a bunch of local disk. Sometimes for an

      hour, sometimes for weeks at a time (I have extensive graphs in

      OpenNMS) our Gluster boxes will get their CPUs pegged, and in

      vmstat they'll show extremely high numbers of context switches and

      interrupts. Eventually things calm down. During this time, memory

      usage actually drops. Overall usage on the box goes from between

      6-10 gigs to right around 4 gigs, and stays there. That's what

      really puzzles me.

      <div><br>

      </div>

      <div>When performance is problematic, sar shows one device, the

        device corresponding to the glusterfsd process using all the CPU,

        doing lots of little reads, sometimes 70k/second, with a very small avg

        rq size, say 10-12. Afraid I don't have any saved output handy,

        but I can try to capture some next time it happens. I have tons

        of information frankly, but am trying to keep this reasonably

        brief.<br>

        <div><br>

        </div>

        <div>There are more than a dozen volumes on this two node setup.

          The CPU usage is pretty much entirely confined to one volume,

          a 1.5 TB volume that is just shy of 70% full. It stores

          uploaded files for a web app. What I hate about this app, and

          so am always suspicious of, is that it stores a directory for

          every user in one level, so under the /data directory in the

          volume, there are 450,000 subdirectories at this point.</div>

        <div><br>

        </div>

        <div>The only real mitigation step that's been taken so far was

          to turn off the self-heal daemon on the volume, as I thought

          maybe crawling that large directory was getting expensive.

          This doesn't seem to have done anything as the problem still

          occurs.</div>

      </div>

      <div><br>

      </div>

      <div>At this point I figure one of two sorts of things is

        happening, broadly speaking: one, we're running into some

        sort of bug or performance problem with gluster that we should either

        fix, perhaps by upgrading, or tune around; or two, some process

        we're running but not aware of is hammering the file system and

        causing problems.</div>

      <div><br>

      </div>

      <div>If it's the latter option, can anyone give me any tips on

        figuring out what might be hammering the system? I can use

        volume top to see what a brick is doing, but I can't figure out

        how to tell which clients are doing what.</div>

      <div><br>

      </div>

      <div>Apologies for the somewhat broad nature of the question; any

        input or thoughts would be much appreciated. I can certainly

        provide more info about some things if it would help, but I've

        tried not to write a novel here.</div>

      <div><br>

      </div>

      <div>Thanks,</div>

    </blockquote>

    Could you enable 'gluster volume profile &lt;volname&gt; start' for

    this volume?<br>

    The next time this issue happens, keep collecting 'gluster volume

    profile &lt;volname&gt; info' outputs. Mail them and let's see what

    is happening.<br>
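    <div>For anyone who wants to automate the "keep collecting" step, here is a minimal sketch in Python. It only wraps the 'gluster volume profile &lt;volname&gt; info' command quoted above and saves each output to its own file. The function names, the file naming scheme, the sample count and the interval are all assumptions made for illustration, not anything prescribed by gluster.</div>

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def profile_info_command(volname):
    """Build the 'gluster volume profile VOLNAME info' command from the thread."""
    return ["gluster", "volume", "profile", volname, "info"]


def collect_profiles(volname, out_dir, samples=10, interval_s=60,
                     run=subprocess.run, sleep=time.sleep):
    """Save `samples` profile outputs to separate files and return their paths.

    `run` and `sleep` are injectable (hypothetical convenience) so the loop
    can be exercised without a real gluster installation.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i in range(samples):
        # check=True raises if the gluster CLI exits with an error.
        result = run(profile_info_command(volname),
                     capture_output=True, text=True, check=True)
        stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
        # The index keeps file names unique even within the same second.
        path = out / f"{volname}-profile-{stamp}-{i:03d}.txt"
        path.write_text(result.stdout)
        saved.append(path)
        if i + 1 != samples:
            sleep(interval_s)
    return saved
```

    <div>Assuming profiling was already started with 'gluster volume profile &lt;volname&gt; start', a call such as collect_profiles("myvol", "/tmp/gluster-profiles") on one of the nodes would leave ten files, one per minute, ready to attach to a mail. The volume name and directory there are placeholders.</div>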

    <br>

    Pranith<br>

    <blockquote cite="mid:1422942371.16185.1@smtp.gmail.com" type="cite">

      <div><br>

      </div>

      <div>-Matt</div>

      <br>


      <br>

      <pre wrap="">_______________________________________________

Gluster-users mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a>

<a class="moz-txt-link-freetext" href="http://www.gluster.org/mailman/listinfo/gluster-users">http://www.gluster.org/mailman/listinfo/gluster-users</a></pre>

    </blockquote>

    <br>

</blockquote></div>