<div dir="ltr">Hi all,<div><br></div><div>I have a 3-node Gluster cluster:<br>glusterfs 3.7.12 built on Jun 27 2016 12:40:53<br></div><div>3 x AWS EC2 m4.xlarge  1 x 1.5 TB STI EBS</div><div><br></div><div>We have had several issues where GlusterFS FUSE client hosts get stuck in the &#39;D&#39; (uninterruptible sleep) state when reading the file system, and we now have one of the 3 nodes with substantially more CPU load than the other two.</div><div><br></div><div>This morning (several hours ago) we shut down the node exhibiting the high CPU behaviour. That node is now back to normal, but one of the other two has high CPU.</div><div><br></div><div>The brick logs on both of the &#39;good&#39; nodes are emitting this over and over.  </div><div><br></div><div>[2016-10-30 20:55:48.483045] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 676120: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/fa44bf76-5706-40fc-b068-33e11a22bdd9/source (f4569925-6c13-4e51-8d97-38cf6c0b198a) ==&gt; (Operation not permitted) [Operation not permitted]<br></div><div><div>[2016-10-30 20:55:51.511782] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 695527: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/fa44bf76-5706-40fc-b068-33e11a22bdd9/source (f4569925-6c13-4e51-8d97-38cf6c0b198a) ==&gt; (Operation not permitted) [Operation not permitted]</div><div>[2016-10-30 20:55:51.553858] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 708963: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/c42cc627-adc6-43a4-93dd-ea6ec0eaa9cb/source (eaceb846-9ee1-4fab-8f4d-e2a93e85710f) ==&gt; (Operation not permitted) [Operation not permitted]</div><div>[2016-10-30 20:55:54.660251] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 709004: SETATTR /ftpdata/[REDACTED]/bulk_import/.batches/fa44bf76-5706-40fc-b068-33e11a22bdd9/source (f4569925-6c13-4e51-8d97-38cf6c0b198a) ==&gt; (Operation not permitted) 
[Operation not permitted]</div><div>[2016-10-30 20:55:54.866259] I [MSGID: 115072] [server-rpc-fops.c:1786:server_setattr_cbk] 0-marketplace_nfs-server: 356367: SETATTR /ftpdata/[REDACTED]/bulk_import (d7dda9df-bf56-4f71-b6b8-59f2ccc41016) ==&gt; (Operation not permitted) [Operation not permitted]</div></div><div><br></div><div>A large number of files are listed in the output of gluster volume heal &lt;vol&gt; info. </div><div><br></div><div>Client nodes using the GlusterFS FUSE client have been hanging in an uninterruptible state, which is causing issues.  I can see that files are missing on some of the GlusterFS bricks. I have used a client mounting the bricks via NFS to traverse the file system and stat each file and directory (find /&lt;mountpoints&gt; -type f -exec stat {} \;), and I can see files being healed. </div><div><br></div><div>We are hoping that this will get us back into a normal state, but we are unsure whether there is a larger problem looming.</div><div><br></div><div>I have included the statedump and current settings:</div><div><br></div><div><br></div><div>Option                                  Value</div><div>------                                  -----</div><div>cluster.lookup-unhashed                 on</div><div>cluster.lookup-optimize                 on</div><div>cluster.min-free-disk                   10%</div><div>cluster.min-free-inodes                 5%</div><div>cluster.rebalance-stats                 off</div><div>cluster.subvols-per-directory           (null)</div><div>cluster.readdir-optimize                off</div><div>cluster.rsync-hash-regex                (null)</div><div>cluster.extra-hash-regex                (null)</div><div>cluster.dht-xattr-name                  trusted.glusterfs.dht</div><div>cluster.randomize-hash-range-by-gfid    off</div><div>cluster.rebal-throttle                  normal</div><div>cluster.local-volume-name               (null)</div><div>cluster.weighted-rebalance              on</div><div>cluster.switch-pattern                  
(null)</div><div>cluster.entry-change-log                on</div><div>cluster.read-subvolume                  (null)</div><div>cluster.read-subvolume-index            -1</div><div>cluster.read-hash-mode                  1</div><div>cluster.background-self-heal-count      8</div><div>cluster.metadata-self-heal              on</div><div>cluster.data-self-heal                  on</div><div>cluster.entry-self-heal                 on</div><div>cluster.self-heal-daemon                disable</div><div>cluster.heal-timeout                    600</div><div>cluster.self-heal-window-size           1</div><div>cluster.data-change-log                 on</div><div>cluster.metadata-change-log             on</div><div>cluster.data-self-heal-algorithm        full</div><div>cluster.eager-lock                      on</div><div>disperse.eager-lock                     on</div><div>cluster.quorum-type                     none</div><div>cluster.quorum-count                    (null)</div><div>cluster.choose-local                    true</div><div>cluster.self-heal-readdir-size          1KB</div><div>cluster.post-op-delay-secs              1</div><div>cluster.ensure-durability               on</div><div>cluster.consistent-metadata             no</div><div>cluster.heal-wait-queue-length          128</div><div>cluster.stripe-block-size               128KB</div><div>cluster.stripe-coalesce                 true</div><div>diagnostics.latency-measurement         off</div><div>diagnostics.dump-fd-stats               off</div><div>diagnostics.count-fop-hits              off</div><div>diagnostics.brick-log-level             INFO</div><div>diagnostics.client-log-level            INFO</div><div>diagnostics.brick-sys-log-level         CRITICAL</div><div>diagnostics.client-sys-log-level        CRITICAL</div><div>diagnostics.brick-logger                (null)</div><div>diagnostics.client-logger               (null)</div><div>diagnostics.brick-log-format            
(null)</div><div>diagnostics.client-log-format           (null)</div><div>diagnostics.brick-log-buf-size          5</div><div>diagnostics.client-log-buf-size         5</div><div>diagnostics.brick-log-flush-timeout     120</div><div>diagnostics.client-log-flush-timeout    120</div><div>performance.cache-max-file-size         0</div><div>performance.cache-min-file-size         0</div><div>performance.cache-refresh-timeout       1</div><div>performance.cache-priority</div><div>performance.cache-size                  512MB</div><div>performance.io-thread-count             16</div><div>performance.high-prio-threads           16</div><div>performance.normal-prio-threads         16</div><div>performance.low-prio-threads            16</div><div>performance.least-prio-threads          1</div><div>performance.enable-least-priority       on</div><div>performance.least-rate-limit            0</div><div>performance.cache-size                  512MB</div><div>performance.flush-behind                on</div><div>performance.nfs.flush-behind            on</div><div>performance.write-behind-window-size    1MB</div><div>performance.resync-failed-syncs-after-fsync off</div><div>performance.nfs.write-behind-window-size 1MB</div><div>performance.strict-o-direct             off</div><div>performance.nfs.strict-o-direct         off</div><div>performance.strict-write-ordering       off</div><div>performance.nfs.strict-write-ordering   off</div><div>performance.lazy-open                   yes</div><div>performance.read-after-open             no</div><div>performance.read-ahead-page-count       4</div><div>performance.md-cache-timeout            1</div><div>performance.cache-swift-metadata        true</div><div>features.encryption                     off</div><div>encryption.master-key                   (null)</div><div>encryption.data-key-size                256</div><div>encryption.block-size                   4096</div><div>network.frame-timeout                   
1800</div><div>network.ping-timeout                    15</div><div>network.tcp-window-size                 (null)</div><div>features.lock-heal                      off</div><div>features.grace-timeout                  10</div><div>network.remote-dio                      disable</div><div>client.event-threads                    2</div><div>network.ping-timeout                    15</div><div>network.tcp-window-size                 (null)</div><div>network.inode-lru-limit                 16384</div><div>auth.allow                              *</div><div>auth.reject                             (null)</div><div>transport.keepalive                     (null)</div><div>server.allow-insecure                   (null)</div><div>server.root-squash                      off</div><div>server.anonuid                          65534</div><div>server.anongid                          65534</div><div>server.statedump-path                   /var/run/gluster</div><div>server.outstanding-rpc-limit            64</div><div>features.lock-heal                      off</div><div>features.grace-timeout                  (null)</div><div>server.ssl                              (null)</div><div>auth.ssl-allow                          *</div><div>server.manage-gids                      off</div><div>server.dynamic-auth                     on</div><div>client.send-gids                        on</div><div>server.gid-timeout                      300</div><div>server.own-thread                       (null)</div><div>server.event-threads                    2</div><div>ssl.own-cert                            (null)</div><div>ssl.private-key                         (null)</div><div>ssl.ca-list                             (null)</div><div>ssl.crl-path                            (null)</div><div>ssl.certificate-depth                   (null)</div><div>ssl.cipher-list                         (null)</div><div>ssl.dh-param                            (null)</div><div>ssl.ec-curve                            
(null)</div><div>performance.write-behind                on</div><div>performance.read-ahead                  on</div><div>performance.readdir-ahead               on</div><div>performance.io-cache                    on</div><div>performance.quick-read                  on</div><div>performance.open-behind                 on</div><div>performance.stat-prefetch               on</div><div>performance.client-io-threads           off</div><div>performance.nfs.write-behind            on</div><div>performance.nfs.read-ahead              off</div><div>performance.nfs.io-cache                off</div><div>performance.nfs.quick-read              off</div><div>performance.nfs.stat-prefetch           off</div><div>performance.nfs.io-threads              off</div><div>performance.force-readdirp              true</div><div>features.file-snapshot                  off</div><div>features.uss                            off</div><div>features.snapshot-directory             .snaps</div><div>features.show-snapshot-directory        off</div><div>network.compression                     off</div><div>network.compression.window-size         -15</div><div>network.compression.mem-level           8</div><div>network.compression.min-size            0</div><div>network.compression.compression-level   -1</div><div>network.compression.debug               false</div><div>features.limit-usage                    (null)</div><div>features.quota-timeout                  0</div><div>features.default-soft-limit             80%</div><div>features.soft-timeout                   60</div><div>features.hard-timeout                   5</div><div>features.alert-time                     86400</div><div>features.quota-deem-statfs              off</div><div>geo-replication.indexing                off</div><div>geo-replication.indexing                off</div><div>geo-replication.ignore-pid-check        off</div><div>geo-replication.ignore-pid-check        off</div><div>features.quota                          
off</div><div>features.inode-quota                    off</div><div>features.bitrot                         disable</div><div>debug.trace                             off</div><div>debug.log-history                       no</div><div>debug.log-file                          no</div><div>debug.exclude-ops                       (null)</div><div>debug.include-ops                       (null)</div><div>debug.error-gen                         off</div><div>debug.error-failure                     (null)</div><div>debug.error-number                      (null)</div><div>debug.random-failure                    off</div><div>debug.error-fops                        (null)</div><div>nfs.enable-ino32                        no</div><div>nfs.mem-factor                          15</div><div>nfs.export-dirs                         on</div><div>nfs.export-volumes                      on</div><div>nfs.addr-namelookup                     off</div><div>nfs.dynamic-volumes                     off</div><div>nfs.register-with-portmap               on</div><div>nfs.outstanding-rpc-limit               16</div><div>nfs.port                                2049</div><div>nfs.rpc-auth-unix                       on</div><div>nfs.rpc-auth-null                       on</div><div>nfs.rpc-auth-allow                      all</div><div>nfs.rpc-auth-reject                     none</div><div>nfs.ports-insecure                      off</div><div>nfs.trusted-sync                        off</div><div>nfs.trusted-write                       off</div><div>nfs.volume-access                       read-write</div><div>nfs.export-dir</div><div>nfs.disable                             false</div><div>nfs.nlm                                 on</div><div>nfs.acl                                 on</div><div>nfs.mount-udp                           off</div><div>nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab</div><div>nfs.rpc-statd                           
/sbin/rpc.statd</div><div>nfs.server-aux-gids                     off</div><div>nfs.drc                                 off</div><div>nfs.drc-size                            0x20000</div><div>nfs.read-size                           (1 * 1048576ULL)</div><div>nfs.write-size                          (1 * 1048576ULL)</div><div>nfs.readdir-size                        (1 * 1048576ULL)</div><div>nfs.rdirplus                            on</div><div>nfs.exports-auth-enable                 (null)</div><div>nfs.auth-refresh-interval-sec           (null)</div><div>nfs.auth-cache-ttl-sec                  (null)</div><div>features.read-only                      off</div><div>features.worm                           off</div><div>storage.linux-aio                       off</div><div>storage.batch-fsync-mode                reverse-fsync</div><div>storage.batch-fsync-delay-usec          0</div><div>storage.owner-uid                       -1</div><div>storage.owner-gid                       -1</div><div>storage.node-uuid-pathinfo              off</div><div>storage.health-check-interval           30</div><div>storage.build-pgfid                     off</div><div>storage.bd-aio                          off</div><div>cluster.server-quorum-type              off</div><div>cluster.server-quorum-ratio             51%</div><div>changelog.changelog                     off</div><div>changelog.changelog-dir                 (null)</div><div>changelog.encoding                      ascii</div><div>changelog.rollover-time                 15</div><div>changelog.fsync-interval                5</div><div>changelog.changelog-barrier-timeout     120</div><div>changelog.capture-del-path              off</div><div>features.barrier                        disable</div><div>features.barrier-timeout                120</div><div>features.trash                          off</div><div>features.trash-dir                      .trashcan</div><div>features.trash-eliminate-path           
(null)</div><div>features.trash-max-filesize             5MB</div><div>features.trash-internal-op              off</div><div>cluster.enable-shared-storage           disable</div><div>cluster.write-freq-threshold            0</div><div>cluster.read-freq-threshold             0</div><div>cluster.tier-pause                      off</div><div>cluster.tier-promote-frequency          120</div><div>cluster.tier-demote-frequency           3600</div><div>cluster.watermark-hi                    90</div><div>cluster.watermark-low                   75</div><div>cluster.tier-mode                       cache</div><div>cluster.tier-max-mb                     4000</div><div>cluster.tier-max-files                  10000</div><div>features.ctr-enabled                    off</div><div>features.record-counters                off</div><div>features.ctr-record-metadata-heat       off</div><div>features.ctr_link_consistency           off</div><div>features.ctr_lookupheal_link_timeout    300</div><div>features.ctr_lookupheal_inode_timeout   300</div><div>features.ctr-sql-db-cachesize           1000</div><div>features.ctr-sql-db-wal-autocheckpoint  1000</div><div>locks.trace                             (null)</div><div>cluster.disperse-self-heal-daemon       enable</div><div>cluster.quorum-reads                    no</div><div>client.bind-insecure                    (null)</div><div>ganesha.enable                          off</div><div>features.shard                          off</div><div>features.shard-block-size               4MB</div><div>features.scrub-throttle                 lazy</div><div>features.scrub-freq                     biweekly</div><div>features.scrub                          false</div><div>features.expiry-time                    120</div><div>features.cache-invalidation             off</div><div>features.cache-invalidation-timeout     60</div><div>disperse.background-heals               8</div><div>disperse.heal-wait-qlength              128</div><div>dht.force-readdirp  
                    on</div><div>disperse.read-policy                    round-robin</div><div>cluster.shd-max-threads                 1</div><div>cluster.shd-wait-qlength                1024</div><div>cluster.locking-scheme                  full</div><div><br></div><div>Any help is appreciated.</div><div><br></div><div>Den</div></div>
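<div><br></div><div>For reference, the traversal described above (stat every entry through a client mount so that the lookups give Gluster a chance to heal) can be sketched in Python as below. This is only a minimal sketch, not a Gluster tool: the function name stat_tree is made up for illustration, the demo runs against a temporary directory rather than a real mount, and unlike the find command quoted above (which uses -type f) it stats directories as well as files.</div>

```python
import os
import tempfile


def stat_tree(root):
    """Walk root and stat every directory and file beneath it.

    Hypothetical helper for illustration. Returns (count, errors), where
    count is the number of entries statted and errors is a list of the
    OSError exceptions hit along the way. The stat results are discarded;
    on a Gluster client mount it is the lookups that matter.
    """
    statted = 0
    errors = []
    # onerror collects directory-listing failures instead of hiding them.
    for dirpath, dirnames, filenames in os.walk(root, onerror=errors.append):
        paths = [dirpath] + [os.path.join(dirpath, n) for n in filenames]
        for path in paths:
            try:
                os.lstat(path)  # lstat: do not follow symlinks
                statted += 1
            except OSError as exc:
                errors.append(exc)
    return statted, errors


if __name__ == "__main__":
    # Demo on a throwaway tree; on a real system root would be the
    # client mount point (an assumption, e.g. wherever the volume is mounted).
    with tempfile.TemporaryDirectory() as demo_root:
        os.makedirs(os.path.join(demo_root, "bulk_import"))
        open(os.path.join(demo_root, "bulk_import", "source"), "w").close()
        count, errs = stat_tree(demo_root)
        print("statted", count, "entries with", len(errs), "errors")
```

<div>Collecting the errors rather than stopping on the first one is deliberate: on a volume with pending heals some entries may fail, and the list shows which paths still need attention.</div>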