Planet Gluster

Aggregated news from external sources

The Funniest Incident Postmortem

2019-02-08

Recently, I had a chance to think about an outage that I debugged and fixed a few years ago that involves Jenkins and systemd (or in this case lack thereof!).

Generally, if you want to run a task at the end of every Jenkins job whether the job has passed or failed, you have two options. You could use trap and write a clean up function. I would highly recommend that you use trap. Or you could be like me and write a post-build publisher that would run a script if it finds the line “Building remotely” in the console output. It’s quite hacky, but since the first line of every job is “Building remotely“, it works. I used to depend on this for clean up on a couple of Jenkins jobs a while ago and later removed it because of this infamous outage.

The Problem

Let me preface this by stating and this happened due to a combination of factors that I don’t expect to repeat. We were using an old version of Jenkins on an old version of CentOS. This means, it was still using init scripts and not systemd. The init file is just a shell script.

If you didn’t already know, SSH tends to forward your LANG information to the environment you connect to and force that environment to be similar to your current locale. I use en_US, but my French sysadmin colleague uses fr_FR locale. Which mean if I connect to a server, I would have English errors messages and if he did, he would have French ones.

When my colleague restarted Jenkins on that fateful day, his environment leaked into the Jenkins init script possibly due to a bug. Voila! Jenkins now speaks French. This meant my clean up didn’t work anymore. Instead of “Building Remotely” we had “Construction à distance“. Obviously, all the jobs failed.

The Solution

I had to stop and start Jenkins again so it spoke English. We made plans to upgrade both the OS and Jenkins so we didn’t run into this specific bug again. Aside from making sure that Jenkins didn’t accidentally speak French again, we also removed the clean up script.

In this case, the the job was creating rpms using mock. We would run mock with sudo and that meant the rpms were owned by the root user and the jenkins user could not delete the rpms. My solution back then was to use ACLs to give the jenkins user write access to files in the Jenkins workspace folder irrespective of the real owner. You can read my original postmortem on the gluster-infra mailing list archives.

We are currently in the process of changing hosting providers. The fix with ACLs always seemed hacky to me and I wanted to take this chance to remove the ACLs entirely. I’ve just added the jenkins user to the mock group and we build rpms without using sudo. That solves all the problems much more cleanly.

But hey, it brings me great joy to say we had a bug where Jenkins spoke French and thus caused a fun day of debugging and fixing.

Source: nigelb (The Funniest Incident Postmortem)

BLOG

  • 28 Nov 2019
    Planning ahead for Gluster releases

    In order to plan the content for upcoming releases, it is good to take a moment of pause, step back and attempt to look at the consumption of GlusterFS within large enterprises. With the enterprise architecture taking large strides towards cloud and more specifically, the hybrid cloud, continued efforts towards...

    Read more
  • 13 Nov 2019
    Announcing Gluster 7.0

    The Gluster community is pleased to announce the release of 7.0, our latest release. This is a major release that includes a range of code improvements and stability fixes along with a few features as noted below. A selection of the key features and bugs addressed are documented in this...

    Read more
  • 15 Oct 2019
    Gluster and CentOS Stream

    Progress cannot be made without change. As technologists, we recognize this every day. Most of the time, these changes are iterative: progresssive additions of features to projects like Gluster. Sometimes those changes are small, and sometimes not. And that’s, of course, just talking about our project. But one of the...

    Read more