Planet Gluster

Aggregated news from external sources

The Funniest Incident Postmortem

2019-02-08

Recently, I had a chance to think about an outage that I debugged and fixed a few years ago that involves Jenkins and systemd (or in this case lack thereof!).

Generally, if you want to run a task at the end of every Jenkins job whether the job has passed or failed, you have two options. You could use trap and write a clean up function. I would highly recommend that you use trap. Or you could be like me and write a post-build publisher that would run a script if it finds the line “Building remotely” in the console output. It’s quite hacky, but since the first line of every job is “Building remotely“, it works. I used to depend on this for clean up on a couple of Jenkins jobs a while ago and later removed it because of this infamous outage.

The Problem

Let me preface this by stating and this happened due to a combination of factors that I don’t expect to repeat. We were using an old version of Jenkins on an old version of CentOS. This means, it was still using init scripts and not systemd. The init file is just a shell script.

If you didn’t already know, SSH tends to forward your LANG information to the environment you connect to and force that environment to be similar to your current locale. I use en_US, but my French sysadmin colleague uses fr_FR locale. Which mean if I connect to a server, I would have English errors messages and if he did, he would have French ones.

When my colleague restarted Jenkins on that fateful day, his environment leaked into the Jenkins init script possibly due to a bug. Voila! Jenkins now speaks French. This meant my clean up didn’t work anymore. Instead of “Building Remotely” we had “Construction à distance“. Obviously, all the jobs failed.

The Solution

I had to stop and start Jenkins again so it spoke English. We made plans to upgrade both the OS and Jenkins so we didn’t run into this specific bug again. Aside from making sure that Jenkins didn’t accidentally speak French again, we also removed the clean up script.

In this case, the the job was creating rpms using mock. We would run mock with sudo and that meant the rpms were owned by the root user and the jenkins user could not delete the rpms. My solution back then was to use ACLs to give the jenkins user write access to files in the Jenkins workspace folder irrespective of the real owner. You can read my original postmortem on the gluster-infra mailing list archives.

We are currently in the process of changing hosting providers. The fix with ACLs always seemed hacky to me and I wanted to take this chance to remove the ACLs entirely. I’ve just added the jenkins user to the mock group and we build rpms without using sudo. That solves all the problems much more cleanly.

But hey, it brings me great joy to say we had a bug where Jenkins spoke French and thus caused a fun day of debugging and fixing.

Source: nigelb (The Funniest Incident Postmortem)

BLOG

  • 06 Dec 2020
    Looking back at 2020 – with g...

    2020 has not been a year we would have been able to predict. With a worldwide pandemic and lives thrown out of gear, as we head into 2021, we are thankful that our community and project continued to receive new developers, users and make small gains. For that and a...

    Read more
  • 27 Apr 2020
    Update from the team

    It has been a while since we provided an update to the Gluster community. Across the world various nations, states and localities have put together sets of guidelines around shelter-in-place and quarantine. We request our community members to stay safe, to care for their loved ones, to continue to be...

    Read more
  • 03 Feb 2020
    Building a longer term focus for Gl...

    The initial rounds of conversation around the planning of content for release 8 has helped the project identify one key thing – the need to stagger out features and enhancements over multiple releases. Thus, while release 8 is unlikely to be feature heavy as previous releases, it will be the...

    Read more