[Gluster-infra] Postmortem for Gerrit Upgrade

Nigel Babu nigelb at redhat.com
Tue Jun 7 01:43:37 UTC 2016


Hello folks,

Here's a postmortem for the Gerrit migration issues.

# Timeline of Events
May 25 - Test migration to PostgreSQL
May 27 - Migration to PostgreSQL on production
May 30 - Staging with Gerrit 2.12.2 available for testing
Jun 01 - Gerrit upgrade complete (0310)
Jun 01 - First notification of login issues (0628)
Jun 03 - Fix applied on test server and email sent out to affected users to test.
Jun 06 - Fix applied on production server

# Problems
Over the years, Gerrit has changed how it handles user accounts. In Gerrit 2.9,
the Github plugin allowed users to sign up and then set their username. After
upgrading to 2.12, we discovered that the username now defaults to the user's
Github username. In our instance, this affects a small subset of people.
Additionally, very few of the people affected by this bug tested the staging
instance (only one person ran into the bug). Even in production, only those
users who signed out of review.gluster.org after the upgrade ran into the
problem. Quite a few affected users did not realize they were affected because
they didn't log out. By the time the issues were reported, we were a few hours
into our upgrade and rollback wasn't an option anymore.

# Solution
I first checked whether we had an easy fix. I looked at a different plugin for
Github authentication[1]. This plugin claims to allow users to map different
external identities onto one Gerrit user. I timeboxed this testing to a few
hours and could not get it working in that time. With quick fixes eliminated, I
spent some time diagnosing the issue in detail. In the meantime, I reached out
to the Gerrit mailing list for help. From conversations with Michael and
Raghavendra, I learned that we've run into problems like this before and Justin
has fixed them. I reached out to Justin as well for help.

By the next morning (Jun 2), I had a good idea of what was wrong and a few
ideas on how to fix it. Justin had gotten back to me as well, so I had more
information to confirm my diagnosis. People whose Github username was the same
as their gerrit username had no issues. Some people had a completely different
Github username from their gerrit username. And some people had multiple
usernames against the same account_id (one of them matching their Github
account). The older version of Gerrit + Github plugin seemed to tolerate both
of these situations. The newer version was less forgiving about this
inconsistency. When I removed the entry in accounts_external_ids which
corresponded to gerrit:<github-username>, a new account would be created on the
next login for those who had issues. This was the safest method of all.
However, it had the side effect that the new user would have none of the
history of the old one. I also tried renaming the username. The side effects of
this were unknown, but it seemed to work as well. Renaming meant that users
would retain their history, but their first git push/pull would fail until they
changed the clone path. I checked with the Gerrit mailing list about side
effects of renaming usernames. There are side effects, but they don't affect
our particular use case, so we were free to do so.
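
As an illustration only, the three account states described above can be
modeled in a few lines of Python. This is a sketch, not the diagnosis that was
actually run: the row layout `(account_id, external_id)`, the mapping of
account ids to Github usernames, and the sample names are all assumptions made
for the example. Only the `gerrit:<username>` form comes from the text above.

```python
from collections import defaultdict

GERRIT_PREFIX = "gerrit:"  # external id form mentioned in the postmortem


def classify_accounts(external_ids, github_usernames):
    """Group account ids by how their gerrit: usernames relate to Github.

    external_ids: iterable of (account_id, external_id) rows (assumed layout).
    github_usernames: dict of account_id -> Github username (assumed input).
    """
    names_by_account = defaultdict(set)
    for account_id, external_id in external_ids:
        if external_id.startswith(GERRIT_PREFIX):
            names_by_account[account_id].add(external_id[len(GERRIT_PREFIX):])

    result = {"consistent": [], "different": [], "multiple": []}
    for account_id in sorted(names_by_account):
        github = github_usernames.get(account_id)
        if github is None:
            continue  # no Github identity known for this account; skip it
        names = names_by_account[account_id]
        if len(names) > 1:
            result["multiple"].append(account_id)
        elif names == {github}:
            result["consistent"].append(account_id)
        else:
            result["different"].append(account_id)
    return result


# Hypothetical sample data, not taken from the real database.
SAMPLE_ROWS = [
    (1, "gerrit:alice"),
    (1, "mailto:alice@example.org"),
    (2, "gerrit:bob"),
    (3, "gerrit:carol"),
    (3, "gerrit:carol-gh"),
]
SAMPLE_GITHUB = {1: "alice", 2: "bob-gh", 3: "carol-gh"}
```

With the sample data, account 1 is consistent, account 2 has a different
username, and account 3 has multiple usernames.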

On Friday (Jun 3), I wrote a SQL script to update everyone's accounts to a
consistent state. If you had a different username in gerrit from Github, your
gerrit username would be changed to match your Github account. If you had
multiple usernames, only the one matching the Github username would be kept. I
ran it on the test server and emailed everyone it affected to test logging in
and doing reviews. Huge thanks to Niels, Prashanth, Jiffin, and others for
testing the instance and reporting issues they came across.
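
The two rules above can be sketched as a small planning function. This is not
the SQL script that was run; it is a Python illustration of the same decision
logic, and the function name, the action tuples, and the handling of cases the
text does not describe are assumptions.

```python
def plan_actions(gerrit_names, github_username):
    """Return the changes needed to make one account consistent.

    gerrit_names: set of gerrit usernames currently attached to the account.
    github_username: the account's Github username.
    Actions are ("rename", old, new) or ("delete", name) tuples (made-up format).
    """
    names = sorted(gerrit_names)
    if not names or names == [github_username]:
        return []  # nothing to change
    if github_username in names:
        # Multiple usernames: keep only the one matching Github.
        return [("delete", name) for name in names if name != github_username]
    if len(names) == 1:
        # Single, different username: rename it to match Github.
        return [("rename", names[0], github_username)]
    # Several usernames and none matches Github: the postmortem does not
    # describe this case, so the sketch refuses to guess.
    raise ValueError("no rule for multiple usernames without a Github match")
```

For example, `plan_actions({"carol", "carol-gh"}, "carol-gh")` yields a single
delete of `carol`, and `plan_actions({"bob"}, "bob-gh")` yields a single
rename. Printing the plan before applying anything makes it easier to review
the changes on a test server first.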

On Monday (Jun 6), I backed up the database and ran this script in production.
A few people had issues pushing/pulling, but everyone has now figured out the
changes they need to make in .git/config to get things working.
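
For anyone who still needs to make that change, the edit amounts to replacing
the old username in the remote URL. A minimal sketch follows, assuming the
remote uses an `ssh://user@host/...` style URL. The usernames and the project
path in the example are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_remote_user(url, old_user, new_user):
    """Swap the username in an ssh:// remote URL; leave other URLs untouched."""
    parts = urlsplit(url)
    if parts.scheme != "ssh" or parts.username != old_user:
        return url
    netloc = f"{new_user}@{parts.hostname}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Hypothetical example:
# rewrite_remote_user("ssh://olduser@review.gluster.org/glusterfs",
#                     "olduser", "newuser")
# returns "ssh://newuser@review.gluster.org/glusterfs"
```

The resulting URL can be written back with `git remote set-url <remote> <url>`
or by editing the `url =` line in .git/config by hand.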

# What we Learned
* Gerrit's ssh-based flush-cache[2] command needs to be used after changing
  anything in the user table.
* After a Gerrit restart, it takes a while for login to start working again.
  How long depends on the machine's CPU/RAM. The wait is much shorter on the
  production machine.
* We have a reasonably good idea about Gerrit's accounts_external_ids table.

# What Went Well
* We had a fix deployed within 3 working days of the issue being reported.
* We've ruled out any repeat of this particular issue in the future.
* This incident is documented very well, including the different approaches
  and their outcomes.

# What Went Badly
* We did not have documentation of previous Gerrit upgrade issues.
* We did not test the new version of Gerrit enough, and did not allow enough
  time for testing.
* When issues were noticed, the rollback plans were non-viable. We'd like to be
  in a place where we catch these in staging, or at least soon enough in
  production that we can roll back.

# Notes for Future
* Document previous issues and post-mortems. I will be working on creating a
  place for this. This post-mortem and any future ones will be available in a
  public place.
* Dogfood Gerrit. Most of the code other than our project code goes directly
  into Github. I would like for new projects that I maintain to be running and
  reviewed on Gerrit with a replication to Github.
* Establish an official staging site for Gerrit.
* Establish a week-long testing period before every upgrade with a small team
  of volunteers.
* Have a small team of developers be around during upgrades, so we can do
  immediate tests of the upgrade.

[1] https://github.com/davido/gerrit-oauth-provider
[2] https://gerrit-review.googlesource.com/Documentation/cmd-flush-caches.html


--
nigelb

