<p dir="ltr">Infra team,</p>

<p dir="ltr">Given that the regression queue is always overloaded and spurious failures keep haunting us, can we try this out? We should also consider a separate smoke-verified flag, to give reviewers enough confidence, as Jeff pointed out in this thread.</p>

<p dir="ltr">-Atin</p>

<div class="gmail_quote">On Jun 15, 2015 4:19 PM, &quot;Kaushal M&quot; &lt;<a href="mailto:kshlmster@gmail.com">kshlmster@gmail.com</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>

<br>

The recent rush of reviews being sent due to the release of 3.7 was a<br>

cause of frustration for many of us because of the regression tests<br>

(gerrit troubles themselves are another thing).<br>

<br>

With respect to regression, the three main sources of frustration were:<br>

1. Spurious test failures<br>

2. Long wait times<br>

3. Regression slave troubles<br>

<br>

We&#39;ve already tackled the spurious failure issue and are quite stable<br>

now. The trouble with the slave vms is related to the gerrit issues,<br>

and is mainly due to the network issues we are having between the<br>

data-centers hosting the slaves and gerrit/jenkins. People have been<br>

looking into this, but we haven&#39;t had much success. This leaves the<br>

issue of the long wait times.<br>

<br>

The long wait times are because of the long queues of pending jobs,<br>

some of which take days to get scheduled. Two things cause the long<br>

queues,<br>

1. Automatic regression job triggering for all submissions to gerrit<br>

2. Long run time for regression (~2h)<br>
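<br>

A rough back-of-the-envelope sketch of how these two factors compound. Only the ~2h run time comes from the figures above; the queue length and slave count are made-up illustration values, not measurements, and the model assumes a simple FIFO queue with every job taking the same time.<br>

```python
RUN_HOURS = 2.0  # approximate regression run time quoted above


def estimated_wait_hours(pending_jobs, slaves, run_hours=RUN_HOURS):
    """Hours before a newly queued job starts running.

    Assumes a FIFO queue, identical run times, and that all slaves
    pick up jobs at the same moment (a simplification).
    """
    # Full batches of jobs that must finish before a slave frees up for us.
    full_batches_ahead = pending_jobs // slaves
    return full_batches_ahead * run_hours


# Hypothetical numbers: 60 queued jobs shared by 5 slaves.
print(estimated_wait_hours(60, 5))
```

Under these assumed numbers a job waits a full day before it even starts, and a spurious failure sends it to the back of the queue to wait again.<br>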

<br>

The long queues, coupled with the spurious failures and network<br>

problems, meant that jobs would fail for no reason after a long wait,<br>

and would have to be added to the back of the queue to be re-run. This<br>

meant that developers would have to wait days for their changes to get<br>

merged, and was one of the causes for the delay in the release of 3.7.<br>

<br>

The solution is to reduce wait times for regression runs. To reduce wait<br>

times we should,<br>

1. Trigger runs only when required<br>

2. Reduce regression run time.<br>

<br>

Raghavendra Talur (rtalur/RaSTar) will soon send out a mail with his<br>

findings on the regression run times, and we can continue discussion<br>

on it on that thread.<br>

<br>

Earlier, the regression runs used to be manually triggered by the<br>

maintainers once they had determined that a change was ready for<br>

submission. But as there were only two maintainers before (Vijay and<br>

Avati) auto triggering was brought in to reduce their load. Auto<br>

triggering worked fine when we had a lower volume of changes being<br>

submitted to gerrit. But now, with the large volumes we see during the<br>

release freeze dates, auto triggering just adds to problems.<br>

<br>

I propose that we move back to the old model of starting regression<br>

runs only once the maintainers are ready to merge. But instead of the<br>

maintainers manually triggering the runs, we could automate it.<br>

<br>

We can model our new workflow on those of OpenStack[1] and<br>

Wikimedia[2]. The existing Gerrit plugin for Jenkins doesn&#39;t provide<br>

the features necessary to enable selective triggering based on Gerrit<br>

flags. Both OpenStack and Wikimedia use a project gating tool called<br>

Zuul[3], which provides a much better integration with Jenkins and<br>

Gerrit and more features on top.<br>

<br>

I propose the following workflow:<br>

<br>

- Developer pushes change to Gerrit.<br>

  - Zuul is notified by Gerrit of new change<br>

- Zuul runs pre-review checks on Jenkins. This will be the current smoke tests.<br>

  - Zuul reports back status of the checks to Gerrit.<br>

    - If checks fail, developer will need to resend the change after<br>

the required fixes. The process starts once more.<br>

    - If the checks pass, the change is now ready for review<br>

- The change is now reviewed by other developers and maintainers.<br>

Non-maintainers will be able to give only a +1 review.<br>

  - On a negative review, the developer will need to rework the change<br>

and resend it. The process starts once more.<br>

- The maintainer gives a +2 review once he/she is satisfied. The<br>

maintainer&#39;s work is done here.<br>

  - Zuul is notified of the +2 review<br>

- Zuul runs the regression runs and reports back the status.<br>

  - If the regression runs fail, the process starts over again.<br>

  - If the runs pass, the change is ready for acceptance.<br>

- Zuul will pick the change into the repository.<br>

  - If the pick fails, Zuul will report back the failure, and the<br>

process starts once again.<br>
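<br>

The flow above can be read as a small state machine. The sketch below is only a model of the proposed gating order in plain Python; it does not use Zuul, Gerrit or Jenkins, and the names Stage, PIPELINE, advance and run_change are hypothetical, chosen for illustration. It assumes that every failure (failed checks, negative review, failed regression, failed pick) sends the change back to the developer.<br>

```python
from enum import Enum


class Stage(Enum):
    SMOKE = "pre-review smoke checks"
    REVIEW = "waiting for maintainer +2"
    REGRESSION = "regression run"
    MERGE = "pick into repository"
    MERGED = "merged"
    REWORK = "developer must resend the change"


# Order of the gates a change has to pass. Regression comes only
# after the +2 review, which is the point of the proposal.
PIPELINE = [Stage.SMOKE, Stage.REVIEW, Stage.REGRESSION, Stage.MERGE]


def advance(stage, passed):
    """Return the stage that follows the outcome of `stage`."""
    if not passed:
        return Stage.REWORK  # the process starts once more
    position = PIPELINE.index(stage)
    if position + 1 == len(PIPELINE):
        return Stage.MERGED
    return PIPELINE[position + 1]


def run_change(outcomes):
    """Walk one change through the gates.

    `outcomes` maps a Stage to True/False; stages that are not
    listed are assumed to pass.
    """
    stage = Stage.SMOKE
    while stage in PIPELINE:
        stage = advance(stage, outcomes.get(stage, True))
    return stage
```

For example, run_change({}) ends in Stage.MERGED, while run_change({Stage.REVIEW: False}) ends in Stage.REWORK without ever reaching the regression stage, so no regression slave time is spent on a change that is not ready.<br>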

<br>

Following this flow should,<br>

1. Reduce regression wait time<br>

2. Improve change acceptance time<br>

3. Reduce unnecessary waste of infra resources<br>

4. Improve infra stability.<br>

<br>

It also brings a drawback: we need to maintain one more piece<br>

of infra (Zuul). This would be an additional maintenance overhead on<br>

top of Gerrit, Jenkins and the current slaves. But I feel the<br>

reduction in the upkeep efforts of the slaves would be enough to<br>

offset this.<br>

<br>

tl;dr<br>

Current auto-triggering of regression runs is stupid and a waste of<br>

time and resources. Bring in a project gating system, Zuul, which can<br>

do much more intelligent job triggering, and use it to<br>

automatically trigger regression only for changes with Reviewed+2 and<br>

automatically merge ones that pass.<br>

<br>

What does the community think of this?<br>

<br>

~kaushal<br>

<br>

[1]: <a href="http://docs.openstack.org/infra/manual/developers.html#automated-testing" rel="noreferrer" target="_blank">http://docs.openstack.org/infra/manual/developers.html#automated-testing</a><br>

[2]: <a href="https://www.mediawiki.org/wiki/Continuous_integration/Workflow" rel="noreferrer" target="_blank">https://www.mediawiki.org/wiki/Continuous_integration/Workflow</a><br>

[3]: <a href="http://docs.openstack.org/infra/zuul/" rel="noreferrer" target="_blank">http://docs.openstack.org/infra/zuul/</a><br>


</blockquote></div>