Scaling Yosemite

>> Friday, March 20, 2015

We migrated most of our Mac OS X 10.8 (Mountain Lion) test machines to 10.10.2 (Yosemite) this quarter.

This project had two major constraints:
1) Use the existing hardware pool (~100 r5 mac minis)
2) Keep wait times sane [1]. (The machines are constantly running tests most of the day due to the distributed nature of the Mozilla community, and this had to continue during the migration.)

So basically upgrade all the machines without letting people notice what you're doing!

Yosemite Valley - Tunnel View Sunrise by ©jeffkrause, Creative Commons by-nc-sa 2.0

Why didn't we just buy more minis and add them to the existing pool of test machines?
  1. We run performance tests and thus need to have all the machines running the same hardware within a pool so performance comparisons are valid.  If we buy new hardware, we need to replace the entire pool at once.  Machines with different hardware specifications = useless performance test comparisons.
  2. We tried to purchase some used machines with the same hardware specs as our existing machines.  However, we couldn't find a source for them.  As Apple stops production of old mini hardware each time they announce a new one, they are difficult and expensive to source.
Apple Pi by ©apionid, Creative Commons by-nc-sa 2.0

Given that Yosemite was released last October, why are we only upgrading our test pool now? We wait until the population of users running a new platform [2] surpasses the population on the old one before switching.

Mountain Lion -> Yosemite is an easy upgrade on your laptop.  It's not as simple when you're updating production machines that run tests at scale.

The first step was to pull a few machines out of production and verify the Puppet configuration was working. In Puppet, you can specify commands that run only on certain operating system versions, so we implemented several commands to accommodate changes for Yosemite. For instance, we changed the default scrollbar behaviour, disabled new services that interfere with test runs, and configured the new Apple security permissions that debug tests require.
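As a rough illustration of version-conditional configuration (not our actual Puppet manifests), here is a minimal Python sketch. The step names are descriptive labels taken from the paragraph above rather than real commands, and `COMMON_STEPS`, `VERSION_STEPS` and `steps_for` are hypothetical names invented for this example.

```python
# Hypothetical sketch: pick setup steps based on the OS X version of a machine.
# The strings are descriptive labels, not real commands or Puppet resources.
COMMON_STEPS = ["apply base configuration"]  # placeholder for version-independent setup

VERSION_STEPS = {
    "10.10": [
        "set default scrollbar behaviour",
        "disable services that interfere with test runs",
        "configure security permissions for debug tests",
    ],
}


def steps_for(os_version):
    """Return the setup steps that apply to a version string such as '10.10.2'."""
    major_minor = ".".join(os_version.split(".")[:2])  # "10.10.2" -> "10.10"
    return COMMON_STEPS + VERSION_STEPS.get(major_minor, [])


if __name__ == "__main__":
    for version in ("10.8", "10.10.2"):
        print(version, steps_for(version))
```

A Mountain Lion machine only gets the common steps, while a Yosemite machine gets the extra ones as well, which is the behaviour we wanted from the version-specific commands.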

Once the Puppet configuration was stable, I updated our configs so that people could run tests on Try, and allocated a few machines to this pool. We opened bugs for tests that failed on Yosemite but passed on other platforms. This was a very iterative process: run tests on Try, look at failures, file bugs, fix test manifests. Once we had the opt (functional) tests in a green state on Try, we could start the migration.

Migration strategy
  • Disable selected Mountain Lion machines from the production pool
  • Reimage as Yosemite, update DNS and let them puppetize
  • Land patches to disable Mountain Lion tests and enable corresponding Yosemite tests on selected branches
  • Enable Yosemite machines to take production jobs
  • Reconfig so the buildbot masters enable new Yosemite builders and schedule jobs appropriately
  • Repeat this process in batches
    • Enable Yosemite opt and performance tests on trunk (gecko >= 39) (50 machines)
    • Enable Yosemite debug (25 more machines)
    • Enable Yosemite on mozilla-aurora (15 more machines)
We currently have 14 machines left on Mountain Lion for mozilla-beta and mozilla-release branches.
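To see how the pool shifted from one platform to the other, the batch sizes above can be tallied with a short Python sketch. The numbers come from the list above; the function name `rollout_progress` is hypothetical and only for illustration.

```python
# Batch sizes are taken from the migration list; the labels are descriptive only.
BATCHES = [
    ("opt and performance tests on trunk", 50),
    ("debug tests", 25),
    ("mozilla-aurora", 15),
]
REMAINING_ON_MOUNTAIN_LION = 14  # kept for mozilla-beta and mozilla-release


def rollout_progress(batches, remaining):
    """Yield (label, machines on Yosemite so far, machines still on Mountain Lion)."""
    total = sum(size for _, size in batches) + remaining
    migrated = 0
    for label, size in batches:
        migrated += size
        yield label, migrated, total - migrated


if __name__ == "__main__":
    for label, migrated, left in rollout_progress(BATCHES, REMAINING_ON_MOUNTAIN_LION):
        print(f"{label}: {migrated} on Yosemite, {left} on Mountain Lion")
```

The three batches add up to 90 machines, and with the 14 still on Mountain Lion that gives 104, consistent with the roughly 100 minis in the pool.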

As I mentioned earlier, the two constraints with this project were to use the existing hardware pool that constantly runs tests in production and to keep the existing wait times sane. We encountered two major problems that impeded that goal:

It's a compliment when people say things like "I didn't realize that you updated a platform" because it means the upgrade did not cause large-scale fires for all to see. So it was nice to hear that from one of my colleagues this week.

Thanks to philor, RyanVM and jmaher for opening bugs with respect to failing tests and greening them up. Thanks to coop for many code reviews. Thanks to dividehex for reimaging all the machines in batches and to arr for her valiant attempts to source new-to-us minis!

References
[1] Wait times represent the time from when a job is added to the scheduler database until it actually starts running. We usually try to keep this under 15 minutes, but it really depends on how many machines we have in the pool.
[2] We run tests for our products on a matrix of operating systems and operating system versions. The term for an operating system x version combination in many release engineering shops is a platform. To add to this, the list of platforms we support varies across branches. For instance, if we're going to deprecate a platform, we'll let this change ride the trains to release.
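As a small illustration of the wait time definition in the first footnote, here is a Python sketch that measures the time from scheduling to start and checks it against the 15 minute target. The sample jobs are made up for the example, and `wait_time` and `fraction_within_target` are hypothetical helpers, not part of our scheduler.

```python
from datetime import datetime, timedelta

TARGET_WAIT = timedelta(minutes=15)  # the target mentioned in the first footnote


def wait_time(scheduled_at, started_at):
    """Time between a job entering the scheduler database and starting to run."""
    return started_at - scheduled_at


def fraction_within_target(jobs, target=TARGET_WAIT):
    """jobs is an iterable of (scheduled_at, started_at) datetime pairs."""
    waits = [wait_time(scheduled, started) for scheduled, started in jobs]
    if not waits:
        return 1.0  # no jobs means nothing waited too long
    return sum(w <= target for w in waits) / len(waits)


if __name__ == "__main__":
    base = datetime(2015, 3, 20, 9, 0)
    # Made-up sample data: waits of 4, 15, 40 and 9 minutes.
    sample_jobs = [
        (base, base + timedelta(minutes=m)) for m in (4, 15, 40, 9)
    ]
    print(fraction_within_target(sample_jobs))
```

Tracking the fraction of jobs under the target, rather than only an average, makes it easier to notice when a smaller pool starts to hurt a minority of jobs.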

Further reading
Bug 1121175: [Tracking] Fix failing tests on Mac OSX 10.10 
Bug 1121199: Green up 10.10 tests currently failing on try 
Bug 1126493: rollout 10.10 tests in a way that doesn't impact wait times
Bug 1144206: investigate what is causing frequent talos failures on 10.10
Bug 1125998: Debug tests initially took 1.5-2x longer to complete on Yosemite


Why don't you just run these tests in the cloud?
  1. The Apple EULA severely restricts virtualization on Mac hardware. 
  2. I don't know of any major cloud vendors that offer the Mac as a platform.  Those that claim they do are actually renting racks of Macs on a dedicated per host basis.  This does not have the inherent scaling and associated cost saving of cloud computing.  In addition, the APIs to manage the machines at scale aren't there.
  3. We manage ~350 Mac minis.  We have more experience scaling Apple hardware than many vendors. Not many places run CI at Mozilla scale :-) Hopefully this will change and we'll be able to scale testing on Mac products like we do for Android and Linux in a cloud.


Mozilla pushes - February 2015

>> Tuesday, March 17, 2015

Here's February 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a JSON file.

Trends
Although February is a shorter month, the number of pushes was close to that recorded in the previous month. We had a higher average number of daily pushes (358) than in January (348).

Highlights
10015 pushes
358 pushes/day (average)
Highest number of pushes/day: 574 pushes on Feb 25, 2015
23.18 pushes/hour (highest)
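The daily averages quoted here are consistent with simply dividing the monthly total by the number of days in the month. The Python sketch below checks that arithmetic; it is an assumption about how the figure is derived, and `average_daily_pushes` is a hypothetical helper rather than the script that produces these reports.

```python
import calendar


def average_daily_pushes(total_pushes, year, month):
    """Divide a monthly push total by the number of days in that month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return total_pushes / days_in_month


if __name__ == "__main__":
    # Totals as reported in the February and January 2015 posts.
    print(round(average_daily_pushes(10015, 2015, 2)))  # February: 28 days
    print(round(average_daily_pushes(10798, 2015, 1)))  # January: 31 days
```

This is why a shorter month with fewer total pushes can still show a higher daily average.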

General Remarks
Try had around 46% of all the pushes
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes

Records
August 2014 was the month with most pushes (13,090 pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes 






Release Engineering special issue now available

>> Wednesday, February 25, 2015

The release engineering special issue of IEEE Software was published yesterday. (Download PDF here). This issue focuses on the current state of release engineering, from both an industry and research perspective. Lots of exciting work happening in this field!

I'm interviewed in the roundtable article on the future of release engineering, along with Chuck Rossi of Facebook and Boris Debic of Google. It includes interesting discussions on the current state of release engineering at organizations that scale large numbers of builds and tests, and release frequently. The challenges of mobile releases versus web deployments are discussed as well. Finally, there is a discussion of how to find good release engineers, and what the future may hold.

Thanks to the other guest editors on this issue - Stephany Bellomo, Tamara Marshall-Klein, Bram Adams, Foutse Khomh and Christian Bird - for all their hard work that made this happen!


As an aside, when I opened the issue, the image on the front cover made me laugh.  It's reminiscent of the cover on a mid-century science fiction anthology.  I showed Mr. Releng and he said "Robot birds? That is EXACTLY how I pictured working in releng."  Maybe it's meant to represent that we let software fly free.  In any case, I must go back to tending the flock of robotic avian overlords.


Mozilla pushes - January 2015

>> Friday, February 13, 2015

Here's January 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a JSON file.

Trends
We're back to regular volume after the holidays. Also, it's really cold outside in some parts of the Mozilla world. Maybe committing code > going outside.


Highlights
10798 pushes
348 pushes/day (average)
Highest number of pushes/day: 562 pushes on Jan 28, 2015
18.65 pushes/hour (highest)

General Remarks
Try had around 42% of all the pushes
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 24% of all the pushes

Records
August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes 





Reminder: Releng 2015 submissions due Friday, January 23

>> Wednesday, January 21, 2015

Just a reminder that submissions for the Releng 2015 conference are due this Friday, January 23. 

It will be held on May 19, 2015 in Florence, Italy.

If you've done recent work like

  • migrating your build or test pipeline to the cloud
  • switching to a new build system
  • migrating to a new version control system
  • optimizing your configuration management system or switching to a new one
  • implementing continuous integration for mobile devices
  • reducing end-to-end build times
  • or anything else build, release, configuration and test related
we'd love to hear from you.  Please consider submitting a talk!

In addition, if you have colleagues that work in this space that might have interesting topics to discuss at this workshop, please forward this information. I'm happy to talk to people about the submission process or possible topics if there are questions.

Il Duomo di Firenze by ©eddi_07, Creative Commons by-nc-sa 2.0


Sono nel comitato che organizza la conferenza Releng 2015 che si terrà il 19 Maggio 2015 a Firenze. La scadenza per l’invio dei paper è il 23 Gennaio 2015.

http://releng.polymtl.ca/RELENG2015/html/index.html

se avete competenze in:
  • migrazione del sistema di build o dei test nel cloud
  • aggiornamento del processo di build
  • migrazione ad un nuovo sistema di version control
  • ottimizzazione o aggiornamento del configuration management system
  • implementazione di un sistema di continuous integration per dispositivi mobili
  • riduzione dei tempi di build
  • qualsiasi cambiamento che abbia migliorato il sistema di build/test/release
e volete discutere della vostra esperienza, inviateci una proposta di talk!

Per favore inoltrate questa richiesta ai vostri colleghi e alle persone interessate a questi argomenti. Nel caso ci fossero domande sul processo di invio o sui temi di discussione, non esitate a contattarmi.

(Thanks Massimo for helping with the Italian translation).

More information
Releng 2015 web page
Releng 2015 CFP now open


Mozilla pushes - December 2014

>> Thursday, January 08, 2015


Here's December 2014's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a JSON file.

Trends
There was a low number of pushes this month.  I expect this is due to the Mozilla all-hands in Portland in early December where we were encouraged to meet up with other teams instead of coding :-) and the holidays at the end of the month for many countries.
As a side note, in 2014 we had a total of 124,423 pushes, compared to 79,233 in 2013, which represents a growth rate of 57% this year.
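For anyone who wants to check the year-over-year figure, the growth rate is the increase divided by the previous year's total. A minimal Python sketch, with `growth_rate` as a hypothetical helper name:

```python
def growth_rate(previous, current):
    """Relative growth from one period to the next, as a fraction."""
    return (current - previous) / previous


if __name__ == "__main__":
    # Yearly push totals from the paragraph above: 2013 and 2014.
    print(f"{growth_rate(79233, 124423):.0%}")
```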

Highlights
7836 pushes
253 pushes/day (average)
Highest number of pushes/day: 706 pushes on Dec 17, 2014
15.25 pushes/hour (highest)

General Remarks
Try had around 46% of all the pushes
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 23% of all the pushes

Records
August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes 








Releng 2015 CFP now open

>> Thursday, December 11, 2014

Florence, Italy.  Home of beautiful architecture.

Il Duomo di Firenze by ©runner310, Creative Commons by-nc-sa 2.0


Delicious food and drink.

Panzanella by © Pete Carpenter, Creative Commons by-nc-sa 2.0

Caffè ristretto by © Marcelo César Augusto Romeo, Creative Commons by-nc-sa 2.0


And next May, release engineering :-)

The CFP for Releng 2015 is now open. The deadline for submissions is January 23, 2015. It will be held on May 19, 2015 in Florence, Italy and co-located with ICSE 2015. We look forward to seeing your proposals about the exciting work you're doing in release engineering!

If you have questions about the submission process or anything else, please contact any of the program committee members. My email is kmoir and I work at mozilla.com.

