Scaling capacity while saving cash

>> Wednesday, November 12, 2014

There was a very interesting release engineering summit this Monday held in concert with LISA in Seattle.  I was supposed to fly there this past weekend so I could give a talk on Monday, but late last week I became ill and was unable to go.  That was very disappointing because the summit looked really great and I was looking forward to meeting the other release engineers and learning about the challenges they face.

Scale in the Market  ©Clint Mickel, Creative Commons by-nc-sa 2.0

Although I didn't have the opportunity to give the talk in person, the slides for it are available on slideshare and my mozilla people account.  The talk describes how we scaled our continuous integration infrastructure on AWS to handle double the number of pushes it handled in early 2013, all while reducing our monthly AWS bill by 2/3.

Cost per push from October 2012 until October 2014.  This does not include costs for on-premises equipment.  It reflects our monthly AWS bill divided by the number of monthly pushes (commits).
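As a rough illustration of the metric in the chart, here is a minimal sketch of the cost-per-push calculation.  It only uses the two relative figures mentioned in the talk summary (pushes doubled, bill reduced by 2/3) and assumes they refer to the same pair of months, which the post does not state.  The function name is hypothetical and no real billing data is used.

```python
from fractions import Fraction


def cost_per_push(monthly_bill, monthly_pushes):
    """Monthly AWS bill divided by the number of pushes in that month."""
    if monthly_pushes <= 0:
        raise ValueError("monthly_pushes must be positive")
    return monthly_bill / monthly_pushes


# Work in relative units: the early-2013 bill and push count are both 1.
baseline = cost_per_push(Fraction(1), Fraction(1))

# Assumption: the bill dropped by 2/3 while the number of pushes doubled.
later = cost_per_push(Fraction(1) - Fraction(2, 3), Fraction(2))

# Cost per push relative to the baseline month.
relative_cost = later / baseline
print(relative_cost)
```

Under those assumptions the cost per push ends up at one sixth of the starting value, since a third of the bill is spread over twice as many pushes.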

Thank you to Dinah McNutt and the other program committee members for organizing this summit.  I look forward to watching the talks once they are online.


Mozilla pushes - October 2014

Here's the October 2014 monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

We didn't have a record-breaking month in terms of the number of pushes; however, we did have a daily record on October 8 with 715 pushes.

12821 pushes, up slightly from the previous month
414 pushes/day (average)
Highest number of pushes/day: 715 pushes on October 8
22.5 pushes/hour (average)
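The pushes/day averages in these monthly reports are consistent with dividing the monthly total by the number of days in the month.  The sketch below checks that, using the totals reported on this page for July through October 2014.  The function and variable names are hypothetical, and the pushes/hour figure is left out because the reports do not say how it is computed.

```python
import calendar

# Monthly totals as reported in the July to October 2014 posts.
MONTHLY_PUSHES = {
    (2014, 7): 12755,
    (2014, 8): 13090,
    (2014, 9): 12267,
    (2014, 10): 12821,
}


def pushes_per_day(year, month, pushes):
    """Average pushes per calendar day for the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return pushes / days_in_month


# Rounded to whole pushes, the way the reports present them.
daily_averages = {
    (year, month): round(pushes_per_day(year, month, pushes))
    for (year, month), pushes in MONTHLY_PUSHES.items()
}

for (year, month), average in sorted(daily_averages.items()):
    print(f"{calendar.month_name[month]} {year}: {average} pushes/day")
```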

General Remarks
Try keeps on having around 39% of all the pushes, and Gaia-Try has about 31%.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 21% of all the pushes.

August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
October 8, 2014 had the highest number of pushes in one day with 715 pushes


Mozilla pushes - September 2014

>> Monday, October 27, 2014

Here's September 2014's monthly analysis of the pushes to our Mozilla development trees.
You can load the data as an HTML page or as a json file.

Surprise!  No records were broken this month.

12267 pushes
409 pushes/day (average)
Highest number of pushes/day: 646 pushes on September 10, 2014
22.6 pushes/hour (average)

General Remarks
Try has around 36% of the pushes and Gaia-Try comprises about 32%.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes.

August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
August 20, 2014 had the highest number of pushes in one day with 690 pushes


Release Engineering in the classroom

The second week of October, I had the pleasure of presenting lectures on release engineering to university students in Montreal as part of the PLOW lectures at École Polytechnique de Montréal.    Most of the students were MSc or PhD students in computer science, with a handful of postdocs and professors in the class as well. The students came from Montreal area universities and many were international students. The PLOW lectures consisted of several invited speakers from various universities and industry spread over three days.

View looking down from the university

Université de Montréal administration building

École Polytechnique building.  Each floor is painted a different colour to represent a different layer of the earth.  So the ground floor is red, the next orange and finally green.

The first day, Jack Jiang from York University gave a talk about software performance engineering.
The second day, I gave a lecture on release engineering in the morning.  The rest of the day we did a lot of labs to configure a Jenkins server to build and run tests on an open source project.  Earlier that morning, I had set up m3.large instances for the students on Amazon that they could ssh into to conduct their labs.  Along the way, I talked about some release engineering concepts.  It was really interesting and I learned a lot from their feedback.  Many of the students had not been exposed to release engineering concepts so it was fun to share the information.

Several students came up to me during the breaks and said "So, I'm doing my PhD in release engineering, and I have several questions for you" which was fun.  Also, some of the students were making extensive use of code bases for Mozilla or other open source projects so that was interesting to learn more about.  For instance, one research project was looking at the evolution of multi-threading in a Mozilla code base, and another student was conducting Bugzilla comment sentiment analysis.  Are angry bug comments correlated with fewer bug fixes?  Looking forward to the results of this research!

I ended the day by providing two challenge exercises to the students that they could submit answers to.  One exercise was to set up a build pipeline in Jenkins for another open source project.  The other challenge was to use the Jenkins REST API to query the Apache project's Jenkins server and present some statistics on their build history.  The results were pretty impressive!
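For readers who want to try the second challenge, here is a minimal sketch of one way to approach it.  It is not a student submission.  It assumes the standard Jenkins JSON endpoint (`/job/<name>/api/json` with a `tree` filter) and the usual `result` and `duration` (milliseconds) build fields.  The base URL and job name are placeholders you would supply yourself, and the function names are hypothetical.

```python
import json
from collections import Counter
from urllib.parse import quote
from urllib.request import urlopen


def fetch_builds(base_url, job_name):
    """Fetch build records for one job from a Jenkins server's JSON API."""
    query = "tree=builds[number,result,duration]"
    url = f"{base_url}/job/{quote(job_name)}/api/json?{query}"
    with urlopen(url) as response:
        return json.load(response)["builds"]


def summarize_builds(builds):
    """Compute simple statistics over a list of Jenkins build records."""
    # Builds that are still running report a null result, so skip them.
    finished = [b for b in builds if b.get("result") is not None]
    results = Counter(b["result"] for b in finished)
    total = len(finished)
    total_ms = sum(b["duration"] for b in finished)
    return {
        "total": total,
        "by_result": dict(results),
        "success_rate": results["SUCCESS"] / total if total else 0.0,
        "mean_duration_s": total_ms / total / 1000 if total else 0.0,
    }


# Example with placeholder values (not run here because it needs network access):
# stats = summarize_builds(fetch_builds("https://jenkins.example.org", "my-job"))
```

Keeping the fetch and the statistics in separate functions makes it possible to test the summary against saved JSON without touching a live server.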

My slides are on GitHub and the readme file describes how I set up the Amazon instances so Jenkins and some other required packages were installed beforehand.  Please use them and distribute them if you are interested in teaching release engineering in your classroom.

Lessons I learned from this experience:
  • Computer science classes focus on writing software, but not necessarily building it in a team environment.  So complex branching strategies are not necessarily a familiar concept to some students.  Of course, this depends on the previous work experience of the students and the curriculum at the school they attend.  One of the students said to me "This is cool.  We write code, but we don't build software".
  • Concepts such as building a pipeline for compilation, correctness/performance/regression testing, packaging and deployment can also be unfamiliar.  As I said in the class, the work of the release engineer starts when the rest of the development team thinks they are done :-)
  • When I was giving a lecture and people pointed out typos or asked for clarification, I'd always update the repository and ask the students to pull a new version.  I really liked this because my slides were in reveal.js and I didn't have to export a new PDF and redistribute.  Instant bug fixes!
  • Add bonus labs to the material so students who are quick to complete the exercises have more to do while the other students complete the original material.  Your classroom will have people with wildly different experience levels.
The third day there was a lecture by Michel Dagenais of Polytechnique Montréal on tracing heterogeneous cloud instances using a tracing framework for Linux.  The Eclipse Trace Compass project also made an appearance in the talk.  I always like to see Eclipse projects highlighted.  One of his interesting points was that none of the companies that collaborate on this project wanted to sign a bunch of IP agreements so they could collaborate on this project behind closed doors.  They all wanted to collaborate via an open source community and source code repository.  Another thing he emphasized was that students should make their work available on the web, via GitHub or other repositories, so they have a portfolio of work available.  It was fantastic to see him promote the idea of students being involved in open source as a way to help their job prospects when they graduate!

Thank you Foutse and  Bram  for the opportunity to lecture at your university!  It was a great experience!  Also, thanks Mozilla for the opportunity to do this sort of outreach to our larger community on company time!

Also, I have a renewed respect for teachers and professors.  Writing these slides took so much time.  Many long nights for me, especially in the days leading up to the class.  Kudos to all of you who teach every day.

The slides are on GitHub and the readme file describes how I set up the Amazon instances for the labs.


Beyond the Code 2014: a recap

I started this blog post about a month ago and didn't finish it because well, life is busy.  

I attended Beyond the Code last September 19.  I heard about it several months ago on twitter.  A one-day conference about celebrating women in computing, in my home town, with a fantastic speaker line up?  I signed up immediately.  In the opening remarks, we were asked for a show of hands to show how many of us were developers, in design, product management, or students and there was a good representation from all those categories.  I was especially impressed to see the number of students in the audience; it was nice to see so many of them taking time out of their busy schedule to attend.

View of the Parliament Buildings and Chateau Laurier from the MacKenzie street bridge over the Rideau Canal
Ottawa Conference Centre, location of Beyond the Code
There were seven speakers, three workshop organizers, a lunch time activity, and a panel at the end.  The speakers were all women.  The speakers were not all white women or all heterosexual women.  There were many young women, not all industry veterans :-) like me.  To see this level of diversity at a tech conference filled me with joy.  Almost every conference I go to is very homogeneous in the make up of the speakers and the audience.  To see ~200 tech women at a conference and 10% men (thank you for attending :-) was quite a role reversal.

I was completely impressed by the caliber of the speakers.  They were simply exceptional.

The conference started out with Kronda Adair giving a talk on Expanding Your Empathy.  One of the things that struck me from this talk was that she talked about how everyone lives in a bubble, and because of privilege they don't see everything that others see.  She gave the example of how privilege is like a browser, and colours how we see the world.  For a straight white guy a web page looks great when they're running the latest Chrome on Mac OS X.  For a middle class black lesbian, the web page doesn't look as great because it's like she's running IE7.  There is less inherent privilege.  For a "differently abled trans person of color" the world is like running IE6 in quirks mode.  This was a great example.  She also gave a shout out to the Ascend Project which she and Lukas Blakk are running in Mozilla's Portland office.  Such an amazing initiative.

The next speaker was Bridget Kromhout who gave a talk about Platform Ops in the Public Cloud.
I was really interested in this talk because we do a lot of scaling of our build infrastructure in AWS and wanted to see if she had faced similar challenges.  She works at DramaFever, which she described as Netflix for Asian soap operas.  The most interesting thing to me was the fact that they use all AWS regions to host their instances, because they want their users to be able to download from a region as geographically close to them as possible.  At Mozilla, we only use a couple of AWS regions, but more instances than DramaFever, so this was an interesting contrast in the services used.  In addition, the monitoring infrastructure they use was quite complex.  Her slides are here.

I was going to summarize the rest of the speakers but Melissa Jean Clark did an exceptional job on her blog.  You should read it!

Thank you Shopify for organizing this conference.  It was great to meet so many brilliant women in the tech industry!  I hope there is an event next year too!


Mozilla Releng: The ice cream

>> Wednesday, September 17, 2014

A week or so ago, I was commenting in IRC that I was really impressed that our interns had such amazing communication and presentation skills.  One of the interns, John Zeller, said something like "The cream rises to the top", to which I replied "Releng: the ice cream of CS".  From there, the conversation went on to discuss what would be the best ice cream flavour to make that would capture the spirit of Mozilla releng.  The consensus at the end was that Irish Coffee (coffee with whisky) with cookie dough chunks was the favourite.  Because a lot of people on the team like coffee, whisky makes it better and who doesn't like cookie dough?

I made this recipe over the weekend with some modifications.  I used the coffee recipe from the Perfect Scoop.  After it was done churning in the ice cream maker,  instead of whisky, which I didn't have on hand, I added Kahlua for more coffee flavour.  I don't really like cookie dough in ice cream but cooked chocolate chip cookies cut up with a liberal sprinkling of Kahlua are tasty.

Diced cookies sprinkled with Kahlua

Ice cream ready to put in freezer

Finished product
I have to say, it's quite delicious :-) If open source ever stops being fun, I'm going to start a dairy empire.  Not really.  Now back to Bugzilla...


Mozilla pushes - August 2014

>> Wednesday, September 10, 2014

Here's August 2014's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.

It was another record breaking month.  No surprise here!


  • 13090 pushes
    • new record
  • 422 pushes/day (average)
    • new record
  • Highest number of pushes/day: 690 pushes on August 20.  This same day corresponded with our first day where we ran over 100,000 test jobs.
    • new record
  • 23.12 pushes/hour (average)

General Remarks
Both Try and Gaia-Try have about 36% each of the pushes.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 21% of all the pushes.

August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 422 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
August 20, 2014 had the highest number of pushes in one day with 690 pushes


Mozilla pushes - July 2014

>> Friday, August 08, 2014

Here's the July 2014 monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.
Like every month for the past while, we had a new record number of pushes. In reality, given that July is one day longer than June, the numbers are quite similar.


  • 12,755 pushes
    • new record
  • 411 pushes/day (average)
  • Highest number of pushes/day: 625 pushes on July 3, 2014
  • 23.51 pushes/hour (average)
    • new record

General remarks
Try keeps on having around 38% of all the pushes.  Gaia-Try is in second place with around 31% of pushes.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes.

July 2014 was the month with most pushes (12,755 pushes)
June 2014 has the highest pushes/day average with 662 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
June 4th, 2014 had the highest number of pushes in one day with 662 pushes


Scaling mobile testing on AWS

>> Thursday, August 07, 2014

Running tests for Android at Mozilla has typically meant running on reference devices: physical devices that run jobs on our continuous integration farm via test harnesses.  However, this leads to the same problem that we have for other tests that run on bare metal.  We can't scale up our capacity without buying new devices, racking them, configuring them for the network and updating our configurations.  In addition, reference cards, rack mounted or not, are rather delicate creatures and have higher retry rates (tests fail due to infrastructure issues and need to be rerun) than those running on emulators (tests run on an Android emulator in a VM on bare metal or cloud).

Do Android's Dream of Electric Sheep?  ©Bill McIntyre, Creative Commons by-nc-sa 2.0
Recently, we started running Android 2.3 tests on emulators in AWS.  This works well for unit tests (correctness tests).  It's not really appropriate for performance tests, but that's another story.  The impetus behind this change was so we could decommission Tegras, the reference devices we used for running Android 2.2 tests.

We run many Linux-based tests, including Android emulators, on AWS spot instances.  Spot instances are AWS excess capacity that you can bid on.  If someone outbids the price you have paid for your spot instance, your instance can be terminated.  But that's okay because we retry jobs if they fail for infrastructure reasons.  The overall percentage of spot instances that are terminated is quite small.  The huge advantage to using spot instances is price.  They are much cheaper than on-demand instances, which has allowed us to increase our capacity while continuing to reduce our AWS bill.
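The retry idea can be sketched in a few lines.  This is only an illustration of the principle (rerun a job when the machine goes away, but not when a test genuinely fails).  It is not how Buildbot implements retries, and every name in it is hypothetical.

```python
class InfrastructureError(Exception):
    """The job died because its machine went away, not because a test failed."""


def run_with_retries(job, max_attempts=3):
    """Run job(), retrying only on infrastructure failures.

    Returns a (result, attempts_used) tuple. Any other exception, such as a
    genuine test failure, propagates immediately without a retry.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except InfrastructureError:
            # Give up once the retry budget is spent.
            if attempt == max_attempts:
                raise
```

Because only a small share of spot instances are terminated, a small retry budget like this rarely adds much to the total run time while still hiding the terminations from developers.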

We have a wide variety of unit tests that run on emulators for mobile on AWS.  We encountered an issue where some of the tests wouldn't run on the default instance type (m1.medium) that we use for our spot instances.  Given the number of jobs we run, we want to run on the cheapest AWS instance type where the tests will complete successfully.  At the time we first tested it, we couldn't find an instance type where certain CPU/memory intensive tests would run.  So when I first enabled Android 2.3 tests on emulators, I separated the tests so that some would run on AWS spot instances and the ones that needed a more powerful machine would run on our in-house Linux capacity.  But this change consumed all of the capacity of that pool and we had a very high number of pending jobs in that pool.  This meant that people had to wait a long time for their test results.  Not good.

To reduce the pending counts, we needed to buy some more in-house Linux capacity, try to run a selected subset of the tests that need more resources, or find a new AWS instance type where they would complete successfully.  Geoff from the ATeam ran the tests on the c3.xlarge instance type he had tried before and now it seemed to work.  In his earlier work the tests did not complete successfully on this instance type.  We are unsure as to the reasons why.  One of the things about working with AWS is that we don't have a window into the bugs that they fix at their end.  So this particular instance type didn't work before, but it does now.

The next step for me was to create a new AMI (Amazon machine image) that would serve as the "golden" version for instances that would be created in this pool.  Previously, we used Puppet to configure our AWS test machines but now we just regenerate the AMI every night via cron and this is the version that's instantiated.  The AMI was a copy of the existing Ubuntu64 image that we have but it was configured to run on the c3.xlarge instance type instead of m1.medium.  This was a bit tricky because I had to exclude regions where the c3.xlarge instance type was not available.  For redundancy (to still have capacity if an entire region goes down) and cost (some regions are cheaper than others), we run instances in multiple AWS regions.
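The region filtering described above can be sketched as follows.  The availability table and prices are made up for illustration (they are not real AWS data), the third region name is a placeholder, and the function name is hypothetical.  Our actual tooling is not shown here.

```python
def eligible_regions(instance_type, offerings, prices, keep=2):
    """Return up to `keep` regions that offer instance_type, cheapest first.

    offerings maps region -> set of instance types available there.
    prices maps region -> {instance type: hourly price}.
    Keeping more than one region preserves capacity if a region goes down.
    """
    regions = [
        region for region, types in offerings.items() if instance_type in types
    ]
    regions.sort(key=lambda region: prices[region][instance_type])
    return regions[:keep]


# Made-up availability and prices, for illustration only.
OFFERINGS = {
    "us-east-1": {"m1.medium", "c3.xlarge"},
    "us-west-2": {"m1.medium", "c3.xlarge"},
    "region-without-c3": {"m1.medium"},
}
PRICES = {
    "us-east-1": {"m1.medium": 0.02, "c3.xlarge": 0.05},
    "us-west-2": {"m1.medium": 0.02, "c3.xlarge": 0.04},
    "region-without-c3": {"m1.medium": 0.01},
}

print(eligible_regions("c3.xlarge", OFFERINGS, PRICES))
```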

Once I had the new AMI up that would serve as the template for our new slave class, I created a slave with the AMI and verified running the tests we planned to migrate on my staging server.  I also enabled two new Linux64 buildbot masters in AWS to service these new slaves, one in us-east-1 and one in us-west-2.  When enabling a new pool of test machines, it's always good to look at the load on the current buildbot masters and see if additional masters are needed so the current masters aren't overwhelmed with too many slaves attached.

After the tests were all green, I modified our configs to run this subset of tests on a branch (ash), enabled the slave platform in Puppet and added a pool of devices to this slave platform in our production configs.  After the reconfig deployed these changes into production, I landed a regular expression in watch_pending.cfg so that the new tst-emulator64-spot pool of machines would be allocated to the subset of tests and branch I enabled them on.  The script watches the number of pending jobs on AWS and creates instances as required.  We also have scripts to terminate or stop idle instances when we don't need them.  Why pay for machines when you don't need them now?  After the tests ran successfully on ash, I enabled running the tests on the other relevant branches.
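To make the idea concrete, here is a minimal sketch of matching pending jobs to instance pools with regular expressions.  The builder names and patterns are invented for illustration and do not reflect the actual contents or format of watch_pending.cfg.  Only the pool names come from this post.

```python
import re
from collections import Counter

# Hypothetical patterns: the first match wins, so the more specific one goes first.
BUILDER_POOLS = [
    (re.compile(r"^Android 2\.3 Emulator ash .*crashtest"), "tst-emulator64-spot"),
    (re.compile(r"^Android 2\.3 Emulator .*mochitest"), "tst-linux64-spot"),
]


def pool_for(builder_name):
    """Return the instance pool for a builder, or None if no pattern matches."""
    for pattern, pool in BUILDER_POOLS:
        if pattern.search(builder_name):
            return pool
    return None


def instances_to_start(pending_builders, idle_instances):
    """Count how many new instances each pool needs for the pending jobs.

    idle_instances maps pool -> number of already running idle instances,
    which are used up before any new instance is requested.
    """
    pools = (pool_for(name) for name in pending_builders)
    pending = Counter(pool for pool in pools if pool is not None)
    return {
        pool: max(0, count - idle_instances.get(pool, 0))
        for pool, count in pending.items()
    }
```

Subtracting idle instances before starting new ones reflects the same cost concern as stopping idle machines: capacity is only added when pending jobs cannot be absorbed by what is already running.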

Royal Border Bridge.  Also, release engineers love to see green builds and tests.  ©Jonathan Combe, Creative Commons by-nc-sa 2.0
The end result is that some Android 2.3 tests, such as mochitests, run on m1.medium (tst-linux64-spot instances).

And some Android 2.3 tests, such as crashtests, run on c3.xlarge (tst-emulator64-spot instances).


In enabling this slave class within our configs, we were also able to reuse it for some b2g tests which also faced the same problem where they needed a more powerful instance type for the tests to complete.

Lessons learned:
Use the minimum (cheapest) instance type required to complete your tests
As usual, test on a branch before full deployment
Scaling mobile tests doesn't mean more racks of reference cards

Future work:
Bug 1047467 c3.xlarge instance types are expensive, let's test running those tests on a range of instance types that are cheaper

Further reading:
AWS instance types 
Chris Atlee wrote about how we Now Use AWS Spot Instances for Tests
Taras Glek wrote How Mozilla Amazon EC2 Usage Got 15X Cheaper in 8 months
Rail Aliiev 
Bug 980519 Experiment with other instance types for Android 2.3 jobs 
Bug 1024091 Address high pending count in in-house Linux64 test pool 
Bug 1028293 Increase Android 2.3 mochitest chunks, for aws 
Bug 1032268 Experiment with c3.xlarge for Android 2.3 jobs
Bug 1035863 Add two new Linux64 masters to accommodate new emulator slaves
Bug 1034055 Implement c3.xlarge slave class for Linux64 test spot instances
Bug 1031083 Buildbot changes to run selected b2g tests on c3.xlarge
Bug 1047467 c3.xlarge instance types are expensive, let's try running those tests on a range of instance types that are cheaper


2014 USENIX Release Engineering Summit CFP now open

>> Monday, July 28, 2014

The CFP for the 2014 Release Engineering summit (Western edition) is now open.  The deadline for submissions is September 5, 2014 and speakers will be notified by September 19, 2014.  The program will be announced in late September.  This one day summit on all things release engineering will be held in concert with LISA, in Seattle on November 10, 2014. 

Seattle skyline © Howard Ignatius, Creative Commons by-nc-sa 2.0

From the CFP

"Suggestions for topics include (but are not limited to):
  • Best practices for release engineering
  • Practical information on specific aspects of release engineering (e.g., source code management, dependency management, packaging, unit tests, deployment)
  • Future challenges and opportunities in release engineering
  • Solutions for scalable end-to-end release processes
  • Scaling infrastructure and tools for high-volume continuous integration farms
  • War and horror stories
  • Metrics
  • Specific problems and solutions for specific markets (mobile, financial, cloud)
URES '14 West is looking for relevant and engaging speakers and workshop facilitators for our event on November 10, 2014, in Seattle, WA. URES brings together people from all areas of release engineering—release engineers, developers, managers, site reliability engineers, and others—to identify and help propose solutions for the most difficult problems in release engineering today."

War and horror stories.  I like to see that in a CFP.  Stories about how you overcame problems with infrastructure and tooling to ship software are the best kind.  They make people laugh.  Maybe cry as they realize they are currently living in that situation.  Good times.  Also, I think talks around scaling high volume continuous integration farms will be interesting.  Scaling issues are a lot of fun and expose many issues you don't see when you're only running a few builds a day.

If you have any questions surrounding the CFP, I'm happy to help as I'm on the program committee.   (my irc nick is kmoir (#releng) as is my email id at

