Fri, 17 May 2019
How a good organization handles stuff
We try to push code from development to production most days, and responsibility for overseeing this is given to a “pushmeister”. This job rotates and yesterday it fell to me. There are actually two pushmeisters, one in the morning to get the code from development to the staging environment and into the hands of the QA team, and then an afternoon pushmeister who takes over resolving the problems found by QA and getting the code from staging to production. ZipRecruiter is in a time zone three hours behind mine, so I always take the morning duty. This works out well, because I get into work around 05:30 Pacific time and can get an early start.

A large part of the morning pushmeister's job is to look at what test failures have been introduced overnight, and either fix them or track down the guilty parties and find out what needs to be done to get them fixed. The policy is that the staging release is locked down as soon as possible after 09:00 Pacific time. No regular commits are accepted into the day's release once the lockdown has occurred. At that point the release is packaged and sent to the staging environment, which is as much as possible like production. It is automatically deployed, and the test suite is run on it. This takes about 90 minutes total. If the staging deployment starts late, it throws off the whole schedule, and the QA team might have to stay late. The QA people are brave martyrs, and I hate to make them stay late if I can help it.

Since I get in at 05:30, I have a great opportunity: I can package a staging release and send it for deployment without locking staging, and find out about problems early on. Often there is some test that fails in staging that wasn't failing in dev. This time there was a more interesting problem: the deployment itself failed! Great to find this out three hours ahead of time.

The problem in this case was some process I didn't know about failing to parse some piece of Javascript I didn't know about. But the Git history told me that that Javascript had been published the previous day and who had published it, so I tracked down the author, Anthony Aardvark, to ask about it. What we eventually determined was, Anthony had taken some large,
duplicated code out of three web pages, turned it into a library in a separate file.
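The post doesn't show Anthony's actual code, but the general shape of that refactoring is familiar. A minimal sketch, with entirely made-up file and function names, might look like this: logic that had been pasted into three page scripts moves into one shared module that each page imports.

```typescript
// Hypothetical sketch only -- the real code isn't shown in the post.
// Before: each of three page scripts carried its own copy of this logic.
// After: it lives in one shared library file that the pages import.

// lib/form-validation.ts  (the new shared library; name invented)
export function validateEmail(value: string): boolean {
  // Simplified placeholder check, not the production rule.
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value);
}

export function markInvalid(field: HTMLInputElement, message: string): void {
  field.classList.add("invalid");
  field.setCustomValidity(message);
}

// pages/signup.ts  (one of the three pages; the other two do the same)
import { markInvalid, validateEmail } from "../lib/form-validation";

const email = document.querySelector<HTMLInputElement>("#signup-email");
if (email && !validateEmail(email.value)) {
  markInvalid(email, "Please enter a valid email address.");
}
```

A change like this also introduces a new file for the build and deployment tooling to process, which is apparently where the parse failure described above crept in.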
After a couple of attempts to fix Anthony's code myself, I reverted it. 09:00 came, and I locked down the staging release on time, with Anthony's branch neatly reverted. The deployment went out on time and I handed things over to the afternoon pushmeister in good order.

The following day I checked back in the front-end Slack channel and saw they had had a discussion about how this problem could have been detected sooner. They were reviewing a set of changes which, once deployed, would prevent the problem from recurring.

What went right here? Pretty much everything, I think. I had a list of details, but I can sum them all up: everyone did their job.
Also, nobody yelled or lost their temper. Problems come up and we solve them. I do my job and I can count on my co-workers to help me. Everyone does their job, and people don't lose their tempers. We have a great company.