BMJ Hosting Migration Project

BzzZzzZZzz… *crackle* ffssssssshhhh hello? @####! Krrrrrhhhhhz …-…-  hello? *%*%*drrdrrrr HELLO? &*£$%ggnng This is the Ops team calling fzztttsshshhhss Hello, are you receiving?

[sounds of distant meteoroids colliding, dust clouds forming into solar systems, etc]

Yes. It has felt a little like that. Like we’ve been on a mission to a distant galaxy with the hope of finding a new home — and the mission of rendering any such find habitable. Exhausting, all-consuming, isolating . . .

The good news is: we’re back, we did find a suitable planet, we’ve rendered it habitable, and we’ve transported the majority of BMJ there already!

But enough of this romantic noodling . . . [Yes: get on with it – Ed]

It was almost exactly a year ago, after a rigorous selection process, that we decided to go with YYYY as our new hosting partner and migrate away from our hosts of the last decade, XXXX. By August, we had signed the contract and were working with YYYY on defining the platform build. At the same time, we were examining the state of our systems and starting to plan how we would handle the migration.

Nightmare. [Is there any way we can make that word drip with blood?] “Migration” is a neat little word behind which hides a mass of seething alien tentacles. A mass of seething alien tentacles which someone had left behind the sofa ten years ago in a rambling old house that has been continually re-organised by a succession of bored housekeepers on amphetamines, and many of whose rooms were locked and abandoned years ago. I would say, “you get the picture”, only I really hope you don’t, as it would burn into your retina and haunt you for years to come. [I thought we’d done with the florid? – Ed]

Now, I’m worried that all this metaphor might be dismissed as mere hyperbole — let’s just take a peek at the literal: we’ve been running our online products in XXXX since their inception. Does anyone recall the original Clinical Evidence site? Or Learning? These applications have improved and transformed over the years, but they have also clung onto tiny bits of their past and remained dependent on them. This past is undocumented, or the documentation is lost, fragmented, and/or woefully out of date. Alongside this, you have the fact that BMJ products have been produced somewhat organically over the years and, although there was some (though far from total) alignment of platform, there was barely any consistency. Over time, these separate products recognised the need to communicate between themselves and, as each need arose, a solution was designed to accommodate it. The wheel wasn’t re-invented every single time, but we do have a whole boatload[sic] of wheel designs. Then there are publication routes into these products and data feeds out of them — all similarly consistent; all held together with Gaffa tape and bits of string.

Incredible. But what amazes me more is that we’ve done it. And not just done it, but done it without taking all our products offline for a weekend (as we had to for the migration to XXXX’s new datacentre 2½ years ago), without having to take the heart-in-mouth risk of moving everything at once, and [sound of distant trumpets heralding a golden dawn] with zero downtime.

So, how did we manage this? Well, there were three key things: automation; subtle re-architecting; and the removal of technical debt. One early, significant win was the move away from our “monolithic” Oracle database and the adoption of individual database servers for each application. That was a lot of work for the Ops and Dev teams but resulted in a significantly more flexible architecture. A second key enabler was the decision to move to a “Shared Nothing” architecture, where each product would have its own set of application servers (where previously each appserver would host a number of products) and where code and other files would be deployed separately to each appserver (rather than sitting on a shared drive). Again, this gained us a huge amount of flexibility.

Automation has been fundamental to the migration project, which, to some extent, we’ve used as a lever to get automation ‘properly’ in place across our estate. Automation, in this case, means automatically taking a ‘new’ server and installing and configuring all the software needed to turn it into a bit of BMJ infrastructure — maybe a Best Practice application server, or a database for Learning, or an email server. There’s quite a lot of up-front work to plan and structure the system (we use “Puppet”, a popular open-source automation tool) and to define the parts that we require but, once that’s done, deploying infrastructure becomes little more than checking a configuration file into source control. No more logging on to remote computers and installing stuff by hand. In fact, if we were to log into a server now and change something, we would find that, within 30 minutes, Puppet had removed that change and put things back to the way they “should be”.
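For the curious, the idea behind this kind of tooling can be sketched in a few lines of Python. This is purely an illustration of the “declare desired state, then converge towards it” principle (real Puppet manifests look nothing like this, and the package and service names below are made up for the example):

```python
# Toy model of declarative configuration management (illustration only).
# Desired state is just data; a converge run makes reality match it,
# which is also what quietly undoes any manual drift.

DESIRED = {
    "packages": {"nginx", "postgresql"},   # hypothetical example packages
    "services": {"nginx": "running"},      # hypothetical example service
}

def plan(actual, desired):
    """Diff actual state against desired state; return the actions needed."""
    actions = []
    for pkg in sorted(desired["packages"] - actual["packages"]):
        actions.append(("install", pkg))
    for svc, state in desired["services"].items():
        if actual["services"].get(svc) != state:
            actions.append(("service", svc, state))
    return actions

def converge(actual, desired):
    """Apply the plan, like a scheduled agent run every 30 minutes."""
    for action in plan(actual, desired):
        if action[0] == "install":
            actual["packages"].add(action[1])
        elif action[0] == "service":
            actual["services"][action[1]] = action[2]
    return actual

# A fresh server converges to the desired state:
server = {"packages": set(), "services": {}}
converge(server, DESIRED)

# Someone logs in and stops a service by hand...
server["services"]["nginx"] = "stopped"

# ...and the next scheduled run puts it back:
converge(server, DESIRED)
```

The important property is idempotence: running `converge` against a server that is already in the desired state produces an empty plan and changes nothing, so the same run can be repeated safely on a schedule.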

And, thirdly, a lot of effort was put into the removal of technical debt: workarounds and “expedient” fixes that had been put in place over the years and which increased maintenance overhead, increased the risk of failure, and decreased the ability to understand and manipulate our own systems.  

The Ops team have done a fantastic job — it’s really something to have such a highly-skilled, competent bunch of people working on this. Certainly helps me sleep at night. The Devs, too, have rallied behind the migration and put a great deal into helping push it through. The whole process has, in fact, been a great story of emerging DevOps collaboration and has brought the teams much closer together and lent a great deal of understanding and transparency to working practices, and to the way in which our platforms, technology stacks, and applications are engineered. And that knowledge has been captured and preserved in our TechWiki.

What next? Well, although we have (almost) finished migrating away from XXXX, there is still a great deal to do. We’ll be moving our development and test environments out of BMA House and into YYYY and AWS (this will be a breeze, given the automation work noted above); then we’ll be rearranging components to allow for more cost-effective use of resources. And while all of this moves ahead, we’ll be continuing to improve levels of automation and working with the Dev teams to establish how we can tie all this together into a seamless, automated pipeline that renders the entire department redundant! [Er, are you sure that’s what you mean? – Ed]